Measuring Code Generation Abilities of GPT-4 in 10+ Languages

Recap: Coding with ChatGPT-4

Over the past week, we have seen many examples of GPT-4’s code generation abilities. Here’s a quick recap with three of my favorite examples.

Evaluating Code Generation in 10+ Programming Languages

After gaining access to GPT-4, I was thrilled to put it to the test with the code generation benchmarks Multi-lingual HumanEval and MBXP. The evaluation covered a wide range of programming languages and yielded impressive results, helping to quantify the model’s performance in each language.

Overall, the performance improvement over the previous models is expected. However, we observed much higher scores than the numbers reported in the GPT-4 paper (more details below).

| Multi-lingual HumanEval | code-davinci-002 | text-davinci-003 | ChatGPT-3.5 (1 shot) | ChatGPT-4 (1 shot) |
| --- | --- | --- | --- | --- |
| Python | 46.3% | 56.7% | 73.2% | 83.5% |
| Java | 49.1% | 52.2% | 60.9% | 78.3% |
| JavaScript | 51.6% | 58.4% | 66.5% | 71.4% |
| TypeScript | 50.9% | 55.9% | 64.6% | 78.9% |
| C# | 45.3% | 50.9% | 32.3% | 6.8% |
| Go | 21.9% | 35.0% | 34.4% | 50.0% |
| Kotlin | 39.8% | 50.3% | 59.0% | 68.9% |
| PHP | 52.8% | 58.4% | 63.4% | 74.5% |
| Perl | 36.0% | 34.2% | 55.3% | 68.3% |
| Ruby | 39.8% | 62.1% | 13.0% | 80.7% |
| Scala | 45.3% | 46.0% | 57.1% | 28.0% |
| Swift | 24.8% | 39.1% | 48.4% | 61.5% |
| Average | 42.0% | 49.9% | 52.34% | 62.58% |
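The numbers above are execution-based pass rates: a completion counts as correct only if it passes the benchmark’s unit tests. For reference, a common way to report such results is the unbiased pass@k estimator from the Codex paper; the sketch below is my own illustration, not the exact evaluation code behind this table.

```
# Minimal sketch of the unbiased pass@k estimator from the Codex paper
# (illustration only, not the exact evaluation code behind the table above).
# Given n sampled completions of which c pass the unit tests, estimate the
# probability that at least one out of k samples passes.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset must contain at least one passing sample
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With greedy decoding, pass@1 reduces to n = 1 and c in {0, 1}.
```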

Finding Highlights

Here are some of the key observations.

Few-shot prompting can matter a lot for code generation

|  | ChatGPT-3.5 (0 shot) | ChatGPT-4 (0 shot) | GPT-4 (0 shot, reported) |
| --- | --- | --- | --- |
| Python | 62.2% | 65.2% | 67.0% |
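The 1-shot numbers in the main table come from prepending a single solved example to each problem, where the example’s completion contains only the function body (see the completion example later in this post). The sketch below shows one way such a prompt can be assembled; the helper name and the in-context example are hypothetical, not the exact prompt used in our evaluation.

```
# Hypothetical sketch of 1-shot prompt construction: a solved example
# (signature plus function body only) is prepended to the target problem,
# nudging the model to emit just the function body. The example problem and
# the helper name are illustrative, not the exact prompt used here.
EXAMPLE_PROBLEM = (
    "def add(a: int, b: int) -> int:\n"
    '    """Return the sum of a and b.\n'
    "    >>> add(1, 2)\n"
    "    3\n"
    '    """\n'
)
EXAMPLE_SOLUTION = "    return a + b\n"

def build_one_shot_prompt(target_problem: str) -> str:
    """Concatenate the solved example with the target function signature."""
    return EXAMPLE_PROBLEM + EXAMPLE_SOLUTION + "\n\n" + target_problem
```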

GPT-4 outperformed its predecessors, the GPT-3 Davinci models as well as GPT-3.5 (ChatGPT), in most programming languages.

Examples

Prompt

```
from typing import List

def separate_paren_groups(paren_string: str) -> List[str]:
    """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
    separate those group into separate strings and return the list of those.
    Separate groups are balanced (each open brace is properly closed) and not nested within each other
    Ignore any spaces in the input string.
    >>> separate_paren_groups('( ) (( )) (( )( ))')
    ['()', '(())', '(()())']
    """
```

Completion

```
def separate_paren_groups(paren_string: str) -> List[str]:
    """
    Separates groups of nested parentheses into separate strings and returns the list of those strings.
    
    Args:
    paren_string: A string containing multiple groups of nested parentheses.
    
    Returns:
    A list of strings, where each string is a separate group of nested parentheses.
    
    Example:
    >>> separate_paren_groups('( ) (( )) (( )( ))')
    ['()', '(())', '(()())']
    """
    stack = []
    result = []
    current = ''
    for char in paren_string:
        if char == '(':
            stack.append(char)
        elif char == ')':
            stack.pop()
        current += char
        if not stack:
            result.append(current.replace(' ', ''))
            current = ''
    return result
```


The syntax and the code do appear correct in this case. However, repeating the function signature makes the completion more susceptible to errors, for example if the signature is restated incorrectly or the required import is not reproduced.
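One way an evaluation harness can guard against this is to detect a restated signature and keep only the body before appending the completion to the original prompt. The helper below is an illustrative heuristic of this idea, not the actual post-processing used in our evaluation.

```
def merge_prompt_and_completion(prompt: str, completion: str) -> str:
    """Illustrative heuristic (not the actual post-processing used in our
    evaluation): if the completion restates the prompt's `def` line, keep only
    what follows it, so that appending to the original prompt stays valid."""
    if not prompt.endswith("\n"):
        prompt += "\n"
    def_line = next(
        line for line in prompt.splitlines() if line.lstrip().startswith("def ")
    )
    lines = completion.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.strip() == def_line.strip():
            return prompt + "".join(lines[i + 1:])
    return prompt + completion
```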

Completion given a 1-shot prompt

In this case, we provided a 1-shot prompt (not shown) in which the in-context example’s completion contains only the function body. During evaluation, we see below that the generated code likewise contains only the desired function body and is a continuation of the function-signature prompt.

```
    stack = []
    groups = []
    current_group = ''
    for char in paren_string.replace(' ', ''):
        if char == '(':
            stack.append(char)
            current_group += char
        elif char == ')':
            stack.pop()
            current_group += char
            if not stack:
                groups.append(current_group)
                current_group = ''
    return groups
```

Getting Started with Code Generation Evaluation

If you are interested in experimenting with code generation and want to perform evaluations similar to the ones presented in this blog post, you can access the code at the following GitHub repository: https://github.com/amazon-science/mxeval. This repository provides a comprehensive set of tools to perform code generation evaluations across a variety of programming languages.
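At its core, execution-based evaluation assembles the prompt, the model completion, and the benchmark’s test code into one program, runs it, and counts the sample as correct if it exits cleanly. Below is a minimal, language-agnostic sketch of that loop; it is my own illustration and not the actual mxeval implementation.

```
# Minimal, language-agnostic sketch of an execution-based check (illustration
# only, not the actual mxeval implementation): write the assembled program plus
# its test code to a temporary file, run it, and treat a clean exit as a pass.
import os
import subprocess
import tempfile

def passes_tests(program: str, test_code: str, command=("python",),
                 suffix: str = ".py", timeout: float = 10.0) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as f:
        f.write(program + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([*command, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)

# e.g. passes_tests(prompt + completion, tests)                       # Python
#      passes_tests(program, tests, command=("node",), suffix=".js")  # JavaScript
```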

Our datasets are also available on Hugging Face Datasets.

https://huggingface.co/datasets/mxeval/multi-humaneval
https://huggingface.co/datasets/mxeval/mbxp
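For quick experimentation, the problems can also be pulled directly with the Hugging Face datasets library. In the sketch below, the configuration name "python", the split "test", and the field name "prompt" are assumptions on my part; check the dataset cards above for the exact identifiers.

```
# Sketch of loading one language split of Multi-lingual HumanEval via the
# Hugging Face `datasets` library. The configuration "python", split "test",
# and field "prompt" are assumed -- consult the dataset card for exact names.
from datasets import load_dataset

problems = load_dataset("mxeval/multi-humaneval", "python", split="test")
print(problems[0]["prompt"])
```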

Part of the evaluation code used for this blog is also available in a fork of OpenAI evals: https://github.com/benathi/evals/tree/multi-humaneval-pr. For example, one can run the JavaScript evaluation with `oaievals multi-humaneval-js`.

Note that we built upon the Python-only HumanEval benchmark developed by OpenAI, as well as the MBPP benchmark created by Google, to expand the scope of evaluation to over 10 programming languages. We gratefully acknowledge the pioneering work of OpenAI and Google in this area.