Over the past week we have seen plenty of examples of GPT-4’s code generation abilities. Here’s a quick recap with three of my favorite examples.
I asked gpt4 to prototype this animation in swiftui. didn't write anything. pic.twitter.com/pMPgsu5CNR
— Meng To (@MengTo) March 16, 2023
Finally, I can again include tikZ figures in my lectures, talks and papers without wasting the precious time given to me on this earth 😅 pic.twitter.com/ncbgdK0jW4
— Tim Rocktäschel (@_rockt) March 16, 2023
With the assistance of GPT-4, I have successfully created a Space Invaders-like game that runs smoothly in my browser! The interactive learning process took only 10 minutes! pic.twitter.com/LkiztADO15
— Keisuke Sakaguchi (@KeisukeS_) March 16, 2023
After gaining access to GPT-4, I was thrilled to put it to the test with the code generation benchmarks Multi-lingual HumanEval and MBXP. The evaluation covered a wide range of programming languages and yielded impressive results, helping to quantify the model’s performance in each language.
Overall, the performance improvement over the previous models is as expected. However, we observed much higher scores than the numbers reported in the GPT-4 paper (more details below).
| Multi-lingual HumanEval | code-davinci-002 | text-davinci-003 | ChatGPT-3.5 (1-shot) | ChatGPT-4 (1-shot) |
|---|---|---|---|---|
| Python | 46.3% | 56.7% | 73.2% | 83.5% |
| Java | 49.1% | 52.2% | 60.9% | 78.3% |
| JavaScript | 51.6% | 58.4% | 66.5% | 71.4% |
| TypeScript | 50.9% | 55.9% | 64.6% | 78.9% |
| C# | 45.3% | 50.9% | 32.3% | 6.8% |
| Go | 21.9% | 35.0% | 34.4% | 50.0% |
| Kotlin | 39.8% | 50.3% | 59.0% | 68.9% |
| PHP | 52.8% | 58.4% | 63.4% | 74.5% |
| Perl | 36.0% | 34.2% | 55.3% | 68.3% |
| Ruby | 39.8% | 62.1% | 13.0% | 80.7% |
| Scala | 45.3% | 46.0% | 57.1% | 28.0% |
| Swift | 24.8% | 39.1% | 48.4% | 61.5% |
| Average | 42.0% | 49.9% | 52.34% | 62.58% |
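For context on how these scores are produced, here is a minimal sketch of the execution-based (functional-correctness) evaluation that HumanEval-style benchmarks use: the model’s completion is appended to the prompt, the resulting program is run against the problem’s unit tests, and pass@1 is the fraction of problems solved. The problem fields and helper names below are illustrative rather than the exact mxeval API.

```python
# Minimal sketch of execution-based (pass@1) scoring; the `problems`
# structure and helper names are illustrative, not the exact mxeval API.
import subprocess
import sys
import tempfile

problems = [
    {
        "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
        "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    },
]

def passes_tests(prompt: str, completion: str, test: str) -> bool:
    """Append the completion to the prompt, then run the unit tests."""
    program = prompt + completion + "\n" + test
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    # Real harnesses sandbox this step; never execute untrusted code directly.
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
    return result.returncode == 0

def pass_at_1(completions):
    """With one sample per problem, pass@1 is the fraction of problems solved."""
    solved = sum(
        passes_tests(p["prompt"], c, p["test"])
        for p, c in zip(problems, completions)
    )
    return solved / len(problems)

# A completion that contains only the function body:
print(pass_at_1(["    return a + b\n"]))  # -> 1.0
```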
Here are some of the key observations.
| Language | ChatGPT-3.5 (0-shot) | ChatGPT-4 (0-shot) | GPT-4 (0-shot, reported) |
|---|---|---|---|
| Python | 62.2% | 65.2% | 67.0% |
The syntax and the generated code do appear correct in this case. However, repeating the function signature makes the completion more susceptible to errors.
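To illustrate the issue, here is a purely hypothetical zero-shot output (not the actual model response from our evaluation): a chat model often re-emits the entire function rather than continuing the given signature.

```python
# Hypothetical zero-shot output, shown only to illustrate the failure mode;
# this is not the actual model response from the evaluation.
prompt = (
    "def has_close_elements(numbers, threshold):\n"
    '    """Check if any two numbers are closer to each other than threshold."""\n'
)

# A zero-shot chat model often re-emits the whole function, signature included:
completion = (
    "def has_close_elements(numbers, threshold):\n"
    "    for i, a in enumerate(numbers):\n"
    "        for b in numbers[i + 1:]:\n"
    "            if abs(a - b) < threshold:\n"
    "                return True\n"
    "    return False\n"
)

# Concatenation still runs here (the second definition shadows the first,
# docstring-only one), but every re-generated signature is another chance to
# mismatch argument names or drag in stray text.
namespace = {}
exec(prompt + completion, namespace)
print(namespace["has_close_elements"]([1.0, 2.0, 2.05], 0.1))  # True
```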
For the 1-shot evaluation, we provided a prompt (not shown) where the in-context example’s completion contains only the function body. During evaluation, the generated code then contains only the desired function body and reads as a continuation of the function-signature prompt.
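Here is a minimal sketch of that prompt format, with a hypothetical in-context example and completion rather than the actual prompt we used:

```python
# Hypothetical 1-shot prompt: the in-context example's completion is only a
# function body, signaling the output format we want from the model.
one_shot_example = (
    "def add(a, b):\n"
    '    """Return the sum of a and b."""\n'
    "    return a + b\n"
    "\n"
)

task_prompt = (
    "def is_palindrome(text):\n"
    '    """Return True if text reads the same forwards and backwards."""\n'
)

prompt = one_shot_example + task_prompt

# With this format, the model tends to continue with only the function body:
completion = "    return text == text[::-1]\n"

# Prompt plus completion is a single well-formed function, ready for the tests.
namespace = {}
exec(task_prompt + completion, namespace)
print(namespace["is_palindrome"]("level"))  # True
```

Because the model never restates the signature, simple concatenation of prompt and completion yields a well-formed function that can be executed directly against the benchmark’s unit tests.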
If you are interested in experimenting with code generation and want to perform evaluations similar to the ones presented in this blog post, you can access the code at the following GitHub repository: https://github.com/amazon-science/mxeval. The repository provides a comprehensive set of tools for code generation evaluation across a variety of programming languages.
Our datasets are also available on Hugging Face Datasets.
https://huggingface.co/datasets/mxeval/multi-humaneval
https://huggingface.co/datasets/mxeval/mbxp
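For programmatic access, both datasets can be loaded with the standard `datasets` library. The configuration and split names in the sketch below are assumptions, so check the dataset cards for the exact values.

```python
# Sketch of loading the benchmarks; the configuration name ("python") and
# split ("test") are assumptions -- check the dataset cards for exact values.
from datasets import load_dataset

multi_humaneval = load_dataset("mxeval/multi-humaneval", "python", split="test")
mbxp = load_dataset("mxeval/mbxp", "python", split="test")

print(multi_humaneval[0].keys())  # typically fields such as the prompt and unit tests
```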
Part of the evaluation code used for this blog is also available in a fork of OpenAI evals: https://github.com/benathi/evals/tree/multi-humaneval-pr. One can run the evaluation with `oaievals multi-humaneval-js` for JavaScript, for example.
Note that we built upon the Python-only HumanEval benchmark developed by OpenAI, as well as the MBPP benchmark created by Google, to expand the scope of evaluation to over 10 programming languages. We gratefully acknowledge the pioneering work of OpenAI and Google in this area.