Unreasonable Effectiveness of LLMs for Code Generation

At this point, we are no longer surprised by what language models can do. However, it is still unclear how language models acquire such remarkable abilities, especially in the area of code generation. This blog discusses the highlights from the paper Multilingual Evaluation of Code Generation Models, which give some clues as to why LLMs are so good at coding.

Out-of-Domain Generalization

If we train a model on one programming language, it turns out that such a model can also write code in other programming languages, especially when the model is large enough! Let’s look at the results and some sample generations.

Here, we train a decoder model on three languages: Python, Java, and JavaScript. We use the model to sample many code generations per problem and evaluate with the pass@k metric (one can think of it as accuracy given k chances). The result in Figure 1 shows that the model not only performs well on all languages it is trained on, but also performs well on unseen languages (PHP, Ruby, Kotlin). How is this possible?

Figure 1: pass@k scores (accuracy) versus sampling budget k
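For reference, pass@k is typically computed with the unbiased estimator from the Codex paper: draw n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k randomly chosen samples passes. A minimal sketch in Python (the helper name below is ours, for illustration):

import numpy as np

def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them passing."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples for a task, 37 of them pass -> estimated pass@10
print(estimate_pass_at_k(n=200, c=37, k=10))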

Natural Co-Occurrences of Multi-lingual Knowledge

It turns out that natural co-occurrences of multiple languages within a single source file are quite common in code data. Take the following example: a Python file that contains JavaScript wrapped in a string. This piece of data counts as Python data, since it is parsed by the Python interpreter and comes from a .py file. We refer to such multi-lingual occurrences of programming languages as knowledge spillover. Such spillover explains why training a language model on Python yields a model that can also write JavaScript.

The previous result shows the generalization of a multi-lingual model trained on three languages. Mono-lingual models can also generalize.

Figure 2: JavaScript as a Python string representing cross-programming-language knowledge spillover.

Multi-lingual versus Mono-lingual

Figure 3: pass@k scores (accuracy) versus model size

Figure 3 presents the results, including a comparison between multi- and mono-lingual models. There is a lot going on, but let’s break it down.

Figure 4: Knowledge composition of different programming languages within each primary language's data, due to the natural occurrence of data spillover.
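How much of one language "leaks" into another language's data can be estimated by scanning files of a primary language for string literals that look like another language. The sketch below is a simplified, hypothetical heuristic, not the paper's actual measurement pipeline: it extracts string constants from Python source with the ast module and flags ones containing JavaScript-looking markers.

import ast

JS_HINTS = ("function ", "var ", "=>", "console.log")  # crude JavaScript markers

def js_like_strings(python_source: str) -> list:
    """Return string literals in Python source that look like embedded JavaScript."""
    found = []
    for node in ast.walk(ast.parse(python_source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if any(hint in node.value for hint in JS_HINTS):
                found.append(node.value)
    return found

sample = 'code = "function add(a, b){ return a + b; }"'
print(js_like_strings(sample))  # -> ['function add(a, b){ return a + b; }']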

Large Multi-Lingual Models Really Shine

Zero-Shot Translation

Figure 5: Example of function completion with and without translation.
(a) Evaluation results on translation, illustrating that with access to reference solutions, the model can generate more correct functions compared to the baseline without translation (indicated by dots).
(b) Tasks that were previously difficult (low solve rate for the baseline) can become easily solvable with translation. For each task within MBXP (MBKP in this case), we show the fraction of generations that pass the tests over the total number of samples (solve rate), where the task indices are ranked to show increasing difficulty. The translation solve rate can be perfect (solve rate 1) for some tasks that originally have a solve rate of 0.
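Concretely, the translation setting gives the model a reference solution in a source language before the target-language prompt, and the model completes the function in the target language. Below is a minimal sketch of how such a prompt can be assembled; the exact prompt format used in the paper may differ, and the example task is made up for illustration.

def build_translation_prompt(source_solution: str, target_signature: str) -> str:
    """Prepend a solved source-language function to the target-language prompt."""
    return (
        "# Reference solution (Python)\n"
        f"{source_solution}\n"
        "// Complete the same function in Kotlin\n"
        f"{target_signature}\n"
    )

python_solution = (
    "def count_evens(xs):\n"
    "    return sum(1 for x in xs if x % 2 == 0)\n"
)
kotlin_signature = "fun countEvens(xs: List<Int>): Int {"
prompt = build_translation_prompt(python_solution, kotlin_signature)
# `prompt` is fed to the model, which generates the Kotlin function body.
print(prompt)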

Few-Shot Prompts Help LLMs on Out-of-Domain Languages

(a) Few-shot prompting: improvement on out-of-domain evaluation due to few-shot prompting, where the examples help guide the model to generate more correct code in the given language. (b) Few-shot prompting results in fewer non-assertion (compile, parsing, syntax) errors on out-of-domain (OOD) evaluation but has little effect in-domain (ID), consistent with the results in (a).
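A few-shot prompt for an out-of-domain language simply prepends a handful of solved examples in that language before the actual problem, so the model can pick up the target language's syntax from the prompt itself. A rough sketch, with made-up Kotlin examples for illustration:

FEW_SHOT_EXAMPLES = [
    # (prompt, completion) pairs in the target language
    ("// Return the maximum of two integers.\n"
     "fun maxOfTwo(a: Int, b: Int): Int {",
     "\n    return if (a > b) a else b\n}"),
    ("// Return true if n is even.\n"
     "fun isEven(n: Int): Boolean {",
     "\n    return n % 2 == 0\n}"),
]

def build_few_shot_prompt(task_prompt: str) -> str:
    """Concatenate solved target-language examples before the real task prompt."""
    shots = "\n\n".join(p + c for p, c in FEW_SHOT_EXAMPLES)
    return shots + "\n\n" + task_prompt

print(build_few_shot_prompt(
    "// Sort a list of integers in descending order.\n"
    "fun sortDescending(xs: List<Int>): List<Int> {"))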

Evaluation Datasets

All of the above analyses require evaluation datasets in many programming languages. In our work Multilingual Evaluation of Code Generation Models, we outline how we obtain such datasets by transpiling the original HumanEval and MBPP into HumanEvalX and MBXP. We also compose such datasets for other types of evaluation, such as code insertion and code robustness.

Figure : Evaluation Data Synthesis in 10+ Programming Languages.
Figure : Example of Dataset Language Conversion from Python to Java.
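At its core, the conversion transpiles the function signature and the test assertions from Python into the target language, while reusing the natural-language description. The sketch below is a heavily simplified illustration of turning a single MBPP-style assertion into a Java-style check (integer arguments only, made-up function name); the actual converter handles types, formatting, and many more cases.

import ast

def python_assert_to_java(assert_line: str) -> str:
    """Convert `assert f(args) == expected` into a Java-style check (ints only)."""
    node = ast.parse(assert_line).body[0]
    call, expected = node.test.left, node.test.comparators[0]
    args = ", ".join(ast.unparse(a) for a in call.args)
    return (f"if ({call.func.id}({args}) != {ast.unparse(expected)}) "
            "throw new AssertionError();")

print(python_assert_to_java("assert find_max(3, 7) == 7"))
# -> if (find_max(3, 7) != 7) throw new AssertionError();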

Appendix

Codex Performance

It is unclear what data the Codex models are trained on, or how much. However, a plausible guess is that they are trained on as much code data as possible, for a sufficient number of steps until performance plateaus.

Below, we show the results of code-cushman-001 and code-davinci-002 for reference. We can observe that both models perform quite well across all languages.

For the evaluation code, see (link to repo).
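The evaluation itself is execution-based: each generated solution is concatenated with the problem's test code and executed, and a sample counts as passing only if the program exits cleanly. A minimal sketch of such a check for Python samples, using a subprocess with a timeout (the real harness additionally sandboxes execution and supports all target languages):

import subprocess, sys, tempfile

def passes_tests(generated_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run a generated solution together with its unit tests in a fresh process."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

solution = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(solution, tests))  # -> True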


Table 1: Codex Performance on MBXP and HumanEvalX with pass@1 and greedy decoding.

Language          code-cushman-001   code-davinci-002

MBXP
  Python                43.7%              58.7%
  Java                  45.1%              61.0%
  JavaScript            46.4%              62.3%
  TypeScript            46.0%              58.9%
  C#                    46.2%              57.6%
  C++                   49.3%              65.7%
  Go                    32.7%              49.2%
  Kotlin                44.6%              60.5%
  PHP                   44.4%              60.7%
  Perl                  34.1%              44.0%
  Ruby                  43.7%              56.3%
  Scala                 41.9%              59.8%
  Swift                 31.3%              43.5%

HumanEvalX
  Python                32.3%              46.3%
  Java                  32.9%              49.1%
  JavaScript            28.0%              51.6%
  TypeScript            34.8%              50.9%
  C#                    34.8%              45.3%
  C++                       -                  -
  Go                    16.3%              21.9%
  Kotlin                23.0%              39.8%
  PHP                   31.1%              52.8%
  Perl                  14.9%              36.0%
  Ruby                  29.8%              39.8%
  Scala                 24.2%              45.3%
  Swift                 14.9%              24.8%

Unabridged Example of Knowledge Spillover

Below we show a full code snippet of a Python file where JS code is wrapped in a string.

"""Create a Javascript script to encode / decode for a specific encoding
described in a file available at
http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/<ENCODING>.TXT
"""

import os
import re
import json
import urllib.request

line_re = re.compile("^(0x[A-Z0-9]+)\s+(0x[A-Z0-9]+)*", re.M)

tmpl = "http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/{}.TXT"
encoding = input("Encoding name: ")
req = urllib.request.urlopen(tmpl.format(encoding.upper()))
data = req.read().decode("ascii")

root_dir = os.path.dirname(os.path.dirname(__file__))
libs_dir = os.path.join(root_dir, "www", "src", "libs")
filename = os.path.join(libs_dir, f"encoding_{encoding.lower()}.js")
with open(filename, "w", encoding="utf-8") as out:
    out.write("var _table = [")
    for line in data.split("\n"):
        mo = line_re.match(line)
        if mo:
            key, value = mo.groups()
            out.write(f"{key}, {value or -1},")
    out.write("]\n")
    out.write("var decoding_table = [],\n    encoding_table = []\n")
    out.write("""for(var i = 0, len = _table.length; i < len; i += 2){
var value = _table[i + 1]
if(value !== null){
    encoding_table[value] = _table[i]
}
decoding_table[_table[i]] = _table[i + 1]
}
$module = {encoding_table, decoding_table}
""")