On the Naturalness of Software

A paper in ICSE 2012 by Abram Hindle, Earl Barr, Zhendong Su, Mark Gabel, and Prem Devanbu. Abstract:

Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension.

We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations—and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse’s built-in completion capability. We conclude the paper by laying out a vision for future research in this area.

More of a visionary paper, but it seems to be the best work I've seen so far on how to apply ML to programming.


Link points to LtU

Could you correct it please?

Fixed. Sorry, I shouldn't

Fixed. Sorry, I shouldn't post before bed.

Broken link...

Is it this paper?

Quality

Naturalness of language within a project indicates repeating or similar structures in the stream of tokens. Repeating pieces of code is generally not a good way to write software; indeed, the authors emphasise that this is how code is written at scale and under pressure, rather than in ideal environments. I wonder if their technique for measuring cross-entropy could be used to indicate where code can be improved? That is, using the cross-entropy of particular regions of the program as an estimate of the "quality" of those regions. Verification and testing are both useful tools for improving software; this may be an orthogonal approach for finding where poor code may have problems, rather than looking directly for the problems themselves. This is roughly how I would interpret their comments in the final subsection of Future Directions.
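
To make that concrete, here is roughly what I have in mind; a minimal sketch in Python, assuming code has already been lexed into token streams (the trigram model and the per-function regions are stand-ins of my own, not anything from the paper):

```python
import math
from collections import defaultdict

def train_trigram(corpus_tokens, vocab_size):
    """Add-one smoothed trigram estimates from a training token stream."""
    tri, bi = defaultdict(int), defaultdict(int)
    for i in range(2, len(corpus_tokens)):
        ctx = (corpus_tokens[i - 2], corpus_tokens[i - 1])
        tri[ctx + (corpus_tokens[i],)] += 1
        bi[ctx] += 1
    return lambda ctx, tok: (tri[ctx + (tok,)] + 1) / (bi[ctx] + vocab_size)

def cross_entropy(tokens, prob):
    """Average bits per token under the model; higher means more 'surprising' code."""
    bits = [-math.log2(prob((tokens[i - 2], tokens[i - 1]), tokens[i]))
            for i in range(2, len(tokens))]
    return sum(bits) / max(1, len(bits))

def rank_regions(regions, prob):
    """regions: {function_name: token list}. Most surprising regions come first --
    candidate places to look at for review or refactoring."""
    return sorted(regions, key=lambda name: cross_entropy(regions[name], prob),
                  reverse=True)
```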

quality etc

Thanks for your interest. We did investigate whether code cross-entropy had any association with defective code in our corpora, as identified by bug-fixing reports in the software repos. There was some suggestion of a relationship in some plots, but we disagreed on methodology; we'll need to do more work before this is ready for peer-review.


There's a fuller version of the paper with a longer future-work section (including some interesting possibilities for using language models to improve accessibility for disabled programmers). The link up top points to this version now.

The draft watermark kind of

The draft watermark kind of detracts from the reading experience...

Folk theorems and killer heuristics

The core observation here is spot on and can be put to use for some pretty basic optimizations. For example, a few years ago we parallelized our lexers and parsers by speculating based on likely sequences, and later found that some of the Java guys had done something similar. Recently, we've been looking at using these ideas to vectorize FSMs/graphs by speculating on the hot paths. Likewise, Ben Livshits had some cool JavaScript compression experiments (including finding that gzip algorithms have efficacy similar to language-aware ones!).

I think the surprise here is showing that a simple model may be effective for cutting down choices in code completion, but that highlights a communal problem: I can't tell! For example, a big part of Dave Mandelin's work on jungloid mining was getting the machine learning algorithms to give good recommendations based on likely paths through the type lattice, and Sumit's PLDI paper this year is almost entirely about that. If I were to implement a code completion heuristic, after having read all three works, I still don't know which algorithm I should pick or mix. ML often comes down to heuristics (and mixing them), and we might take a further lesson from that field: public corpora and test sets shed light. Getting a Netflix-style challenge for code completion -- at least for these fully automated algorithms -- seems important if these ideas are to build upon each other.
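
Even a tiny shared harness would go a long way; a sketch of the kind of head-to-head comparison I mean (the completion engines and the shared `cases` set here are hypothetical):

```python
def top_k_accuracy(completer, test_cases, k=5):
    """test_cases: (context_tokens, expected_next_token) pairs drawn from a
    held-out corpus. completer(context) returns a ranked candidate list.
    Feeding the same cases to every algorithm makes the numbers comparable."""
    hits = sum(1 for context, expected in test_cases
               if expected in completer(context)[:k])
    return hits / len(test_cases)

# e.g. top_k_accuracy(ngram_completer, cases) vs. top_k_accuracy(jungloid_completer, cases),
# where both engines and `cases` come from whatever public corpus the community settles on.
```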

I wonder how much of the

Interesting paper! I wonder how much of the effect in figure 3 is simply because the same identifiers appear multiple times in a project. How much of the effect remains if all project-local identifiers are artificially made indistinguishable?
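
The experiment seems cheap to run; a rough Python sketch of what I mean by "indistinguishable" (the token stream and the project's public/API vocabulary are assumed inputs, and the keyword list would of course be the target language's):

```python
import keyword

def anonymize_locals(tokens, public_names):
    """Replace every identifier that isn't a keyword or part of the project's
    public/API vocabulary with one placeholder, so repeats of project-local
    names can no longer help the language model."""
    keep = set(public_names) | set(keyword.kwlist)  # Python keywords as a stand-in
    return [tok if not tok.isidentifier() or tok in keep else "<ID>"
            for tok in tokens]

# Re-training the n-gram model on anonymized streams and re-plotting the
# comparison would show how much of the gap in figure 3 survives.
```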

For practical purposes I think it's more interesting how much statistical techniques can add to deterministic semantic techniques, rather than how much they can do on their own. For example, if you already take the syntax and the lexical environment into account, and perhaps also type or even unit-test information, how much can statistical techniques improve accuracy on top of that?

Perhaps it's also interesting to complete at the AST level (especially in a structured editor ;)). Given an AST with a hole, infer a probability distribution over nodes to fill that hole. For each possible completion, you compute semantic information like type correctness and feed that information as additional features to the model.
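
A sketch of that pipeline (the tree model and type checker are hypothetical interfaces, and type correctness is used as a hard filter rather than a soft feature, just to keep it short):

```python
def complete_hole(ast_with_hole, candidates, tree_model, type_checks):
    """Rank candidate subtrees for the hole: ask the statistical tree model for a
    probability, compute a semantic fact (does the result type-check?), and
    combine them -- here by discarding ill-typed candidates and sorting the rest."""
    scored = []
    for node in candidates:
        prob = tree_model.probability(ast_with_hole, node)   # hypothetical API
        if type_checks(ast_with_hole, node):                 # hypothetical check
            scored.append((prob, node))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [node for _, node in scored]

# A fuller version would pass type correctness (and other semantic facts) to the
# model as features instead of hard-filtering, as suggested above.
```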

Hybrid approaches

I have yet to learn about any kind of learning technique that works over a tree-shaped corpus. At best, I think you can take the AST and translate it back into a stream that emphasizes and annotates the features you care about. But I'm not really that familiar with this area.

You could easily do some

You could easily do some analog of n-grams: a 1-gram is an individual node, a 2-gram is a node plus its children, a 3-gram is a node plus its children plus their children, etc. I think this captures the structure of programs much better than token n-grams. For example, suppose you have a node N(A(B,C),_) and we're trying to complete the _. The tokens here are `N` LPAREN `A` LPAREN `B` COMMA `C` RPAREN COMMA `_` RPAREN. The `N` is very important information for completing the `_`, but you already have to use 10-grams for the `N` token even to enter into consideration for the `_`. In general, the parent of a node is much more relevant to a completion than whatever tokens are directly to its left (or right, but I don't think this paper considers tokens to the right?).
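
Here's a rough sketch of what I mean, with a toy tuple-based AST (nothing from the paper, just an illustration of extracting those tree n-grams):

```python
# Toy AST: an inner node is (label, child, child, ...); a leaf is just a string.
def tree_ngrams(node, depth):
    """Yield every subtree truncated to `depth` levels: depth=1 gives 1-grams
    (single labels), depth=2 a node plus its children, and so on."""
    if isinstance(node, str):
        yield node
        return
    yield truncate(node, depth)
    for child in node[1:]:
        yield from tree_ngrams(child, depth)

def truncate(node, depth):
    if isinstance(node, str):
        return node
    if depth == 1:
        return node[0]
    return (node[0],) + tuple(truncate(child, depth - 1) for child in node[1:])

# Counting these over a corpus estimates P(child subtree | parent context), so when
# completing the hole in ("N", ("A", "B", "C"), "_") the parent `N` is available
# immediately, instead of only showing up at the 10-gram level in the token stream.
```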

But the biggest advantage is that by using ASTs you automatically capture the grammar, and it's easy to extract contextual information. For example, if you build a model on top of characters (or even tokens), you need a strong statistical model to even begin to capture the grammar of the language, so that you don't present syntactically invalid completions (assuming you want to complete more than just identifier names). In general, if you have structure that's already there, it's better to incorporate it directly into your model than to try to have a statistical model infer that already-known structure. It's quite unlikely that even very sophisticated machine learning techniques are going to infer from a corpus how to do type-based completion.

In general if you have

In general, if you have structure that's already there, it's better to incorporate it directly into your model than to try to have a statistical model infer that already-known structure.

I don't think the ML community would necessarily agree. From what I understand, seeding the system with already-known structure can have adverse effects, biasing the learning process; better to let the system relearn what you already know on its own. Of course, that isn't always practical (a system can't learn how to program on its own yet with current ML technology).

It depends on the situation.

It depends on the situation. If (1) you have a lot of training data, and (2) the structure you're adding is only approximately true, then it can be a bad idea. If you don't have a lot of training data, it can help to incorporate structure into a model even though it's not necessarily always true. For example, in text classification you can turn all text into lowercase, which encodes our prior belief that case doesn't matter for the meaning of text (which is only approximately true). If you only have a small training set, this will probably improve performance. Conversely, if you have a virtually unlimited training set, it might even be a bad idea to split words at whitespace boundaries: a learning algorithm can potentially figure that out on its own, and then it will also work in edge cases where words are not split the way the algorithm designer expected. But if the structure is perfectly followed, then it cannot hurt to incorporate it in the model.

In this case, I think the structure is perfectly followed: you don't ever want to autocomplete syntactically invalid tokens (this is true by definition in a structured editor, btw). A learning algorithm would also figure that out on its own if the model is expressive enough to capture the grammar of the language, the corpus of training data is big enough, and the code in the corpus is syntactically valid (which it most likely is). So in the limit of a very strong model and unlimited training data, you'd arrive at the same solution anyway. But note that while the corpus might be big enough, the model is probably not expressive enough to capture a programming language's grammar (token n-grams certainly aren't).

We can include known

We can include known structure either on the backend (ML inputs) or the frontend (ML outputs). On the backend, we could, say, include it in the corpus somehow, or bias the ML algorithm with knowledge about it. On the frontend, we could use it to filter results: e.g., when the ML algorithm proposes "incorrect" completions, we just ignore those and pick the highest-probability correct completions.
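
The frontend route is trivially easy to prototype; a minimal sketch, assuming a hypothetical `rank_candidates` call on the language model and an `is_valid` predicate standing in for the parser or type checker:

```python
from itertools import islice

def filtered_completions(context, language_model, is_valid, k=10):
    """Frontend filtering: let the statistical model propose freely (best first),
    then keep only the proposals the deterministic checks accept, and return
    the k most probable survivors."""
    proposals = language_model.rank_candidates(context)   # hypothetical API
    return list(islice((c for c in proposals if is_valid(context, c)), k))
```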