Structured Generative Models of Natural Source Code

Interesting/weird paper on arXiv; abstract:

We study the problem of building generative models of natural source code (NSC); that is, source code written and understood by humans. Our primary contribution is to describe a family of generative models for NSC that have three key properties: First, they incorporate both sequential and hierarchical structure. Second, we learn a distributed representation of source code elements. Finally, they integrate closely with a compiler, which allows leveraging compiler logic and abstractions when building structure into the model. We also develop an extension that includes more complex structure, refining how the model generates identifier tokens based on what variables are currently in scope. Our models can be learned efficiently, and we show empirically that including appropriate structure greatly improves the models, measured by the probability of generating test programs.
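The "variables currently in scope" extension in the abstract can be sketched in miniature: when the generative model must emit an identifier token, it restricts its choice to names in the current scope stack rather than the whole vocabulary. This is a toy illustration; the class, weights, and scope mechanics below are my assumptions, not the paper's actual model.

```python
import random

class ScopedIdentifierModel:
    """Toy sketch: identifier generation restricted to in-scope names."""

    def __init__(self, weights):
        # weights: a learned preference over names (hypothetical values here).
        self.weights = weights
        self.scopes = [[]]          # stack of scopes, innermost last

    def enter_scope(self):
        self.scopes.append([])

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name):
        self.scopes[-1].append(name)

    def in_scope(self):
        return [n for scope in self.scopes for n in scope]

    def sample_identifier(self, rng):
        # Sample only among in-scope names, weighted by the learned preferences.
        candidates = self.in_scope()
        w = [self.weights.get(n, 1.0) for n in candidates]
        return rng.choices(candidates, weights=w, k=1)[0]

model = ScopedIdentifierModel({"i": 3.0})
model.declare("total")
model.enter_scope()
model.declare("i")
# Any sampled identifier is guaranteed to be bound in the current scope:
print(model.sample_identifier(random.Random(0)))
```

The point of the constraint is that the model can never generate an unbound identifier, which is one concrete way "compiler logic and abstractions" shape the model's structure.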

This field of "natural source code" is completely new to me, and seems to arise from this paper, whose abstract reads:

Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations---and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's completion capability. We conclude the paper by laying out a vision for future research in this area.
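The n-gram approach the abstract describes is simple enough to sketch: count how often token pairs co-occur in a corpus, then rank completions by frequency. The corpus and tokenization below are toy stand-ins, not the paper's Java setup.

```python
from collections import Counter, defaultdict

def train_bigrams(token_lists):
    """Count successor frequencies for each token over tokenized code."""
    bigrams = defaultdict(Counter)
    for tokens in token_lists:
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[prev][cur] += 1
    return bigrams

def predict_next(bigrams, prev_token, k=3):
    """Suggest up to k completions ranked by corpus frequency."""
    counts = bigrams.get(prev_token)
    if not counts:
        return []
    return [tok for tok, _ in counts.most_common(k)]

# Toy corpus: token streams from two Java-like loop headers.
corpus = [
    ["for", "(", "int", "i", "=", "0", ";"],
    ["for", "(", "int", "j", "=", "0", ";"],
]
model = train_bigrams(corpus)
print(predict_next(model, "for"))   # most frequent successors of "for"
```

Because code is even more repetitive than natural language (the paper's empirical finding), even this crude bigram model captures a surprising amount of local structure, which is why it can already sharpen a completion engine.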


first they came for the blue collars

and i did not speak out.
then they came for the white collars...

Statistical program synthesis

Program synthesis has been moving in this direction for a while now. The synthesizer generates real or logical ASTs and, whatever testing/verification/enumeration procedure is in use, biases the selection by training on a corpus. The models, depending on the sophistication of the team, draw from the usual ML toolbox. Just as in the paper, heuristics are also introduced, and they adapt to the project.
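The corpus-biased enumeration described above can be sketched as: enumerate small candidate ASTs, keep those consistent with the spec's input/output examples, and prefer the candidate whose productions are most common in the training corpus. Everything here (the production scores, the tiny expression grammar) is an illustrative assumption, not any particular synthesizer's design.

```python
import itertools

# Hypothetical "learned" scores for productions (higher = more common in corpus).
PRODUCTION_SCORE = {"add": 2.0, "mul": 1.0, "var": 1.5, "const": 0.5}

def expressions(depth):
    """Enumerate small ASTs: ('var',), ('const', c), ('add'/'mul', l, r)."""
    if depth == 0:
        yield ("var",)
        for c in (0, 1, 2):
            yield ("const", c)
        return
    yield from expressions(depth - 1)
    for op in ("add", "mul"):
        for l, r in itertools.product(expressions(depth - 1), repeat=2):
            yield (op, l, r)

def score(ast):
    """Total corpus-learned score of an AST's productions."""
    s = PRODUCTION_SCORE[ast[0]]
    if ast[0] in ("add", "mul"):
        s += score(ast[1]) + score(ast[2])
    return s

def evaluate(ast, x):
    if ast[0] == "var":
        return x
    if ast[0] == "const":
        return ast[1]
    l, r = evaluate(ast[1], x), evaluate(ast[2], x)
    return l + r if ast[0] == "add" else l * r

def synthesize(tests, depth=1):
    """Highest-scoring AST consistent with all (input, output) examples."""
    best = None
    for ast in expressions(depth):
        if all(evaluate(ast, x) == y for x, y in tests):
            if best is None or score(ast) > score(best):
                best = ast
    return best

# Spec: f(x) = x + 1, given as input/output examples.
print(synthesize([(1, 2), (3, 4)]))
```

Among the candidates that satisfy the examples, the corpus bias breaks ties toward idiomatic shapes, which is exactly where the "stylistically correct output" benefit comes from.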

I like the statistical approach for two reasons: the synthesized output will be stylistically correct (imagine handing it to new or junior devs!), and we can start answering questions about programming that we traditionally shy away from, e.g., the nature of code evolution.