LtU Forum

Structured Generative Models of Natural Source Code

Interesting/weird paper on Arxiv, abstract:

We study the problem of building generative models of natural source code (NSC); that is, source code written and understood by humans. Our primary contribution is to describe a family of generative models for NSC that have three key properties: First, they incorporate both sequential and hierarchical structure. Second, we learn a distributed representation of source code elements. Finally, they integrate closely with a compiler, which allows leveraging compiler logic and abstractions when building structure into the model. We also develop an extension that includes more complex structure, refining how the model generates identifier tokens based on what variables are currently in scope. Our models can be learned efficiently, and we show empirically that including appropriate structure greatly improves the models, measured by the probability of generating test programs.

This field of "natural source code" is completely new to me, and seems to arise out of this paper, abstract:

Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations---and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether a) code can be usefully modeled by statistical language models and b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's completion capability. We conclude the paper by laying out a vision for future research in this area.

Booleans vs strings

I'm working on a simple, dynamically-typed language for business users. We're trying to have as few datatypes as possible to reduce the number of things they have to learn (currently just strings, numbers and tables). I'm considering just using "true" and "false" strings instead of boolean values. I can't see any arguments against it but I can't help feeling that I'm going to regret it.

Is this sensible? The only language I'm aware off that does something similar is TCL, which also allows yes/no/on/off.

EDIT: I suppose erlangs true/false atoms are similar. That gives me some encouragement.

Designing an alternative to s-expressions for language extensibility

While designing my own programming language, I experimented a lot with syntax, but I quickly determined that there are three things I really wanted:

  • I wanted the language to be as extensible as possible: if I wanted to add some control structure like "unless" or transform code arbitrarily with macros, then I should be able to.
  • I wanted syntax as regular as possible and a simple program representation, so that extensions could focus on semantics rather than grammar rules.
  • I wanted the language to look clean and to have little punctuation, similar to Python. I wanted infix operators. S-expressions are regular and simple, but I think they involve too much punctuation and I wasn't willing to compromise on aesthetic preference.

This was an incremental process that I barely documented as I went through it, but in the end I think I came up with something interesting, a kind of syntax that I call o-expressions. I have thought about it a good deal post hoc so that I could explain and justify my findings and I am at a point where I would appreciate some external feedback and criticism. I wrote two blog posts about it:

First, this post describes Liso, a translation of my syntax to s-expressions. It contains many examples, so you can get an intuitive grasp of it. It demonstrates a concrete use as syntactic sugar for Lisp-like languages and I implemented it as a Racket #lang, so you can try it out if you want (source).

My second post is more theoretical. In it I build o-expressions from scratch using principles that I believe are essential to language extensibility.

I plan to keep working on the syntax and port it to other languages, so feedback would be very welcome :)

Minimal implementation of state machines

Dear all, I fear this may be a really strange question, but LtU seems the best place to ask it anyway. I am working on a programming language, Casanova, built for interactive applications and games in particular. The language is built in order to enable applications to implicitly update their state with continuously run "rules" which can also be suspended easily. For example, a lightswitch would be implemented quite easily as:

world LightSwitch = {
  Sprite  : Sprite
  Pressed : bool
  rule Pressed = 
    when(IsKeyDown(Keys.Space))
    yield true
    yield false
    when(IsKeyUp(Keys.Space))
  rule S.Color = 
    yield Color.White
    when Pressed
    yield Color.Black
    when Pressed
}

I would like to try and formalize why this approach is better than existing patterns commonly found in the literature for encoding state machines. I do not wish to do a survey of existing techniques and count lines of code or number of statements. This is clearly a viable approach, but for once I would like to go formal instead of empirical.

My goal is to show a way to encode the program above in a simple imperative language. I would then like to prove that this encoding is minimal with respect to the number of statements used. Finally, I would like to show that the encoding produces code which contains far more statements than the Casanova implementation.

The issues I have are: I have no clue how to go with a proof of minimality for some encoding, and I wonder if the overall approach sounds convincing to you guys.

And the Academy Award goes to... a literate program

Matt Pharr, Greg Humphreys, and Pat Hanrahan have recently been given an Academy Award for Technical Achievement, for the book Physically Based Rendering. This is the first time the award has been given to a book and (more relevant to LtU) the first time a literate program has won an Academy Award.

Examples for benefitfs of dynamic programming languages

Hi,
we are currently working on the design of a controlled experiment, where we hope to measure a benefit of dynamically typed programming languages compared to statically typed ones.

Could anyone of you imagine concrete situation, where the static type system is more an obstacle than a helpful tool?
To clarify my intent: I do not want to start another "which one is better"-discussion, but hope to get concrete (code-) examples of situations where dynamically typed languages are superior.

POPL 2014 proceedings available freely for all

The proceedings can be downloaded from the POPL webpage.

I find this extremely exciting (not only because I didn't get funds to attend POPL this year). To my knowledge, this is the first time that this is done in POPL/ICFP/PLDI; electronic proceedings were previously only delivered to attendees, with an explicit request not to share them.

I am not sure what is the reasoning that make which people decide to do it this year, or not to do it before. I hope that the proceedings will remain available after the conference (next week), and that this idea will be adopted for the years to come.

ACPUL - Another CPU Language - a{};

Hi there,

I create new platform & language for mobile development. This can run on mobile like Codea or AIDE.
New view for game & demo making.

Here is IDE screenshots
http://acpul.org/

Github & source code (compiler & language)
https://github.com/d08ble/acpu/blob/master/samples/TestProject/1017_sys.box2d.acpul

Very simple minimal math language with ideas like STEPS

I'm looking for somebody who can help with beta-testing & demo making, language update, etc.

Pragmatic aspects of dimension types, and the problem of angles

Dimension types are extremely useful for preventing errors in programs that manipulate physical quantities. There are several Haskell libraries for them, F# has a built-in version, etc. (I've read about a nice Python package that addresses the same issues dynamically, which may be a good source of inspiration, though my personal interest is in static approaches.)

Of course there is the thorny issue of selecting a basis with which to express dimensions. The dimensions of the SI base units are a common choice. The dimensions of the cgs units are another appealing choice in some respects, though you do need to deal with fractional powers (e.g. 1 esu = 1 g^1/2 cm^3/2 s^-1, so the dimension of a charge is Mass^1/2 Distance^3/2 Time^-1, whereas in the SI system it is Current^1 Time^1). There is also some appeal to allowing the creation of additional user-defined basic dimensions in addition, to allow dimensions like Vehicle^1 Time^-1 for the rate at which traffic is flowing.

Since my target users are engineers, I am inclined to choose the dimensions of the SI base units. For simplicity I will set aside the question of additional user-defined basic dimensions, because it is only slightly related to my primary question.

There is broad agreement (for decent reasons) that angles are dimensionless, and there is a long history of mostly treating them that way in the mathematics, physics, and engineering literature. Yet pragmatically, serious errors can arise when revolutions are mistaken for radians or angular velocities are confused with frequencies. Factors of 2 pi are liberally sprinkled all over the place, and it is pretty easy to forget or misplace one.

This form of confusion and traditional resort to name-based disciplines seems very much like the type of confusion and lack of formality that dimension types are intended to address. I find that I much prefer my toy system where angles are treated as dimensions to the version in which they are not, because it requires me to annotate all the places where numbers turn into angles (and vice versa) with the "units" that I am intending to use. This naturally leads to extremely good documentation of interfaces, serialization formats, and interoperation with external code.

There is a related problem where torques and energies share the same SI dimension (Mass^1 Distance^2 Time^-2), which can be helped by pretending that angles have a dimension and treating the dimension of torques is Mass^1 Distance^2 Time^-2 Angle^-1. This does lead to some problems, though. For example, torque is no longer equal to the cross product of displacement and force. (Or is it? Both this use of the cross product and the use in defining curl seem to involve angles in the same way, so for a minute I considered changing the type of my cross product operator to introduce a factor of Angle^-1. I don't think it is helpful to do this, though, because I can't figure out how to stand on my head and imagine the use in the Lorentz force equation to involve implicit introduction and elimination of an angle, so it appears that this angular nature is not an inherent property of the cross product operator.) Another approach (not entirely an exclusive choice) is to distinguish through the vector nature of torque and the scalar nature of energy; now taking the norm is where your turns/degrees/radians ambiguity arises, and also it is not clear to me how this approach would work in fewer dimensions.

Should angles be modeled as a dimension for software engineering purposes? Should they be modeled in some other way? Should they be left implicit?

(To throw another wrench in the works, similar remarks apply to solid angles, although they may arise considerably less often for my target users. It also feels arbitrary to stop after 3 dimensions, and I assume--although I am not familiar with one--that there is some sort of generalization of solid angles to n-dimensional space.)

PuzzleScript

I haven't seen this discussed here yet: http://www.puzzlescript.net/

It is an HTML5-based puzzle game engine that uses a simple language for patterns and substitutions to describe game rules. For example (taken from their introduction), the basic block-pushing logic of a Sokoban game can be given as:

[ > Player | Crate ] -> [ > Player | > Crate ]

This line says that when the engine sees the pattern to the left of ->, it should replace it with the pattern on the right. In this case, the rule can be read as something like: when there is a row or column ([]) that contains a player object (Player) next to (|) a crate object (Crate), and the player is trying to move toward the crate (>), then (->) make the crate move in the same direction.

Rules are matched and applied iteratively at each step (i.e., when the player acts), until there are no more matches, and then movement takes place. There are mechanisms for influencing the order in which rules are run, and for forcing subsets of the rules to be iterated. By default, rules apply to both rows and columns, but horizontal- or vertical-only rules can be created.

It is an interesting example of a very narrowly-focused DSL, based on a (relatively) uncommon model of computation. It's also very fun to play with!

XML feed