Programming with a Differentiable Forth Interpreter

On arXiv. Abstract:

There are families of neural networks that can learn to compute any function, provided sufficient training data. However, given that in practice training data is scarce for all but a small set of problems, a core question is how to incorporate prior knowledge into a model. Here we consider the case of prior procedural knowledge such as knowing the overall recursive structure of a sequence transduction program or the fact that a program will likely use arithmetic operations on real numbers to solve a task. To this end we present a differentiable interpreter for the programming language Forth. Through a neural implementation of the dual stack machine that underlies Forth, programmers can write program sketches with slots that can be filled with learnable behaviour. As the program interpreter is end-to-end differentiable, we can optimize this behaviour directly through gradient descent techniques on user specified objectives, and also integrate the program into any larger neural computation graph. We show empirically that our interpreter is able to effectively leverage different levels of prior program structure and learn complex transduction tasks such as sequence sorting or addition with substantially less data and better generalisation over problem sizes. In addition, we introduce neural program optimisations based on symbolic computation and parallel branching that lead to significant speed improvements.
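To make the "program sketches with slots" idea a bit more concrete, here is a minimal Python sketch of the underlying trick. This is not the authors' actual interpreter and all names are made up: a slot is modelled as a learnable distribution over a handful of candidate stack operations, every candidate is executed, and the resulting stacks are blended by the slot's weights, so the whole step is differentiable with respect to the slot parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Candidate operations a slot could learn to choose between
# (toy versions; the real machine works on value distributions).
def op_add(stack):   # pop two values, push their sum
    return stack[:-2] + [stack[-2] + stack[-1]]

def op_swap(stack):  # swap the top two values
    return stack[:-2] + [stack[-1], stack[-2]]

def op_dup(stack):   # duplicate the top value
    return stack + [stack[-1]]

CANDIDATES = [op_add, op_swap, op_dup]

def soft_step(stack, slot_logits):
    """Run every candidate op and blend the resulting stacks by the
    slot's softmax weights, so the step stays differentiable."""
    weights = softmax(slot_logits)
    results = []
    for op in CANDIDATES:
        out = op(list(stack))
        # zero-pad to a common depth just so the stacks can be averaged;
        # a real differentiable stack blends the depth with a soft pointer
        out = out + [0.0] * (len(stack) + 1 - len(out))
        results.append(np.array(out))
    return sum(w * r for w, r in zip(weights, results))

print(soft_step([1.0, 2.0, 3.0], slot_logits=np.array([0.1, 2.0, -1.0])))
```

Training would then mean nudging `slot_logits` by gradient descent until the blended behaviour matches the desired outputs, after which the distribution can (ideally) be snapped to its argmax to recover an ordinary discrete program.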


Problem domain mismatch?

I think that there is a serious impedance mismatch here between the solution methodology and the problem domain. Differentiable or subsymbolic methods are good for continuous function approximation, estimation, or what we like to consider "judgment calls" where we can't really articulate a precise method by which we arrive at a decision. Using them to do discrete mathematics (Forth interpretation, FizzBuzz, etc.) doesn't play to their strengths.

The problem with differentiable models is that they are inherently imprecise. As Grus pointed out in the FizzBuzz post, the machine-learning approach got some of the outputs (just a few, but still) wrong. And as the paper on Neural Turing Machines shows, the NTM makes occasional mistakes in executing the algorithms it has learned (miscounting indices, for example, or gradually forgetting data that is held for a long time while other steps execute), even when those algorithms are apparently correct. These are definite, discrete-mathematics problems, and using an approximate method to solve them is inappropriate.

That's not to say that approximate methods are entirely useless for discrete problems, but their application requires a more sophisticated approach. One could use a differentiable model with additional inputs that inform it how far its current approximations and outputs are from precise integers, in the belief that it could use that additional information to learn to compensate. Or one could attempt to use a differentiable method to transform problem statements into sequences of symbols expressing discrete solution methods.
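A rough sketch of that first suggestion (all names made up, and it is only a sketch): expose each output's distance from the nearest integer as an extra feature, or as an auxiliary penalty, so the model has an explicit signal it can learn to drive towards zero.

```python
import numpy as np

def integerness(x):
    """Distance of each value from the nearest integer, in [0, 0.5]."""
    return np.abs(x - np.round(x))

outputs = np.array([2.98, 4.01, 0.47])    # some model's current real-valued outputs
extra_feature = integerness(outputs)      # could be fed back in as an input
aux_penalty = (extra_feature ** 2).mean() # or added to the training loss

print(extra_feature, aux_penalty)
```

Because rounding is piecewise constant, this distance still has a useful gradient almost everywhere, so it can sit inside a differentiable objective.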

In this case it would probably be more appropriate to ask a machine-learning method (perhaps differentiable, perhaps something else like a genetic algorithm) to produce a sequence of discrete symbols to be executed on a non-differentiable Forth interpreter. That would give a repeatable, precise answer that is either definitely correct or definitely flawed.
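Roughly what that could look like, as a toy (the interpreter, the vocabulary, and the brute-force search below are all stand-ins; a genetic algorithm or a neural proposer would replace the enumeration): candidate programs are plain sequences of discrete words, they run on an ordinary non-differentiable stack interpreter, and each one is simply right or wrong on each test case.

```python
import itertools

# A tiny, entirely ordinary (non-differentiable) stack interpreter.
WORDS = {
    "dup":  lambda s: s + [s[-1]],
    "swap": lambda s: s[:-2] + [s[-1], s[-2]],
    "+":    lambda s: s[:-2] + [s[-2] + s[-1]],
    "*":    lambda s: s[:-2] + [s[-2] * s[-1]],
    "drop": lambda s: s[:-1],
}

def run(program, stack):
    for word in program:
        if not stack or (len(stack) < 2 and word in ("swap", "+", "*")):
            return None           # stack underflow: reject the program
        stack = WORDS[word](list(stack))
    return stack

def score(program, cases):
    """How many test cases the program gets exactly right."""
    return sum(run(program, [a]) == [want] for a, want in cases)

# Example target behaviour: given a on the stack, leave a*a + a.
cases = [(a, a * a + a) for a in (2, 3, 5)]

# Brute-force enumeration of all programs up to length 4.
best = max((p for n in range(1, 5) for p in itertools.product(WORDS, repeat=n)),
           key=lambda p: score(p, cases))
print(best, score(best, cases))   # e.g. ('dup', 'dup', '*', '+') scoring 3/3
```

The point is just that the output is a concrete program you can inspect and rerun in any interpreter, with no residual fuzziness about what it does.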

I am really fascinated by this topic, btw....

The idea of evolving procedural programs seems extraordinarily rich to me, and I anticipate it being a major part of the future of code. Right now we are restricted to very trivial examples, in large part by the impedance mismatch I outlined above. But my impression is that this is the sort of thing that might grow drastically more powerful given a few significant insights, better bootstrapping techniques, and a clearer sense of where and how to apply it.

I would love to participate in a discussion about it among knowledgeable people.

In particular I'm interested in the possibilities of trainable systems for code deduplication and program optimization. With natural language we can get away with spitting out things that are sort-of-associated-with the concepts expressed by the input and then training until the meaning gradually becomes more precise; operations on code, by contrast, must produce meaningful programs that satisfy objective criteria of correctness.

To achieve a differentiable system it's necessary to be able to interpolate between things. But interpolation among opcode values for opcodes with fundamentally different meanings (with respect to flow of control in particular) is at best difficult. The differentiable Forth interpreter above is horrifyingly inefficient in memory and time, although it may eventually produce (extremely simple) programs that operate at normal speed in normal interpreters.
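For what it's worth, here is a toy illustration (not taken from the paper, and the names are invented) of why flow of control is the painful part: a "soft" conditional jump cannot commit to a single program counter, so the machine has to carry a probability distribution over addresses and, at every step, execute every address that still holds mass. That is essentially where the memory and time blow-up comes from.

```python
import numpy as np

N_ADDR = 8
pc = np.zeros(N_ADDR)
pc[0] = 1.0                        # all probability mass starts at address 0

def soft_jump(pc, branch_addr, target_addr, p_taken):
    """Split the mass sitting on a branch instruction between its jump
    target and the fall-through address, weighted by a soft condition."""
    new_pc = pc.copy()
    mass = new_pc[branch_addr]
    new_pc[branch_addr] = 0.0
    new_pc[target_addr] += p_taken * mass
    new_pc[branch_addr + 1] += (1.0 - p_taken) * mass
    return new_pc

pc = soft_jump(pc, branch_addr=0, target_addr=5, p_taken=0.3)
print(pc)   # mass is now on addresses 1 and 5, so both paths must be executed
```

Collapsing the distribution back onto whole instructions (the "extremely simple programs at normal speed" case) would only work once training has pushed the branch probabilities close to 0 or 1.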

I have a few woolly ideas about doing this which I haven't tested yet. I'm fascinated to see papers like this that are investigating such ideas.

word

The main thing that cracks me up / gives me some ironic hope about all this is that somebody would have to specify what "good", acceptable program behaviour actually is. That's something you'd think we'd already be doing if we were any good at software development, but very, very, very few people actually do, for various and sundry reasons.