
Automatically learning grammars from text

I am interested in developing an app that lets users feed it arbitrary sample text and then write similar text using an autocompletion tool. The autocompletion tool is based on generating text from a context-free grammar and will be considerably more powerful than T9/search-bar autocomplete because it can generate entire phrases or blocks of code with adjustable components. I have a mockup of the tool here that generates JSON: http://nathanathan.com/cfg-gen/ However, my mockup doesn't offer suggestions for default values at this time.
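
To make the discussion concrete, here is a minimal sketch (in Python, not the actual cfg-gen code) of generation from a context-free grammar, using a representation I'll reuse below: a dict mapping each nonterminal to a list of alternatives, each alternative being a list of symbols. The toy grammar is hypothetical; in the tool, each random choice would instead be a decision presented to the user.

import random

# Hypothetical JSON-ish toy grammar: nonterminal -> list of alternatives,
# each alternative a list of symbols (terminals or nonterminals).
GRAMMAR = {
    "VALUE": [["OBJECT"], ["null"], ["0"]],
    "OBJECT": [["{\"key\": ", "VALUE", "}"]],
}

def generate(symbol, grammar):
    # Symbols with no rule are terminals; emit them verbatim.
    if symbol not in grammar:
        return symbol
    # In the autocompletion tool, this choice is the user's decision point.
    alternative = random.choice(grammar[symbol])
    return "".join(generate(child, grammar) for child in alternative)

print(generate("VALUE", GRAMMAR))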

I am struggling with the problem of transforming sample text into a grammar I can use with my autocompletion tool. I read a bit about the Sequitur algorithm, but it doesn't solve my problem because it generates deterministic grammars. So I'll refer to the algorithm I'm looking for as nonsequitur. Nonsequitur is intended to create nondeterministic grammars that minimize the decisions the user needs to make to generate the sample text, without removing decisions that are essential complexity. For example, it should make it so the user only needs to make one decision to write an s-expression rather than two decisions to write a left and a right parenthesis.

Ideally nonsequitur will make no assumptions about the sample text, so it can work with JavaScript or poetry or maybe even DNA.
As a test case, it should be able to convert an input like this (d is for delimiter):
()d(())d(()())d((()))d(()(()))
Into a grammar like this:
A:AdA
A:B
B:()
B:(B)
B:BB
I chose this test case because it has paired symbols and allows arbitrary-length lists and arbitrary nesting. I think special rules may be needed to infer whether something can be repeated an arbitrary number of times.
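
As a sanity check (my own sketch, with the target grammar hard-coded as parse functions rather than inferred), the grammar above does derive the sample:

from functools import lru_cache

SAMPLE = "()d(())d(()())d((()))d(()(()))"

@lru_cache(maxsize=None)
def derives_B(s):
    # B:()
    if s == "()":
        return True
    # B:(B)
    if len(s) >= 4 and s[0] == "(" and s[-1] == ")" and derives_B(s[1:-1]):
        return True
    # B:BB -- try every split into two nonempty derivable halves.
    return any(derives_B(s[:i]) and derives_B(s[i:]) for i in range(1, len(s)))

def derives_A(s):
    # A:AdA | A:B is equivalent to a d-separated sequence of Bs,
    # since "d" never appears in a string derived from B.
    return all(derives_B(part) for part in s.split("d"))

assert derives_A(SAMPLE)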

The approach I've been considering for the implementation is to treat this as a search problem and use an algorithm like A*. The cost function would be based on the size of the grammar as well as the size of the input sequence needed to reconstruct the source text. The formula might look something like this...

Grammar size * ( magic number + length of input needed to reconstruct sample text / sample text length )

This formula is intended to avoid degenerate special cases: a grammar with no structure that generates all strings (the reconstruction input would then be about as long as the sample itself), and a deterministic grammar that fully encodes the input sequence (the grammar itself would then be about as large as the sample).
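
In code (my sketch; choices_needed stands for the length of the input sequence of decisions required to re-derive the sample, and MAGIC is the undetermined constant):

MAGIC = 1.0  # the "magic number"; choosing it well is an open question

def grammar_size(grammar):
    # Count each alternative plus its right-hand-side symbols, using the
    # dict-of-alternatives representation from the first sketch.
    return sum(len(alts) + sum(len(alt) for alt in alts)
               for alts in grammar.values())

def cost(grammar, choices_needed, sample_len):
    # The all-strings grammar is tiny but needs roughly one choice per
    # sample character, so the ratio term punishes it; a fully deterministic
    # grammar needs no choices but its size grows with the sample, so the
    # grammar-size factor punishes that case instead.
    return grammar_size(grammar) * (MAGIC + choices_needed / sample_len)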

The search state space is the set of all grammars and input sequences that generate the sample text. I'm not sure what all the transition functions should be. I was thinking replacing duplicate symbol pairs, as in Sequitur, would be one. There will also need to be a transition that creates ambiguous rules somehow, perhaps by merging pairs of existing rules.
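
Here is roughly what those two transitions might look like under the same grammar representation as before (the function names and details are my assumptions, and real versions would also need to track how the reconstruction input changes):

from collections import Counter

def replace_duplicate_pair(grammar, fresh):
    # Sequitur-style move: factor a symbol pair that occurs more than once
    # across right-hand sides into a new rule named `fresh` (assumed not to
    # collide with an existing nonterminal).
    pairs = Counter()
    for alts in grammar.values():
        for alt in alts:
            pairs.update(zip(alt, alt[1:]))
    repeated = [pair for pair, count in pairs.items() if count > 1]
    if not repeated:
        return None  # no applicable move
    target = repeated[0]
    new_grammar = {fresh: [list(target)]}
    for head, alts in grammar.items():
        new_alts = []
        for alt in alts:
            rewritten, i = [], 0
            while i < len(alt):
                if tuple(alt[i:i + 2]) == target:
                    rewritten.append(fresh)
                    i += 2
                else:
                    rewritten.append(alt[i])
                    i += 1
            new_alts.append(rewritten)
        new_grammar[head] = new_alts
    return new_grammar

def merge_rules(grammar, keep, drop):
    # Ambiguity-introducing move: merge nonterminal `drop` into `keep`,
    # unioning their alternatives; this is where nondeterminism enters.
    def rename(symbol):
        return keep if symbol == drop else symbol
    merged = {}
    for head, alts in grammar.items():
        merged.setdefault(rename(head), []).extend(
            [rename(s) for s in alt] for alt in alts)
    return merged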

I hope that by discussing the problem here I can find out if I'm reinventing the wheel, discover flaws in my approach, and gather suggestions for improvements.