Automatically learning grammars from text

I am interested in developing an app that allows users to feed it arbitrary sample text and then write similar text using an auto completion tool. The auto completion tool is based on generating text from a context-free grammar and will be considerably more powerful than t9/search-bar auto complete because it can generate entire phrases or blocks of code with adjustable components. I have a mockup of the tool here that generates json: http://nathanathan.com/cfg-gen/ However, my mockup doesn't offer suggestions for default values at this time.

I am struggling with the problem of transforming sample text into a grammar I can use with my autocompletion tool. I read a bit about the sequitur algorithm, but it doesn't solve my problem because it generates deterministic grammars. So I'll refer to the algorithm I'm looking for as nonsequitur. Nonsequitur is intended to create nondeterministic grammars that minimize the decisions the user needs to make to generate the sample text, without removing decisions that represent essential complexity. For example, it should make it so the user only needs to make one decision to write an s-expression rather than two decisions to write a right and a left parenthesis.

Ideally nonsequitur will make no assumptions about the sample text so it can work with javascript or poetry or maybe even DNA.
As a test case, it should be able to convert an input like this (d is for delimiter):
()d(())d(()())d((()))d(()(()))
Into a grammar like this:
A:AdA
A:B
B:()
B:(B)
B:BB
I chose this test case because it has paired symbols and it can have arbitrary length lists/nesting. I think special rules may be needed to infer if something can be repeated an arbitrary number of times.
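
As a quick sanity check on that test grammar (this sketch is mine, not part of the mockup), a few lines of Python can randomly expand rules A and B and print strings that should resemble the delimited sample above:

    import random

    # Test grammar from above: A is a d-delimited list of Bs, B is nested parens.
    GRAMMAR = {
        "A": [["A", "d", "A"], ["B"]],
        "B": [["(", ")"], ["(", "B", ")"], ["B", "B"]],
    }

    def expand(symbol, depth=0, max_depth=6):
        if symbol not in GRAMMAR:
            return symbol  # terminal symbol
        options = GRAMMAR[symbol]
        if depth >= max_depth:
            options = [min(options, key=len)]  # force a short production so expansion terminates
        return "".join(expand(s, depth + 1, max_depth)
                       for s in random.choice(options))

    for _ in range(5):
        print(expand("A"))  # e.g. ()d(()) or ((()))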

The approach I've been considering for implementation is to treat the problem as a search problem and use an algorithm like A*. The cost function would be based on the size of the grammar as well as the size of the input sequence needed to reconstruct the source. The formula might look something like this...

Grammar size * ( magic number + length of input needed to reconstruct sample text / sample text length )

This formula is intended to steer the search away from degenerate special cases, such as a grammar with no structure (one that generates all strings) and deterministic grammars that fully encode the input sequence.
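
For concreteness, the cost might look like this in code (just a sketch of the formula above; the parameter names and the default magic number are placeholders I made up):

    def grammar_cost(grammar_size, reconstruction_input_length, sample_text_length,
                     magic_number=1.0):
        """Lower is better: prefer small grammars that let the user reconstruct
        the sample text with a short decision sequence."""
        return grammar_size * (magic_number +
                               reconstruction_input_length / sample_text_length)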

The search state space is the set of all grammars and input sequences that generate the sample text. I'm not sure what all the transition functions should be. I was thinking that replacing duplicate symbol pairs, as in sequitur, would be one. There will also need to be a transition that creates ambiguous rules somehow, perhaps by merging pairs of existing rules.
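
To make that concrete, here is a rough sketch of what those two transitions might look like, assuming a grammar is represented as a dict from nonterminal to a list of productions (each a list of symbols); both functions are illustrative, not a worked-out design:

    from collections import Counter

    def replace_duplicate_pair(grammar, new_symbol):
        """Sequitur-style move: factor a symbol pair that occurs more than
        once into a new rule named new_symbol."""
        counts = Counter()
        for prods in grammar.values():
            for prod in prods:
                counts.update(zip(prod, prod[1:]))
        repeated = [pair for pair, n in counts.items() if n > 1]
        if not repeated:
            return grammar
        pair = repeated[0]
        result = {new_symbol: [list(pair)]}
        for lhs, prods in grammar.items():
            result[lhs] = []
            for prod in prods:
                rewritten, i = [], 0
                while i < len(prod):
                    if tuple(prod[i:i + 2]) == pair:
                        rewritten.append(new_symbol)
                        i += 2
                    else:
                        rewritten.append(prod[i])
                        i += 1
                result[lhs].append(rewritten)
        return result

    def merge_rules(grammar, keep, drop):
        """Ambiguity-introducing move: merge nonterminal `drop` into `keep`,
        pooling their productions and renaming occurrences."""
        merged = {}
        for lhs, prods in grammar.items():
            target = keep if lhs == drop else lhs
            for prod in prods:
                merged.setdefault(target, []).append(
                    [keep if s == drop else s for s in prod])
        return merged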

I hope that by discussing the problem here I can find out if I'm reinventing a wheel, discover flaws in my approach and gather suggestions for improvements.


literature pointer(?): Gold's theorem

People on LtU whose recollection is fresher than mine should speak up if I'm wrong here, but my suggestion is to begin your literature search with "Gold's theorem" and maybe start by actually reading the paper in which it was first published (then following things that cite it).

I think there are long-established fundamental limits on what you are trying to do that will give you some hint on possibly useful things you can do instead.

NP-complete problem with approximations

http://www.cs.virginia.edu/~shelat/papers/GrammarIEEE.pdf

Updates on my attempts at grammar induction

Since I first posted this problem I learned that what I'm trying to do falls into the category of "Grammar induction." Also, I tried to find some additional help with it on the compsci reddit: http://www.reddit.com/r/compsci/comments/1yiyem/learning_grammars_from_text_samples/

I'm struggling to come up with some grammar transformation heuristics that I think are likely enough to work to be worth implementing. I think the heuristics I proposed on reddit have problems. In particular, I think my approach to correlating strings to find categories, like nouns or names, is poor.

I have an idea for a better heuristic, but it's not without its problems. I hope some of the smart people here can help me figure out if there's a way to make it work. Here it is:

Start by just focusing on chunking the grammar into common patterns: every repeating string is replaced with a rule in the grammar. Then build a graph with three node sets: one representing the symbols, and two additional sets representing left-hand-side and right-hand-side adjacency to each symbol. Connect every symbol node to the LHS and RHS nodes of every symbol it neighbors somewhere (including symbols that its neighboring symbols produce). In the resulting graph, two symbol nodes sharing a lot of neighbors corresponds to them being correlated in where they appear, so they can probably be combined into a single rule with multiple productions. As rules are combined, the graph can be rebuilt and stronger correlations may be found. Here's an example that illustrates the process:

Initial grammar:

R -> delicious pizza
R -> delicious salad
R -> good salad
R -> good pizza
R -> good dog
R -> edible salad
R -> Caesar salad

Repeating strings made into rules:

D -> delicious
G -> good
P -> pizza
S -> salad
R -> D P
R -> D S
R -> G S
R -> G P
R -> G dog
R -> edible S
R -> Caesar S

The graph:

P_lhs : D, G
S_lhs : D, G, edible, Caesar
D_rhs : P, S
G_rhs : P, S, dog
dog_lhs : G
edible_rhs : S
Caesar_rhs : S

Now from this graph I would like to be able to infer that pizza can be said to be edible, since salad is edible and they are both delicious and good. But how do I do this without inferring that edible can come before dog, or that Caesar can come before pizza (assuming I have a much larger set of sample phrases)? I would like to be able to generate categories that can share sub-categories of common words like good, while maintaining some words that are unique to each category.
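
Here is a small sketch of how I imagine computing those correlations (my own illustration of the heuristic above, not an established algorithm): build left/right neighbor sets for each symbol and score symbol pairs by how much their neighborhoods overlap.

    from collections import defaultdict
    from itertools import combinations

    def build_adjacency(productions):
        """lhs[x] = symbols seen immediately before x; rhs[x] = symbols seen
        immediately after x, across all productions."""
        lhs, rhs = defaultdict(set), defaultdict(set)
        for prod in productions:
            for a, b in zip(prod, prod[1:]):
                rhs[a].add(b)
                lhs[b].add(a)
        return lhs, rhs

    def overlap_scores(lhs, rhs):
        """Score each symbol pair by the Jaccard overlap of its left and right
        neighbor sets; high-scoring pairs are candidates for merging."""
        symbols = sorted(set(lhs) | set(rhs))
        scores = {}
        for a, b in combinations(symbols, 2):
            shared = len(lhs[a] & lhs[b]) + len(rhs[a] & rhs[b])
            total = len(lhs[a] | lhs[b]) + len(rhs[a] | rhs[b])
            if total:
                scores[(a, b)] = shared / total
        return scores

    # The example rules above, with R's right-hand sides as productions.
    prods = [["D", "P"], ["D", "S"], ["G", "S"], ["G", "P"],
             ["G", "dog"], ["edible", "S"], ["Caesar", "S"]]
    lhs, rhs = build_adjacency(prods)
    print(sorted(overlap_scores(lhs, rhs).items(), key=lambda kv: -kv[1])[:3])
    # (D, G) and (P, S) come out near the top, suggesting delicious/good and
    # pizza/salad as merge candidates.

Merging would then presumably only happen above some score threshold, which is where the over-generalization problem in the question above would have to be handled.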

Not solvable in general.

Grammar induction is also known by the name "Grammar Inference". You can find articles and other resources under both names. In particular there is a very new textbook that covers the subject well:

Grammatical Inference: Learning Automata and Grammars. Colin de la Higuera, Cambridge University Press, 2010.

There are two parts to any learning problem:

  • The algorithm.
  • The set of data in the domain.

In your initial post and then in your follow-up you've given some details of the algorithm that you are interested in designing. Unfortunately no-one can offer you any advice without knowing the range of data in the domain that you are interested in. Given the jump from a simple bracket matching language to a form of natural language processing it seems as if you want to try to find a solution in general, for any kind of data.

The problem is that what you want to do is not possible in general. Gold's Theorem (as Thomas Lord pointed to earlier) specifically states that you cannot learn CFGs from positive examples only. Your newer example looks like a context-sensitive language, which would be a more difficult learning problem than CFGs. In order to make any progress it is necessary to restrict the set of languages to something tractable, and then try and define an algorithm that works in the special case of that set. This is a basic consequence of the No Free Lunch theorem(s?) in Machine Learning. If your goal is to write an algorithm that works equally well on any arbitrary input language (e.g. code, poetry, English etc.) then it would be best to try and solve a single problem first.

As a starting point take a look at this recent survey. It gives a good overview of the different approaches currently in use for natural language (context-sensitive languages).

Perhaps you should try to estimate the size of the search space for your problem, and work out how large the set of valuable responses would be, to gain a rough understanding of how difficult this search problem is.

ADIOS algorithm

I'm not looking for an algorithm that can build context-sensitive grammars, but rather one that builds CFGs with rules specific to the context they are used in. For example:

statement -> this [noun] is [general adjective] | this [food] is [food adjective]
general adjective -> good | bad
food adjective -> delicious | gross | [general adjective]

The survey you linked to is fantastic. It mentions the ADIOS (Automatic DIstillation Of Structure) algorithm (Solan et al. 2005), which has some similarities to the heuristic I proposed, but I think it will produce different symbol equivalency classes. As far as I can tell, ADIOS does not have a way to merge classes that are used in different grammatical structures. For example, I believe it wouldn't be able to form a single adjective class if it had phrases of the form "[adjective] [noun]" and "the [noun] is [adjective]".