User loginNavigation |
Automatically learning grammars from textI am interested in developing an app that allows users to feed it arbitrary sample text then write similar text using an auto completion tool. The auto completion tool is based on generating text from a context free grammar and will be considerably more powerful than t9/search-bar auto complete because it can generate entire phrases or blocks of code with adjustable components. I have a mockup of the tool here that generates json: http://nathanathan.com/cfg-gen/ However my mockup doesn't offer suggestions for default values at this time. I am struggling with the problem of transforming sample text into into a grammar I can use with my autocompletion tool. I read a bit about the sequitur algorithm but it doesn't solve my problem because it generates deterministic grammars. So I'll refer to the algorithm I'm looking for as nonsequitur. Nonsequitur is Intended to create nondeterministic grammars that minimize the decisions the user needs to make to generate the sample text, without removing decisions that are essential complexity. For example, it should make it so the user only needs to make one decision to write an s-expression rather than two decisions to write a right and left parenthesis. Ideally nonsequitur will make no assumptions about the sample text so it can work with javascript or poetry or maybe even DNA. The approach I've been considering for implementation is to treat the problem as a search problem and use an algorithm like A*. The cost function would be based on the size of the grammar as well as the size of the input sequence needed to reconstruct the source. The formula might look something like this... Grammar size * ( magic number + length of input needed to reconstruct sample text / sample text length ) This formula is intended to avoid the special cases such as the grammer containing no structure because it generates all strings and deterministic grammars that fully encode the input sequence. The search state space is the set of all grammars and input sequences that generate the sample text. I'm not sure what all the transition functions should be. I was thinking replacing duplicate symbol pairs like in sequitur would be one. There will need also be a transition that creates ambiguous rules somehow, perhaps by merging pairs of existing rules. I hope that by discussing the problem here I can find out if I'm reinventing a wheel, discover flaws in my approach and gather suggestions for improvements. By nabreit at 2013-11-26 06:42 | LtU Forum | previous forum topic | next forum topic | other blogs | 6354 reads
|
Browse archives
Active forum topics |
Recent comments
22 weeks 6 days ago
22 weeks 6 days ago
22 weeks 6 days ago
45 weeks 23 hours ago
49 weeks 2 days ago
51 weeks 2 hours ago
51 weeks 2 hours ago
1 year 1 week ago
1 year 6 weeks ago
1 year 6 weeks ago