Literature on recovering grammars?

Hi, for a while now I've been interested in how one can generate a grammar from a corpus of source code. I have a vague idea as to how I might go about it, and I was hoping to get pointers to more info.

The basic idea is to find the smallest "regular expression" that accepts the samples in our corpus, where "regular expression" is used loosely to mean something with the power of a context-free grammar.

We would start by finding the strings that occur over and over again, and we would match those with a simple string matcher. Then we would attempt to find a regexp that characterises the stuff that occurs between the commonly occurring substrings. I am much less certain how to do this part; the ideas that I have are too vague to describe.
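The first step, at least, is easy to sketch. Here is a minimal (and deliberately naive) pass that counts fixed-length substrings of a corpus and keeps the ones that repeat; the repeating chunks are candidates for fixed-string matchers, and the variable material between them is what the harder induction step would have to characterise:

```python
from collections import Counter

def frequent_substrings(corpus, length, min_count=2):
    """Count every substring of a fixed length and keep the common ones.

    A naive O(n * length) pass; a real system would use a suffix
    array or suffix tree to find repeats of all lengths at once.
    """
    counts = Counter(corpus[i:i + length]
                     for i in range(len(corpus) - length + 1))
    return {s: c for s, c in counts.items() if c >= min_count}

# Toy "source code" corpus: the literal chunks that repeat become
# candidates for fixed-string matchers.
sample = "if (x) { f(); } if (y) { g(); } if (z) { f(); }"
common = frequent_substrings(sample, 4)
```

Here `"if ("` and `") { "` each show up three times, while the condition variables between them do not, which is exactly the split between "string matcher" and "harder part" described above.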

Another thing I think is important is that we would address the complexity issue by first running our grammar recoverer on a small sample. The idea is that the small sample would do the bulk of the work of discovering the grammar, while later iterations would do less work because they are only refining what was done before. I think this could be done by running the recovered grammar on more samples, seeing where it fails, and doing only enough work to refine it into a working one.

Anyway, I know this is all vague, but can anyone point me to research on how to recover a grammar for a language from samples of text in that language? I am going to a nearby college to check out some books on formal languages, but I'm doubtful that I'll find what I'm looking for.


Grammatical inference

It's called "grammar induction" or "grammatical inference".
  • Feldman, Jerome A., James Gips, James J. Horning, and Stephen Reder. 1969. Grammatical complexity and inference. Tech. Rep. 125, Department of Computer Science, Stanford University.
  • Horning, James Jay. 1969. A study of grammatical inference. Ph.D. thesis, Department of Computer Science, Stanford University. Also available as Tech.Rep. CS 139, Department of Computer Science, Stanford University.
  • Horning, James Jay. 1971. A procedure for grammatical inference. In Information Processing '71: Proceedings of the IFIP congress, ed. C. V. Freiman, John E. Griffith, and J. L. Rosenfeld, vol. 1, 519-523. Amsterdam: North-Holland.
  • Stolcke, Andreas. 1994. Bayesian learning of probabilistic language models. Ph.D. thesis, University of California at Berkeley.
  • Stolcke, Andreas, and Stephen M. Omohundro. 1993. Hidden Markov model induction by Bayesian model merging. In Advances in neural information processing systems 5 (1992), ed. Stephen Jose Hanson, Jack D. Cowan, and C. Lee Giles, 11-18. San Francisco: Morgan Kaufmann.
  • Stolcke, Andreas, and Stephen M. Omohundro. 1994. Inducing probabilistic grammars by Bayesian model merging. In ICGI-94: 2nd international colloquium on grammatical inference and applications, ed. Rafael C. Carrasco and José Oncina, 106-118. Lecture Notes in Computer Science 862, Berlin: Springer-Verlag.
Check out Sequitur as well.
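Sequitur builds its grammar online, one symbol at a time; a rough offline cousin (a Re-Pair-style pass, shown here as an illustrative sketch rather than Sequitur itself) conveys the flavour by repeatedly replacing the most frequent adjacent pair with a fresh nonterminal:

```python
from collections import Counter

def repair(tokens):
    """Re-Pair-style grammar induction by digram replacement.

    Repeatedly replace the most frequent adjacent pair with a new
    nonterminal until no pair repeats. Returns the compressed
    top-level sequence and the induced rules. (The Counter over
    zip(seq, seq[1:]) counts overlapping pairs, which is fine for
    a sketch but not for production use.)
    """
    rules = {}
    seq = list(tokens)
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        nt = f"R{next_id}"
        next_id += 1
        rules[nt] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

seq, rules = repair(list("abcabcabc"))
```

On `abcabcabc` this induces three rules (roughly R0 -> ab, R1 -> R0 c, R2 -> R1 R1) and a two-symbol top-level string, i.e. a small hierarchical grammar for the repetition.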

See also de Marcken

Carl de Marcken also did some good work on this:

@InProceedings{dema95,
author = {Carl de Marcken},
title = {On The Unsupervised Induction Of Phrase-Structure Grammars},
booktitle = {SIGDAT},
year = 1995,
organization = {ACM}
}

@techreport{dema95b,
author = {Carl de Marcken},
title = {The Unsupervised Acquisition of a Lexicon from Continuous Speech},
institution = {MIT},
year = 1995,
number = {AIM-1558},
url = {citeseer.nj.nec.com/demarcken96unsupervised.html}
}

Syntactic Pattern Recognition

You may be able to find a copy of the book Syntactic Pattern Recognition by Gonzalez and Thomason; Chapter 6 covers grammatical inference.

Hard Problem?

I'm not an expert, but from reading about learning automata from examples I would conclude that this is difficult. Learned automata tend to be much more complex than one would like them to be. If I understand the article below correctly, learning the minimal automaton requires interaction between a learner and a teacher that confirms an automaton or provides a counterexample.


R. L. Rivest and R. E. Schapire. Inference of finite automata using homing sequences. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pp. 411-420, Seattle, Washington, May 14-17, 1989.
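The complexity blow-up is easy to see concretely. The baseline automaton that exactly accepts a positive sample is the prefix-tree acceptor (PTA): one state per distinct prefix. State-merging learners (RPNI, ALERGIA, and friends) start from it and generalise by collapsing states, and deciding which states to merge is exactly where the teacher or counterexamples come in. A minimal sketch of the PTA construction:

```python
def prefix_tree_acceptor(samples):
    """Build the prefix-tree acceptor for a set of accepted strings.

    States are ints; state 0 is the start. One state is created per
    distinct prefix in the sample, so the PTA grows with the sample
    even when the target language has a tiny DFA.
    """
    trans = {}          # (state, symbol) -> state
    accepting = set()
    n_states = 1
    for s in samples:
        state = 0
        for ch in s:
            if (state, ch) not in trans:
                trans[(state, ch)] = n_states
                n_states += 1
            state = trans[(state, ch)]
        accepting.add(state)
    return trans, accepting, n_states

trans, accepting, n = prefix_tree_acceptor(["ab", "abab", "ababab"])
```

Three samples from (ab)+ already give a 7-state chain, where the minimal DFA for the language needs only 3 states; without merging (and something to validate merges), the learned machine just keeps tracking the sample.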

Google also finds Learning of Context-Free Languages: A Survey of the Literature, a technical report from Harvard.

Not So Hard

There are a bunch of recent papers on this topic describing methods that perform quite well. A very good two-part survey is:

@article{thollard05:pfa,
author = {Enrique Vidal and Franck Thollard and Colin de la Higuera and Francisco Casacuberta and Rafael C. Carrasco},
title = {Probabilistic Finite-State Machines -- Part {I}},
journal = {IEEE Trans. Pattern Anal. Mach. Intell.},
volume = {27},
number = {7},
year = {2005},
issn = {0162-8828},
pages = {1013--1025},
doi = {10.1109/TPAMI.2005.147},
publisher = {IEEE Computer Society},
address = {Washington, DC, USA},
}

@article{thollard05:pfa2,
author = {Enrique Vidal and Franck Thollard and Colin de la Higuera and Francisco Casacuberta and Rafael C. Carrasco},
title = {Probabilistic Finite-State Machines -- Part {II}},
journal = {IEEE Trans. Pattern Anal. Mach. Intell.},
volume = {27},
number = {7},
year = {2005},
issn = {0162-8828},
pages = {1026--1039},
doi = {10.1109/TPAMI.2005.148},
publisher = {IEEE Computer Society},
address = {Washington, DC, USA},
}

The second part deals with the issue of learning structure. It doesn't actually say much about the issue itself, but it does give pointers into the literature.
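For the probabilistic case, the same prefix-tree idea carries transition counts, and normalising the counts at each state yields a (tree-shaped) probabilistic finite-state machine; structure learning then merges states whose outgoing distributions look similar. A hedged sketch (the end-marker convention and naming are mine, not the survey's):

```python
from collections import defaultdict

def frequency_pta(samples):
    """Prefix tree with transition counts; normalising the counts at
    each state gives a probabilistic FSA. The symbol "$" marks
    end-of-string so a stopping probability is estimated too.
    """
    counts = defaultdict(lambda: defaultdict(int))  # state -> symbol -> count
    trans = {}                                      # (state, symbol) -> state
    n_states = 1
    for s in samples:
        state = 0
        for ch in s + "$":
            counts[state][ch] += 1
            if ch != "$":
                if (state, ch) not in trans:
                    trans[(state, ch)] = n_states
                    n_states += 1
                state = trans[(state, ch)]
    probs = {q: {ch: c / sum(cs.values()) for ch, c in cs.items()}
             for q, cs in counts.items()}
    return trans, probs

trans, probs = frequency_pta(["ab", "ab", "aa"])
```

With the sample above, the start state emits "a" with probability 1, and its successor emits "b" with probability 2/3 and "a" with probability 1/3, so relative frequencies in the corpus become transition probabilities directly.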

Learning of timed systems

This is done, for example, in:

@InProceedings{GrinchteinJL04,
author = {Olga Grinchtein and Bengt Jonsson and Martin Leucker},
title = {Learning of Event-Recording Automata},
booktitle = {Proceedings of the Joint Conferences {FORMATS} and {FTRTFT}},
year = 2004,
series = {Lecture Notes in Computer Science},
month = sep,
volume = "3253",
}