archives

Literature on recovering grammars?

Hi, for awhile now I've been interested in how one can generate a grammar from a corpus of source code. I have a vague idea as to how I might go about it and I was hoping to get pointers to more info.

The basic idea is that we try to find the smallest "regular expression" that can accept samples from our corpus, where "regular expression" actually means "can parse a context-free grammar".

We would start by finding the strings that occur over and over again, and we would match those by a simple string matcher. Then we would attempt to find a regexp that characterises the stuff that occurs between the commonly occuring substrings. I am much less certian how to do this part. the ideas that I have are too vague to describe.

Another thing that I think is important, is that we would address the complexity issue by first running our grammar recoverer on a small sample. The idea is that the small sample would do the bulk of the work in discovering the grammar, while later iterations would do less work because they are only refining the work that was done before. I think this could be done by running the recoverer and seeing how it fails, doing only enough work to refine the recoverer into a working one.

Anyways, I know this is all vague, but can anyone point me to research on how to recover a grammar for a language from samples of text in that language? I am going to a nearby college to check out some books on formal languages but I'm doubtful that I'll find what I'm looking for.

Dao, a new programming language

Hi,

I started to read LtU regularly since two or three months ago, and found it is very interesting. I am also very interested in programming languages, and create one named Dao. It is still under active developing, to further improve it, I would like to bring it to the attention of LtU users, and to see if I can get some feedbacks from you;-)

Though no stable version of this language has been made available yet, a number of interesting features have been implemented in this language. In particular, IMHO, there following could be interesting for LtU users:

1. Concurrency: coroutine, thread, asynchronous function call and message passing interface;
2. A macro system that allow defining new syntax in a similar way as writing a BNF-like syntax description;
3. Mixed programming with C language, allowing embedding C codes in Dao scripts.

Dao is an object-oriented language with rich data types and syntax. Though it is not designed to be functional, it supports first class functions, so it may allow certain kind of functional programming. Other features supported include exception handling, numeric array, regular expression matching and automatic memory management etc. It also has a clean interface to C language, and is extendable and embeddable.

The implementation of the language and the mentioned features is by no means perfect, there is still much to be improved. The documentation is still priliminary, but should be enough to get a clear idea of the language. It would nice if you can make some comments, critiques and suggestions on this language. Thank you!

links:
Dao Language Site
SourceForge project for the language

The Computer Revolution Hasn't Happened Yet

Google video recording of Alan Kay's OOPSLA 1997 keynote. Very good!