Lambda the Ultimate

Deriving a grammar from source
started 9/28/2003; 4:32:57 AM - last post 9/29/2003; 6:25:18 AM

andrew cooke - Deriving a grammar from source

9/28/2003; 4:32:57 AM (reads: 262, responses: 4)

Is this possible? Has anyone done it?

I'm not thinking so much of natural languages (I know that there do exist human-generated attempts at formal grammars for English) as for reverse-engineering source code. Source code should be easier because you can make some useful assumptions about syntax (matching braces, for example).

My motivation is the need to pretty-print code written in an obscure language (the scripting language for IRAF, CL). I found GPP on the 'net, but if I understand correctly it's not as generic as you might think because it require a context free grammar for the language in question.

This raises another practical issue - it's not clear to me that CL has a useful context free grammar. It was designed back when such things were (I suspect) the concerns only of quiche-eaters and is implemented in yacc with a fair amount of behind-the-scenes management of state. But maybe an approximate grammar is possible (I don't have a good explanation for what "approximate grammar" means).

I suspect that the hardest part of the problem isn't generating *a* grammar, but generating one that is both compact and reflects the semantics of the language.

Of course there are probably simpler solutions to the problem (I'm going to try vgrind with each language in turn and see what happens). But I thought the problem was interesting and someone here might be tickled by it.

Sorry for not posting or reading here recently - new job

Neel Krishnaswami - Re: Deriving a grammar from source

9/28/2003; 7:57:51 AM (reads: 257, responses: 1)

It's possible, but likely not worth the effort. You may have heard of hidden Markov models, in which the state of a system is assumed to be controlled by a probabalistic finite-state automaton, and the problem is to infer a state model from a series of observations. There's also something called "hidden Markov grammars", in which the control system is assumed to be a pushdown automaton of some kind, and the problem is to learn the grammar from the observations. As you can no doubt guess, doing this is horrendously complicated.

You are almost certainly better off building a grammar by hand, starting with a simple subset and then adding cases to handle the corner cases and rarely-used features. Write a grammar that can parse one of your documents, and then incrementally extend it as needed to parse the rest.

Ehud Lamm - Re: Deriving a grammar from source

9/28/2003; 10:11:06 AM (reads: 246, responses: 0)

I explored a similar question when I was writing file format convertors, between various wordprocessor formats. I wanted to deduce from the document layout, which elements constituted headings, new paragraphs etc. I ended up inventing ad hoc techniques, but a few years later on I discovered that quite a lot of work was done along these lines.

For what it's worth, I agree with Neel: Unless you are interested in machine learning for its own sake, don't go there...

andrew cooke - Re: Deriving a grammar from source

9/28/2003; 2:12:28 PM (reads: 230, responses: 0)

Thanks guys. I'd underestimated the GPP and overestimated the task - I was assuming it was a CFG with constraints (for whatever parsing they're doing). In fact it says *any* CFG. So doing something by hand that vaguely works shouldn't be so hard (I was expecting to lose days in a morass of shift/reduce conflicts and the like...).

That makes sense doesn't it? I'm pretty tired...!

[added later. hmmmm - you have to install too many things. xemacs(!).] [and even later - just heard from the author; this is now part of Stratego, so install may be simpler]

Neel Krishnaswami - Re: Deriving a grammar from source

9/29/2003; 6:25:18 AM (reads: 189, responses: 0)

It's worth noting as a language design thing, that the names of the productions that give human readers big clues about how the grammar is structured. Even if you can automatically infer a grammar, it will likely be very hard to grasp because the nonterminals will be named things like "X@1245".

But I'm glad that Andrew found a good alternative. :)