Pathological Problems in Parsing

Hello LtU!

I have composed an algorithm for a garbage-collected, LL-derived parser that extracts parse forests for a language with dynamic syntax. I'm not really sure how to go about calculating its time complexity, though, as I have no formal training. So far all my tests seem to indicate that the algorithm is O(n^2) in the length of the input in the worst case and linear in the best case.

Since GLR is stated to take O(n^3) in the worst case, I find my result surprising. I have been unable to locate any examples that actually trigger the O(n^3) behaviour in GLR, though.

So my question is: do readers of LtU have any examples of really pathological grammars?

Take a look at GLL

I'd suggest looking at Elizabeth Scott and Adrian Johnstone's series of papers on their "GLL" method of parsing. It sounds closest to what you're doing, and they tend to include good example grammars in their papers.

Also, if you're really interested in the topic, I'd recommend Jay Earley's 1970 CACM paper on parsing context-free grammars. I think it will be accessible even without formal training and should give you some deeper insight into the issues.

Also, regarding GLR: O(n^3) isn't quite right -- it depends on whether the parser is building a parse forest or not, since the former can take O(n^k) for unbounded (but grammar-specific) k. Again, Scott and Johnstone have some papers on this topic (including how to get O(n^3) GLR parsing).

hmmm

GLL does indeed look close to what I've implemented, but I really can't work out the paper; I don't have any formal training in parser generation. Is there an easier introduction?

I managed to work out that my algorithm is cubic time in the worst case (I think).

I also tested their "highly ambiguous grammar". In my parser the grammar is constructed dynamically and there is no "parser generation" step: the parser accepts a grammar and a text as input and returns a generator which yields the ASTs.


import re  # Choice and expr come from my parser (not shown here)

b = re.compile("b")
End = re.compile("$")
S = Choice()
S.append(expr(b, name="S1"))
S.append(expr(S, S, name="S2"))
S.append(expr(S, S, S, name="S3"))
outer = Choice(expr(S, End, name="Grammar"))

for i in outer.parse(text):  # text is the input string
    print(i)

gives me the output

text="b":
[(S1, ['b']), $]

text="bb"
[(S2, [(S1, ['b']), (S1, ['b'])]), $]

text="bbb"
[(S3, [(S1, ['b']), (S1, ['b']), (S1, ['b'])]), $]
[(S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), $]
[(S2, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])]), $]

text="bbbb"
[(S3, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b']), (S1, ['b'])]), $]
[(S3, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), $]
[(S3, [(S1, ['b']), (S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])]), $]
[(S2, [(S3, [(S1, ['b']), (S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), $]
[(S2, [(S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), (S1, ['b'])]), $]
[(S2, [(S2, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])]), (S1, ['b'])]), $]
[(S2, [(S1, ['b']), (S3, [(S1, ['b']), (S1, ['b']), (S1, ['b'])])]), $]
[(S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S2, [(S1, ['b']), (S1, ['b'])])]), $]
[(S2, [(S1, ['b']), (S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])])]), $]
[(S2, [(S1, ['b']), (S2, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])])]), $]

text="bbbbb"
[(S3, [(S3, [(S1, ['b']), (S1, ['b']), (S1, ['b'])]), (S1, ['b']), (S1, ['b'])]), $]
[(S3, [(S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), (S1, ['b']), (S1, ['b'])]), $]
[(S3, [(S2, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])]), (S1, ['b']), (S1, ['b'])]), $]
[(S3, [(S1, ['b']), (S3, [(S1, ['b']), (S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), $]
[(S3, [(S2, [(S1, ['b']), (S1, ['b'])]), (S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), $]
[(S3, [(S1, ['b']), (S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), (S1, ['b'])]), $]
[(S3, [(S1, ['b']), (S2, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])]), (S1, ['b'])]), $]
[(S3, [(S1, ['b']), (S1, ['b']), (S3, [(S1, ['b']), (S1, ['b']), (S1, ['b'])])]), $]
[(S3, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])]), $]
[(S3, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])]), (S2, [(S1, ['b']), (S1, ['b'])])]), $]
[(S3, [(S1, ['b']), (S1, ['b']), (S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])])]), $]
[(S3, [(S1, ['b']), (S1, ['b']), (S2, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])])]), $]
[(S2, [(S3, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), $]
[(S2, [(S3, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), (S1, ['b'])]), $]
[(S2, [(S3, [(S1, ['b']), (S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])]), (S1, ['b'])]), $]
[(S2, [(S2, [(S3, [(S1, ['b']), (S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), (S1, ['b'])]), $]
[(S2, [(S2, [(S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), (S1, ['b'])]), (S1, ['b'])]), $]
[(S2, [(S2, [(S2, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])]), (S1, ['b'])]), (S1, ['b'])]), $]
[(S2, [(S2, [(S1, ['b']), (S3, [(S1, ['b']), (S1, ['b']), (S1, ['b'])])]), (S1, ['b'])]), $]
[(S2, [(S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S2, [(S1, ['b']), (S1, ['b'])])]), (S1, ['b'])]), $]
[(S2, [(S2, [(S1, ['b']), (S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])])]), (S1, ['b'])]), $]
[(S2, [(S2, [(S1, ['b']), (S2, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])])]), (S1, ['b'])]), $]
[(S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S3, [(S1, ['b']), (S1, ['b']), (S1, ['b'])])]), $]
[(S2, [(S1, ['b']), (S3, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b']), (S1, ['b'])])]), $]
[(S2, [(S1, ['b']), (S3, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])])]), $]
[(S2, [(S1, ['b']), (S3, [(S1, ['b']), (S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])])]), $]
[(S2, [(S3, [(S1, ['b']), (S1, ['b']), (S1, ['b'])]), (S2, [(S1, ['b']), (S1, ['b'])])]), $]
[(S2, [(S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), (S2, [(S1, ['b']), (S1, ['b'])])]), $]
[(S2, [(S2, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])]), (S2, [(S1, ['b']), (S1, ['b'])])]), $]
[(S2, [(S1, ['b']), (S2, [(S3, [(S1, ['b']), (S1, ['b']), (S1, ['b'])]), (S1, ['b'])])]), $]
[(S2, [(S1, ['b']), (S2, [(S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])]), (S1, ['b'])])]), $]
[(S2, [(S1, ['b']), (S2, [(S2, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])]), (S1, ['b'])])]), $]
[(S2, [(S1, ['b']), (S2, [(S1, ['b']), (S3, [(S1, ['b']), (S1, ['b']), (S1, ['b'])])])]), $]
[(S2, [(S1, ['b']), (S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S2, [(S1, ['b']), (S1, ['b'])])])]), $]
[(S2, [(S1, ['b']), (S2, [(S1, ['b']), (S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])])])]), $]
[(S2, [(S1, ['b']), (S2, [(S1, ['b']), (S2, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])])])]), $]
[(S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S1, ['b'])])]), $]
[(S2, [(S2, [(S1, ['b']), (S1, ['b'])]), (S2, [(S1, ['b']), (S2, [(S1, ['b']), (S1, ['b'])])])]), $]

Does this look right? Anything longer starts to produce a ridiculous number of ASTs.
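
As a sanity check on those counts: a quick dynamic-programming count of the derivations of b^n under S -> b | S S | S S S (illustrative code only, not part of my parser) gives 1, 1, 3, 10 and 38 trees for n = 1..5, which matches the listings above:

from functools import lru_cache

# Count parse trees of b^n for S -> b | S S | S S S by splitting
# the n characters among two or three S children.
@lru_cache(maxsize=None)
def trees(n):
    total = 1 if n == 1 else 0                            # S1: a single 'b'
    total += sum(trees(i) * trees(n - i)                  # S2: two children
                 for i in range(1, n))
    total += sum(trees(i) * trees(j) * trees(n - i - j)   # S3: three children
                 for i in range(1, n - 1)
                 for j in range(1, n - i))
    return total

print([trees(n) for n in range(1, 6)])   # [1, 1, 3, 10, 38]
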
My complete parser is 85 lines of Python, but the only implementation of GLL I can find is in Scala, with which I am unfamiliar.

And replying to the comment below: yes, my parser is grounded in packrat parsing in that it uses memoization, but it can definitely handle left recursion, as shown above. It can even handle cases that would produce an infinite number of parses, such as:

S: S | "b"

from which the generator will yield:

>>> import re
>>> b = re.compile("b")
>>> End = re.compile("$")
>>> S = Choice()
>>> S.append(expr(b, name="S1"))
>>> S.append(expr(S, name="S2"))
>>> outer = Choice(expr(S, End, name="Grammar"))
>>> l0 = outer.parse("b")
>>> l0
<generator object ...>
>>> l0.__next__()
[(S1, ['b']), '']
>>> l0.__next__()
[(S2, [(S1, ['b'])]), '']
>>> l0.__next__()
[(S2, [(S2, [(S1, ['b'])])]), '']
>>> l0.__next__()
[(S2, [(S2, [(S2, [(S1, ['b'])])])]), '']
>>> l0.__next__()
[(S2, [(S2, [(S2, [(S2, [(S1, ['b'])])])])]), '']
>>> l0.__next__()
[(S2, [(S2, [(S2, [(S2, [(S2, [(S1, ['b'])])])])])]), '']
>>> l0.__next__()
[(S2, [(S2, [(S2, [(S2, [(S2, [(S2, [(S1, ['b'])])])])])])]), '']

ad infinitum...

It should be noted that at this point all the actual parsing is complete; it's just the AST extraction that gets caught in the infinite loop.
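
The shape of that infinite stream is easy to mimic. Once the recognizer has established that "b" matches, extraction just wraps the same subtree in one more S2 node per request; a toy generator (illustrative only, not my parser's actual extraction code) behaves identically:

def s_derivations():
    # For S -> S | "b" on input "b", every tree is (S1, ['b'])
    # wrapped in zero or more S2 nodes; yield them lazily.
    tree = ('S1', ['b'])
    while True:
        yield tree
        tree = ('S2', [tree])

Each __next__() call adds one more S2 wrapper, exactly as in the transcript above, and the stream never ends.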

The GLL paper

With all due respect to S&J, the problem isn't your lack of formal training; it's a tough paper. I've been told that their more recent paper (Modelling GLL ...) has a somewhat better overview.

Here's (part of) a summary I wrote a while back after reading it, FWIW:

* GLL is LL parsing with support for ambiguity and left-recursion and relatively efficient operation.
* These features are achieved by replacing the (implicit) call stack of a RD (recursive-descent) parser with an explicit GSS-style structure (note: GSS, graph-structured stack, is the data structure used in the GLR algorithm to encode the multiple, alternative LR stacks).
* GLL describes a direct translation from a grammar to code. It does not use an intermediate automaton.
* GLL need not traverse depth-first; it can proceed breadth-first as well.

I would say the core idea of the paper is: change the call stack of a recursive-descent parser into a GSS. S & J did not highlight any hard/interesting decisions which had to be made along the way from that idea to their solution, and nothing popped out at me.
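
To make that concrete, here is a minimal sketch of the bookkeeping (my own illustrative Python, not S&J's formulation; the names Descriptor, GSS, push and pop are mine). Suspended parser states become descriptors, and the GSS merges calls so that many pending continuations share one (nonterminal, position) node. The subtle part is that a caller recorded after the node has already been popped must still be resumed for those earlier results:

from collections import namedtuple

# A descriptor is a suspended parser state: the grammar slot to resume
# at, the GSS node standing in for its call stack, and an input index.
Descriptor = namedtuple('Descriptor', 'slot node pos')

class GSS:
    def __init__(self):
        self.edges = {}    # node -> {(return_slot, parent_node), ...}
        self.popped = {}   # node -> {input positions already popped}

    def node(self, nonterminal, pos):
        key = (nonterminal, pos)
        self.edges.setdefault(key, set())
        self.popped.setdefault(key, set())
        return key

    def push(self, node, return_slot, parent, schedule):
        # Record a call; many callers share one (nonterminal, pos) node.
        if (return_slot, parent) not in self.edges[node]:
            self.edges[node].add((return_slot, parent))
            # A late caller must still see results this node already produced.
            for pos in self.popped[node]:
                schedule(Descriptor(return_slot, parent, pos))

    def pop(self, node, pos, schedule):
        # The nonterminal finished at pos: resume every recorded caller.
        if pos not in self.popped[node]:
            self.popped[node].add(pos)
            for return_slot, parent in self.edges[node]:
                schedule(Descriptor(return_slot, parent, pos))

A driver then keeps a worklist of descriptors plus a "seen" set so each descriptor is processed at most once; together with the GSS merging, that is what bounds the total work, and since the worklist order is free, the same machinery can run depth-first or breadth-first.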

Also, you might want to look over S & J’s "Generalized Bottom Up Parsers With Reduced Stack Activity". This paper forms the basis for their GLL idea (IMO). I thought the paper was very well written and really enjoyed it. Their basic idea is to construct a machine which incorporates nonterminal reductions into the DFA as long as the nonterminal does not have embedded recursion.

I would start by heartily recommending the book

"Parsing Techniques, 2nd Edition". as inside I find the following:

"Greibach [389] describes the “hardest context-free language”, a language such that if we can parse it in time O(nx), we can parse any CF language in time O(nx). Needless to say, it’s hard to parse. "

I believe this is the best example of a pathological grammar: Greibach, Sheila A. "The hardest context-free language". SIAM J. Computing, 2(4):304–310, Dec. 1973.

http://epubs.siam.org/sicomp/resource/1/smjcat/v2/i4/p304_s1

I would also echo the Scott and Johnstone papers on cubic-time parsing. They're well written, if a little dense, and they explain what is needed for cubic-time parsing and recognition (roughly: binarize your grammar and parse tree, and deal with some problems around null reductions).

As for the Earley parser (there are some errors in the original paper): I find the treatment in Parsing Techniques exemplary, and I would recommend it over the original paper as an accurate, in-depth explanation of the technique and its adaptations.

But to go back to your point: LL is by nature a subset of LR and can recognise less. I'd wager that your runtimes look reasonable, assuming you can't recognise all of LR. I would also imagine that your parser resembles an Earley parser or a packrat parser and would benefit from exploring those techniques.

Aside: Earley parsers can be adapted to recognize LR-regular grammars in linear time and context-free grammars in cubic time.
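
Since Earley parsing has come up twice here, a minimal recognizer sketch may help (my own illustrative Python: recognition only, no forest building, and no epsilon rules, which is exactly where the null-reduction trouble mentioned above comes in). The grammar encoding is a dict from nonterminal to tuples of symbols, with single-character terminals:

def earley_recognize(grammar, start, text):
    # Chart items are (head, body, dot, origin); chart[i] holds items
    # whose dot has reached input position i.
    n = len(text)
    chart = [set() for _ in range(n + 1)]
    for body in grammar[start]:
        chart[0].add((start, body, 0, 0))
    for i in range(n + 1):
        worklist = list(chart[i])
        while worklist:
            head, body, dot, origin = worklist.pop()
            if dot < len(body) and body[dot] in grammar:
                # Predict: expand the nonterminal after the dot.
                for alt in grammar[body[dot]]:
                    item = (body[dot], alt, 0, i)
                    if item not in chart[i]:
                        chart[i].add(item)
                        worklist.append(item)
            elif dot < len(body):
                # Scan: match a terminal against the next input character.
                if i < n and text[i] == body[dot]:
                    chart[i + 1].add((head, body, dot + 1, origin))
            else:
                # Complete: advance every caller waiting on this nonterminal.
                for h, b, d, o in list(chart[origin]):
                    if d < len(b) and b[d] == head:
                        item = (h, b, d + 1, o)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[n])

# The highly ambiguous grammar from above, S -> b | S S | S S S:
print(earley_recognize({'S': [('b',), ('S', 'S'), ('S', 'S', 'S')]},
                       'S', 'bbbbb'))   # True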

Source code

You said that your parser was only 85 lines; any possibility that we could take a look at it?