## A new implementation of recursive-descent parsing?

It is well known that the classic way to implement a recursive-descent parser is as a set of mutually-recursive functions, written by hand or generated automatically: either as plain procedures, by combining parser modules through a DSL (as in Boost::Spirit), or by using templates/generics in languages that support them.

All these implementations work in a similar manner: each procedure invokes one or more other procedures. This approach has some drawbacks:

1. the stack can easily overflow, especially when parsing deeply-nested rules.
2. some of the results may be thrown away, making parsing slow.
3. left-recursive grammars cannot be handled.
4. backtracking has a cost.

While trying to approach the issue of RD parsers from a different perspective, I think I have found an implementation which solves some of the problems mentioned above. I hope I am not repeating someone else's work (a brief search over the internet found no similar approaches).

The idea is that instead of the input data travelling down the tree of parsers, each parser processes the current input and then leaves a continuation on a stack. The main parse engine pops the continuation off that stack and executes it; the new continuation puts another state on the stack etc. Thus an RD parser becomes a sort of state machine.

Generally an RD parser consists of the following operations:

• sequence: parsers are invoked one after another to parse the input.
• choice: a parser is invoked when another parser fails.
• loop: a previously invoked parser is invoked again.

The traditional approach handles these operations in the following ways:

• sequence: formed 'naturally' by placing one statement after the other or placing parser objects in a list.
• choice: 'if' statements are used to select another possible branch.
• loop: a parsing procedure is invoked again with new parameters.

For example (in pseudo-Java):

class sequence_parser implements parser {
    list parsers;
    iterator parse(iterator input) {
        foreach (p, parsers) {
            input = p.parse(input);
            if (input == null) return null;  // propagate failure
        }
        return input;
    }
}

class choice_parser implements parser {
    list parsers;
    iterator parse(iterator input) {
        foreach (p, parsers) {
            iterator output = p.parse(input);
            if (output != null) return output;
        }
        return null;
    }
}

class loop_parser implements parser {
    parser other;
    iterator parse(iterator input) {
        while (true) {
            iterator output = other.parse(input);
            if (output == null) return input;  // stop when the body no longer matches
            input = output;
        }
    }
}


The new approach handles these tasks differently, using the following parsing context:

1. a stack of parsers that is used for sequences
2. a stack of parsers that is used for choices
3. an iterator over the input
4. a stack of saved contexts used for backtracking

Parsing operations are handled like this:

• sequence: a new parser is pushed onto the sequence stack.
• choice: the alternative branch parser is pushed onto the choice stack, then the first branch of the choice is invoked.
• loop: the parser pushes itself on the sequence stack, then invokes another parser.

For example:

class sequence_parser implements parser {
    parser sequence_next;
    parser this_parser;
    iterator parse(context c) {
        c.sequence_stack.push(sequence_next);
        return this_parser.parse(c);
    }
}

class choice_parser implements parser {
    parser choice_next;
    parser this_parser;
    iterator parse(context c) {
        c.save();
        c.choice_stack.push(choice_next);
        return this_parser.parse(c);
    }
}

class loop_parser implements parser {
    parser this_parser;
    iterator parse(context c) {
        c.sequence_stack.push(this);
        return this_parser.parse(c);
    }
}


The parsing algorithm then becomes a very simple loop:

void parse(context c) {
    while (true) {
        parser p = c.sequence_stack.pop();
        if (p != null) {
            iterator output = p.parse(c);
            if (output == null) {
                p = c.choice_stack.pop();
                if (p != null) {
                    c.restore();
                    c.sequence_stack.push(p);
                }
                else throw parse_exception();
            }
        }
        else if (c.input_exhausted()) {
            return;
        }
        else {
            throw parse_exception();
        }
    }
}


The above does the following:

1. it pops a parser from the sequence parser stack.
2. if the sequence parser stack is empty, then parsing succeeded if the input is exhausted, else parsing failed.
3. if the sequence parser stack is not empty, then the parser object is invoked to parse the input; it is passed the context.
4. if the parser succeeded, then the loop continues
5. otherwise a parser is popped from the choice stack, and
6. a saved context state is popped from the stack and activated.
7. the parsing loop continues if the popped choice parser is not null; otherwise parsing fails.
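The loop above can be sketched as a small runnable program. This is a Python rendering of the pseudo-Java (class and method names such as `Context`, `Term`, `Seq` and `Choice` are mine, not part of the original design); like the pseudo-code, it restores a saved state by resetting three integers, so it backtracks only to states whose pending sequence entries have not already been consumed:

```python
class Context:
    def __init__(self, text):
        self.text = text
        self.pos = 0            # input position
        self.seq = []           # sequence stack
        self.alt = []           # choice stack
        self.saved = []         # saved states for backtracking

    def save(self):
        # a saved state is just three integers
        self.saved.append((len(self.seq), len(self.alt), self.pos))

    def restore(self):
        # O(1) backtracking: truncate both stacks and reset the input
        seq_len, alt_len, pos = self.saved.pop()
        del self.seq[seq_len:]
        del self.alt[alt_len:]
        self.pos = pos

class Term:                     # matches one literal character
    def __init__(self, ch): self.ch = ch
    def parse(self, c):
        if c.pos < len(c.text) and c.text[c.pos] == self.ch:
            c.pos += 1
            return True
        return False

class Seq:                      # push the rest of the sequence, run the head
    def __init__(self, *ps): self.ps = ps
    def parse(self, c):
        for p in reversed(self.ps[1:]):
            c.seq.append(p)
        return self.ps[0].parse(c)

class Choice:                   # save state, remember the alternative, run branch 1
    def __init__(self, first, second):
        self.first, self.second = first, second
    def parse(self, c):
        c.save()
        c.alt.append(self.second)
        return self.first.parse(c)

def run(parser, text):
    c = Context(text)
    c.seq.append(parser)
    while True:
        if not c.seq:
            return c.pos == len(c.text)   # success iff input exhausted
        p = c.seq.pop()
        if not p.parse(c):
            if not c.alt:
                return False              # no alternative left: parse error
            alt = c.alt.pop()             # most recent choice point
            c.restore()
            c.seq.append(alt)
```

For example, `run(Seq(Term('a'), Choice(Term('b'), Term('c'))), "ac")` succeeds by popping the alternative off the choice stack and restoring the saved position.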

This approach solves some of the problems of the traditional approach in the following ways:

1. stack overflow becomes practically impossible, because user-defined stacks that grow automatically are used; the practical limit is the process' memory limit. The user-defined stacks do not grow very large, because the main parsing loop continuously pops items from them and new entries are pushed only on demand.
2. backtracking becomes a constant-time operation no matter how deep the parser has proceeded, because a saved state is simply popped from the stack. A saved context consists of the sequence and choice stack positions and the input position (as little as 12 bytes if three 32-bit integers are used).

## Comment viewing options

### Tail calls

How does it compare to the naive recursive descent parser run in a language which implements tail call optimization? I think it's the same, i.e. you are just simulating tail call optimization by hand.

### I think it is not entirely

I think it is not entirely similar to a tail call situation, because in the tail call situation it is only the last call (i.e. the tail) that is optimized. In many RD functions, a sequence has many steps that are not tail calls. In this version, all calls are like 'tail calls' because each parser function never calls another one, it just pushes the 'next' parser to a stack.

This new approach is (probably, I hope!) faster than the traditional one, even with tail call optimizations, because

a) the call/return instructions are minimized.

b) it makes better use of the system stack.

c) when it needs to take another branch, control flow does not have to return to the caller.

### Trampoline style

Implementing calls by a trampoline (which is one way to obtain TCO on a runtime which doesn't provide it itself) doesn't use function calls of the host language, neither tail nor non-tail; except in a single place in the trampoline driver, which is used once for each emulated call. Each function returns the pointer to the function it wishes to call, and before that it might have pushed or popped a function pointer playing the role of the return address.
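For readers unfamiliar with the style, here is a minimal sketch of a trampoline driver (in Python; `even`/`odd` are illustrative examples, not taken from any of the compilers mentioned): functions never call each other directly, each returns the thunk it wishes to "call", and a single loop performs every transfer.

```python
def even(n):
    # "tail-call" odd(n - 1) by returning a thunk instead of calling it
    if n == 0:
        return True
    return lambda: odd(n - 1)

def odd(n):
    if n == 0:
        return False
    return lambda: even(n - 1)

def trampoline(thunk):
    # the single place where emulated calls happen: one host-language
    # call and return per transfer, constant system-stack depth
    result = thunk()
    while callable(result):
        result = result()
    return result

trampoline(lambda: even(100000))   # → True, with no deep recursion
```

A direct mutually-recursive version of `even`/`odd` would overflow the system stack at this depth in a language without TCO; the trampoline trades that for one extra call/return per transfer.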

It's actually slower than the direct style (except for using less memory for a large number of tail calls which can make it faster), but it's necessary to obtain TCO. It's slower because a call is implemented with a return and a call (and parameters are passed in a different place), nothing is saved.

This technique is used by some language implementations which generate C code, for example Glasgow Haskell and my language Kogut (except that in both of these compilers there is an optional assembler mangler which eliminates the need of the trampoline, by turning returning of function pointers into direct calls). This is done automatically for the whole code, there is no need to implement it by hand in these languages.

With this technique local variables and return addresses are pushed on a custom stack, which is used instead of the system stack. This happens to make it easy to solve the problem of stack overflows of a fixed-size stack: an overflow can be detected and can cause the stack to be physically resized. Again, this is done transparently for the whole language, not just for the parser.

Your technique is not faster than the trampoline technique because it's essentially the same technique. Real compilers which provide TCO are likely to be faster if they can avoid the trampoline at all, which requires stepping outside of portable C. Some of them, e.g. OCaml, use the system stack, which is even faster, but is vulnerable to stack overflows for nested non-tail calls.

If you still disagree with this, I would like to see a runnable source of an implementation.

### Trampolines?

Since trampoline can mean many things, I assume that you are talking about hijacking the return address so that, instead of returning to the caller, a new parsing function is invoked.

Well, it may be cool to be able to do so, but personally I am against those extremely low-level tricks unless absolutely needed. I prefer to optimize the algorithm before engaging myself in so low-level hacks.

One possible optimization of the algorithm used above is to not pop a parser from a stack but replace it with another parser...it will save a few cycles.

The main benefit of the algorithm though is the instant backtracking (no need to return from all those functions or use exceptions as gotos) and the increased locality of data and code which means fewer cache misses. I guess the algorithm can be optimized further with various tricks like trampolines.

Used in some LISP implementations, a trampoline is a loop that iteratively invokes thunk-returning functions. A single trampoline is sufficient to express all control transfers of a program; a program so expressed is trampolined or in "trampolined style"; converting a program to trampolined style is trampolining. Trampolined functions can be used to implement tail recursive function calls in stack-oriented languages.

Which is exactly what you are doing (as I pointed out below), only instead of returning a function for the trampoline loop to call, you store it in a data structure. In the success/failure continuation backtracking parser, you have to indicate to the trampoline loop whether it is calling a success or failure continuation next, because they have different types. You achieve this by representing a success continuation as the top of a nonempty sequence stack, and a failure continuation as a pair of an empty sequence stack and the top of a nonempty choice stack.

Well, it may be cool to be able to do so, but personally I am against those extremely low-level tricks unless absolutely needed. I prefer to optimize the algorithm before engaging myself in so low-level hacks.

But your whole thing is low-level hacks to get around lack of closures, lack of proper tail calls, and lack of a lightweight way to express multiple return values or tuples! :)

### But your whole thing is

But your whole thing is low-level hacks to get around lack of closures, lack of proper tail calls, and lack of a lightweight way to express multiple return values or tuples! :)

I do not consider using a set of stacks 'low-level' from a C++ perspective, because I am not messing with structures I am not supposed to touch, or ones that are extremely dangerous or non-portable.

If the result that can be achieved in a 'poor' language (i.e. a language without the goodies you mention) is equally clean and effective with the result from the 'rich' language, then I consider the solution to be a successful one.

EDIT:

How would backtracking be handled when closures/tail calls are available? Wouldn't a continuation be stored on a stack so it can be popped off at the right moment?

### Incidentally

Your parser can still run out of stack space, because you have only trampolined the calls to the success and failure continuations, not the nested calls to parsers. Take a look at what would happen if you had sequences and/or choices nested in the first position.

### I suppose that you mean what

I suppose that you mean what would happen if a sequence points to another sequence which may point to a choice and that to another sequence etc. There are two solutions to this:

1) the type system of the host language can be used to make sure a sequence never invokes a sequence, a choice never invokes a choice etc.

2) a parser does not directly invoke another parser but pushes it on the stack. The trampoline loop will invoke the next parser anyway.

Anyway, the chance to run out of stack space is severely minimized even if the algorithm stays as it is...the real problem in RD parsing is when parsing expressions, especially when they are nested several levels deep. For example, a traditional procedural RD parser can easily choke on complicated expressions like getFactory().createInstance().doJob(1, 2, "x")[object.getIndex(data)] which are often used in many programs, because of the many levels of operator and expression rules in the grammar.
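Solution (2) can be sketched as follows (Python, with hypothetical names): the sequence parser pushes all of its children and returns immediately, never calling another parser directly, so arbitrarily deep nesting in the first position consumes no system stack.

```python
class Ctx:
    def __init__(self, text):
        self.text, self.pos, self.seq = text, 0, []

class Term:                     # matches one literal character
    def __init__(self, ch): self.ch = ch
    def parse(self, c):
        if c.pos < len(c.text) and c.text[c.pos] == self.ch:
            c.pos += 1
            return True
        return False

class Seq:
    def __init__(self, *ps): self.ps = ps
    def parse(self, c):
        # never call a child directly: push them all (in reverse, so the
        # first child is popped next) and let the driver loop run them
        for p in reversed(self.ps):
            c.seq.append(p)
        return True

def run(parser, text):
    c = Ctx(text)
    c.seq.append(parser)
    while c.seq:
        if not c.seq.pop().parse(c):
            return False
    return c.pos == len(c.text)

# 10000 levels of nesting in the first position; the system stack never grows
deep = Term('a')
for _ in range(10000):
    deep = Seq(deep)
ok = run(Seq(deep, Term('b')), "ab")   # → True
```

The same trick applies to choices; only terminal parsers ever touch the input.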

### Escape continuations

Your failure continuation is equivalent to an exception.

It's not asymptotically cheaper than encoding the success or failure in a return value and checking it explicitly after each call. While the jump itself is cheaper than returning from multiple functions, the same number of functions which are now returning must have been called before, one by one. So the difference is only in constants and amortization.

Exceptions in C++ are traditionally quite expensive (in a typical implementation the non-exceptional case has no overhead, but throwing an exception examines every stack frame and decodes the return address to determine which destructors to call). This is not true in every language. E.g. in OCaml and Kogut establishing an exception handler has a non-zero cost but throwing an exception is constant-time. Just like in your technique.

### stackless runtimes are good

Achilleas Margaritis: The idea is that instead of the input data travelling down the tree of parsers, each parser processes the current input and then leaves a continuation on a stack.

(Note I've only read your natural-language text and not the sample code.) I think you'd get much the same effect by using a stackless language with continuations and tail call optimization, which is what Qrczak also seems to be saying. I.e., this style of recursive-descent parser seems a lot like special-casing a general stackless runtime for a parser alone. I'd expect a backtracking RD parser in (say) Scheme might easily closely resemble your plan without a theoretical justification.

My practical feedback is that you might get more mileage by using a stackless approach with continuations generally for everything, and not just the parser. But since I don't know many infrastructures like that you can start with (I'm sure someone can cite one for me), I'm not helping much. (I'm writing something exactly like that for my own use; but since I'm aiming it at Lisp and Smalltalk runtimes, it won't be terribly friendly toward all languages, and certainly not things in the C family.)

### Complicates debugging.

I did the same in (previous versions of) my previously described matching/rewriting language. It's probably the most intuitive way to do RD parsing in a language without call/cc, when one is aware of implementations of amb with call/cc. I can publish the latest version that still used this approach, if you want (Common Lisp). The problems I faced:

1. Everything is a tail call, which makes things hard to debug, even once the 'continuations' are printable.
2. It's slow: this approach tends to cons *a lot*, not to mention that trampolines are a slow way to implement tail calls.

Note that you don't actually need a stack of continuations, especially for backtracking. You can just save the whole state, including choice and sequence continuations. If you use functional data structures, the sharing means that you have the same effect as explicitly saving only the required part of the context, with the added value of not exposing a bug when you have to backtrack to a position that you've already popped off the sequence stack.

I've gone back to traditional monadic combinators now (with user-level lazy lists). The code is much clearer, debugging is much easier, and I dare hope that stack overflow is mostly a non-issue nowadays (C and C++ are obvious exceptions). As for backtracking not being O(1), its constant factor is rather tiny. I'd say that your biggest problem wrt backtracking is that you don't memoise the results, which means that you'll often do the same work repeatedly -- a much more expensive waste than I ever expect returning down the hardware stack to be.

Looking at your list of issues:

1. the stack can easily overflow, especially when parsing deeply-nested rules.
2. some of the results may be thrown away, making parsing slow.
3. left-recursive grammars can not be handled.
4. backtracking has a cost.

1. This is surprising to me, but then I've only programmed a nearly trivial RD parser in Java, and nothing in C or C++. Also, note that this is a much bigger issue for approaches that implement backtracking via (faked) continuations. If you use monad combinators, the structure of the calls becomes very similar to normal programs. Obviously, if you have real call/cc, that's not an issue.
2. I'm not sure what you're referring to here, nor how it can only affect RD parsers. Would adding a "first" operator (which only returns the first result of the rules it wraps, without allowing backtracking in those rules later on) help?
3. I don't think there's any excuse not to handle left-recursive grammars in a memoising [edit: explicitly memoising -- those that depend on call-by-need don't have it quite as easy] RD parser. I also think there aren't many reasons not to add memoisation to a RD parser ;) Both the previous versions and the current approach of my parser/rewriter do handle left-recursive rules [with the added bonus that the previous version treated left-recursive rules as a success and faked a weird return value in order to implicitly rewrite circular data structures into circular ones, something made possible by my treating return values and state separately... a huge pain, but this is probably better saved for another venue].
4. Again the cost of backtracking?? I don't see how backtracking itself is an issue -- the gratuitous matching of rules due to naïve backtracking, or the issue of the locality of backtracking, *that* I can see (I have also suggested 'first' and memoisation as solutions).

So yes, this is a plausible approach, and it works, but, imho, it doesn't present any real advantage over monadic parsing. I suppose that if your only alternative is a classical choice-as-if, everything-is-a-tail-call approach (in a language that does not guarantee/encourage TCO), it is interesting, but I'd suggest implementing a monadic parsing library (or using a premade one).

[Sort of edit: I just saw another reply saying the same thing wrt TCO and your trampoline. I did something similar (minus the trampoline) in CL, in an implementation that does implement TCO because I wanted to abstract it away from the user, and let them mostly use rewriting rules instead of directly handling continuations, etc. However, the other points are also important, imho.]

[EDIT2: cite -> blockquote]

### 1. Everything is a tail

1. Everything is a tail call, which makes things hard to debug, even once the 'continuations' are printable.
2. It's slow: this approach tends to cons *a lot*, not to mention that trampolines are a slow way to implement tail calls.

I do not know about Lisp, but doing it in C++ was much better than the traditional approach. It was much easier to debug, because I only had to iterate in one loop (instead of traversing a control flow tree), and there were no 'conses' involved.

I'm not sure what you're referring to here, nor how it can only affect RD parsers. Would adding a "first" operator (which only returns the first result of the rules it wraps, without allowing backtracking in those rules later on) help?

The problem is when one does this:

r = a b
| a


Then 'a' is parsed twice in an RD parser. Thanks to EBNF and operator overloading, the above can be written as:

r = a b?


and in Boost::Spirit:

rule<> r = a >> !b;

I don't think there's any excuse not to handle left-recursive grammars in a memoising [edit: explicitly memoising-- those that depend on call-by-need don't have quite as easy] RD parser.

In the traditional approach there is a problem: if a function A immediately calls A(), then we have an infinite loop. With my approach the problem can be solved, but for now let's stick to the discussion of the other parts.

Again the cost of backtracking?? I don't see how backtracking itself is an issue -- the gratuitous matching of rules due to naïve backtracking, or the issue of the locality of backtracking, *that* I can see (I have also suggested 'first' and memoisation as solutions).

In traditional frameworks like Boost::Spirit and others, for each failed production another path has to be taken. There is a great cost when a path fails, because many functions must return, the stack must be unwound many times, etc. With this new approach, branching is like jumping to a previous state...sort of like using exceptions (I think Spirit uses this trick).

Another advantage is that two loops can be combined: one that tokenizes and one that parses, using the same code.

### You don't seem to be addressing my points...

I do not know about Lisp, but doing it in C++ was much better than the traditional approach. It was much easier to debug, because I only had to iterate in one loop (instead of traversing a control flow tree), and there were no 'conses' involved.

Oh, I'm not saying that it's harder to debug than the traditional approach, but rather that it's harder to debug than monadic parsing combinators. Wrt tail calls and debugging, both this approach and the traditional one have the same problem. Also, I doubt that you can save the current state without consing (allocating memory).

 r = a b | a 

Is exactly the sort of problem that memoisation solves.
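As a sketch of what memoisation buys here (hypothetical Python, caching on an (input, position) key): in r = a b | a, the second branch's 'a' becomes a cache hit instead of a reparse.

```python
calls = {"a": 0}                # counts how often rule 'a' actually runs

def memo(fn):
    cache = {}                  # (input, position) -> result
    def wrapped(text, pos):
        key = (text, pos)
        if key not in cache:
            cache[key] = fn(text, pos)
        return cache[key]
    return wrapped

@memo
def parse_a(text, pos):
    calls["a"] += 1
    return pos + 1 if text[pos:pos+1] == "a" else None

def parse_r(text, pos):
    # r = a b | a
    p = parse_a(text, pos)              # first branch: a b
    if p is not None and text[p:p+1] == "b":
        return p + 1
    return parse_a(text, pos)           # second branch: a  (cache hit)

parse_r("ac", 0)   # matches the second branch; 'a' was parsed only once
```

Afterwards `calls["a"]` is 1, not 2: the failed first branch did not cost a reparse of 'a'.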

In the traditional approach there is a problem: if a function A immediately calls A(), then we have an infinite loop. With my approach the problem can be solved, but for now let's stick to the discussion of the other parts.

Yes, but if one does explicit memoisation, one needs a predicate to tell whether 2 states are equivalent. The same predicate can be used to detect circularity (in the driver loop, an around method, or an explicit prelude to every parsing function, for example) and backtrack to the next alternative.

Finally, function returns are *cheap*. They're easily predicted by the CPU, and have a much lesser interpretative overhead than a user-managed stack. Faking a return stack by hand is, imho, nearly always a pessimisation.

Re tokenisation and parsing in a single loop, that would be true of any stateful parsing framework. In fact, they can be generalised to rewriting to and from arbitrary data structures*, not just from lists of characters/strings to ASTs.

* Sorry, no fancy proof, apart from having done so :)

### Also, I doubt that you can

Also, I doubt that you can save the current state without consing (allocating memory).

The state can be saved using stacks: a save means 'push a state' and a restore means 'pop a state'. Since stacks are preallocated and grow on a need basis, consing is not necessary (unless you are talking about languages in which consing is the only way to achieve the desired effect).

Yes, but if one does explicit memoisation, one needs a predicate to tell whether 2 states are equivalent. The same predicate can be used to detect circularity (in the driver loop, an around method, or an explicit prelude to every parsing function, for example) and backtrack to the next alternative.

I think we should separate memoisation from the problem of left recursion. They are two different things. For solving the problem of left recursion, a parser could either branch to a non-left recursive production or convert the grammar from r = r a | b to r = b *a, which is my favorite.
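The rewritten grammar then parses with a plain loop; a sketch (hypothetical Python) for r = r a | b rewritten as r = b a*:

```python
def parse_r(text, pos):
    # r = b a*   (the rewrite of r = r a | b)
    if text[pos:pos+1] != "b":
        return None                  # the non-recursive part must match first
    pos += 1
    while text[pos:pos+1] == "a":    # then zero or more a's
        pos += 1
    return pos                       # position after the match

parse_r("baa", 0)   # → 3
```

No recursion remains at all, so neither stack growth nor circularity detection is needed for this rule.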

### Again...

The state can be saved using stacks: a save means 'push a state' and a restore means 'pop a state'. Since stacks are preallocated and grow on a need basis, consing is not necessary (unless you are talking about languages in which consing is the only way to achieve the desired effect).

You're still allocating a relatively large amount of memory to represent the states (just so we're clear, consing ~= heap allocation). Obviously, when you do more work in the host language, you need to represent less state, so depending on the approach taken wrt programming the parser, it may not be a problem. However, while allocating and deallocating elements in vectors will be cheaper than for GCed lists, you're also suffering from a bug when you want to backtrack to a state you've already popped off the sequence stack...

I think we should separate memoisation from the problem of left recursion. They are two different things.

Indeed, they are two very different things. However, in the context of the great-great-grandparent post, where I point out that memoisation solves the, imo, most important efficiency problem of recursive-descent parsers (multiple matches of the same rule against the same position in naïve grammars), my point was simply that once you've decided to memoise your parser, detecting circularity should be a given.

### you're also suffering from

you're also suffering from a bug when you want to backtrack to a state you've already popped off the sequence stack...

I do not think it is possible for what you mention to happen in a parser. Backtracking always results in a previous state in which the remaining continuations in the parser have not been executed yet.

my point was simply that once you've decided to memoise your parser, detecting circularity should be a given

how?

### I do not think it is

I do not think it is possible for what you mention to happen in a parser. Backtracking always results in a previous state in which the remaining continuations in the parser have not been executed yet.

That is equivalent to saying that backtracking can be implemented solely with dynamic extent continuations, e.g. setjmp/longjmp.

r1 = [rule a] | [rule b]
r2 = r1 [fail]

I would expect 'r1' to try and execute both rule 'a' and rule 'b'.

If you only allow backtracking to a state which hasn't been executed yet, I believe that is similar to only allowing rules to return one solution (as if every rule were wrapped with an operator that only allowed local backtracking). I guess it might be what you want, and it is closer to how I would write a recursive-descent parser by hand. It's not exactly vanilla backtracking, however. In this case, I really fail to see how your approach can be easier to debug, or offer a performance advantage: you're strictly simulating the hardware stack in software. In some environments, I guess that it might let you escape a particularly constrained stack.

my point was simply that once you've decided to memoise your parser, detecting circularity should be a given

how?

As written in the great^4-grandparent post, if you have memoisation, you have a way to compare states for equality. Thus, you have a way to check for circularity.

### That is equivalent to

That is equivalent to saying that backtracking can be implemented solely with dynamic extent continuations, e.g. setjmp/longjmp.

Indeed.

I would expect 'r1' to try and execute both rule 'a' and rule 'b'.

I have never met a recursive-descent parser that works like that. What is the meaning of that? Isn't an RD parser supposed to test the first branch, then if that fails the second branch, etc.?

In this case, I really fail to see how your approach can be easier to debug

Easier than what? Certainly it is easier to iterate within a loop instead of many procedures (at least for me). I usually get lost when stepping through many mutually recursive procedures.

or offer a performance advantage: you're strictly simulating the hardware stack in software.

If a deeply nested rule fails to be matched, then there is no need to return from all those procedures.

As written in the great^4-grandparent post, if you have memoisation, you have a way to compare states for equality. Thus, you have a way to check for circularity.

But in memoisation you simply cache the parsing result. In left recursion, the same rule is invoked to parse itself without the parsing result existing at that moment. In other words, when there is a left recursion, there is nothing to compare against.

Maybe if you told us a little more about it?

### I never met a

I have never met a recursive-descent parser that works like that. What is the meaning of that? Isn't an RD parser supposed to test the first branch, then if that fails the second branch, etc.?

It means that the parser would execute 'r2', which invokes 'r1' to try to execute rule 'a', then hit the [fail] in 'r2', backtrack (and rebuild the stack) to 'r1' and execute rule 'b', and finally hit the final failure.

I guess that's what happens when you have constraint programming on your mind. FWIW, I think that parsing combinators work that way. You have a recursive-descent parser, but it doesn't have full backtracking. Moreover, it is easy to get the same behaviour from an implementation with full backtracking (just make the backtracking local to each rule's dynamic extent). The reverse is false.

or offer a performance advantage: you're strictly simulating the hardware stack in software.

If a deeply nested rule fails to be matched, then there is no need to return from all those procedures.

As someone else pointed out, you're only paying the same (algorithmic) total cost incrementally at every call/return in the trampoline. And, as I pointed out before, procedure returns are cheap: they're common in normal programming practice and easily predicted, unlike other indirect jumps/calls.

But in memoisation you simply cache the parsing result. In left recursion, the same rule is invoked to parse itself without the parsing result existing at that moment. In other words, when there is a left recursion, there is nothing to compare against.

Maybe if you told us a little more about it?

In memoisation, you have a map from 'input state' -> 'parsing result'. You need a predicate to compare input states for equality. The same predicate can be used to detect circularity (2 identical input states in a "backtrace"). You can even use the memo cache to indicate a failure for the current state when you enter it, and then overwrite that failure with the result when you leave the state (since your backtracking never causes the stack to be rebuilt).
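A minimal sketch of that seeded-failure idea (hypothetical Python, for the directly left-recursive r = r 'a' | 'b'): the memo entry is written as a failure on entry, so a left-recursive re-entry at the same position terminates instead of looping. Note that a single pass like this only yields the non-recursive seed; growing the match requires the retry loop described elsewhere in the thread.

```python
memo = {}                        # (input, position) -> result for rule r

def parse_r(text, pos):
    # grammar: r = r 'a' | 'b'   (directly left-recursive)
    key = (text, pos)
    if key in memo:
        return memo[key]         # left-recursive re-entry sees the seed
    memo[key] = None             # seed: assume failure on circularity
    p = parse_r(text, pos)       # first branch: r 'a'
    if p is not None and text[p:p+1] == "a":
        result = p + 1
    elif text[pos:pos+1] == "b": # second branch: 'b'
        result = pos + 1
    else:
        result = None
    memo[key] = result           # overwrite the seed on the way out
    return result

parse_r("b", 0)   # → 1: terminates instead of recursing forever
```

The equality predicate here is just tuple comparison on the (input, position) key; a richer parser state would need a richer predicate.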

### It means that the parser

It means that the parser would execute 'r2', which invokes 'r1' to try to execute rule 'a', then hit the [fail] in 'r2', backtrack (and rebuild the stack) to 'r1' and execute rule 'b', and finally hit the final failure.

I guess that's what happens when you have constraint programming on your mind. FWIW, I think that parsing combinators work that way. You have a recursive-descent parser, but it doesn't have full backtracking. Moreover, it is easy to get the same behaviour from an implementation with full backtracking (just make the backtracking local to each rule's dynamic extent). The reverse is false.

But if [fail] is executed, you won't have to go back to r1...because r2 would have failed to be matched.

The order of evaluation of rules is:

[start]
r1
[rule a] or [rule b]
[fail]


So if parsing fails due to [fail], the parser returns to [start] and not to [rule b]. [Rule b] is executed only if [rule a] fails.

Therefore I do not see why you have to "rebuild the stack" in any way.

As someone else pointed out, you're only paying the same (algorithmically) total cost incrementally at every call/return in the trampoline.

In the trampolined version, all those push/pop/stack adjustment instructions do not exist; the stack is used in a much better way. In the classic RD approach, most of the allocated stack is never used.

In memoisation, you have a map from 'input state' -> 'parsing result'. You need a predicate to compare input states for equality. The same predicate can be used to detect circularity (2 identical input states in a "backtrace")

So basically we are saying the same thing. The predicate I use is the rule itself: I have a flag in the rule object that is used as the predicate to check for left recursion.

But that does not solve the problem. A left recursive rule could be like this:

r = r a
  | r b
  | c


which is equivalent to:

r = c (a | b)*


So the question remains: what do you do after you detect a left recursion? You may branch to the non-left recursive part, but then you have to loop over the part after the left recursive reference.

What I have done is to reconstruct the grammar from a left recursive one to a non-left-recursive one by a) inserting an 'optional' operator after the non-recursive branch b) uniting the parts of the left-recursive productions into a choice, c) inserting a goto operator after the choice which goes back to the optional operator.
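What the rewritten grammar r = c (a | b)* executes can be sketched as a plain loop (my own rendering; the explicit loop stands in for the 'optional' operator, the choice, and the backward goto):

```java
public class Rewritten {
    static final int FAIL = -1;

    // Returns the position after the longest match of r, or FAIL.
    static int r(String input, int pos) {
        if (pos >= input.length() || input.charAt(pos) != 'c') return FAIL;
        pos++;                                    // the non-left-recursive branch: c
        while (pos < input.length()
               && (input.charAt(pos) == 'a' || input.charAt(pos) == 'b'))
            pos++;                                // the choice (a | b), then "goto" back
        return pos;
    }

    public static void main(String[] args) {
        System.out.println(r("cabba", 0));  // prints 5: all of "cabba" is matched
        System.out.println(r("ac", 0));     // prints -1: must start with c
    }
}
```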

### What I have done is to

What I have done is to reconstruct the grammar from a left recursive one to a non-left-recursive one by a) inserting an 'optional' operator after the non-recursive branch b) uniting the parts of the left-recursive productions into a choice, c) inserting a goto operator after the choice which goes back to the optional operator.

How does it work with a left recursion hidden in nested sequence calls? How does it work with a hidden left recursion, such as in r ::= (space* r) | a?

I used to handle left recursion with memoization: when I encounter a left recursion, I throw an exception, which I catch in the first call to the rule that caused the left recursion. There I memoize a failure (it's a kind of hypothesis) and retry parsing against the rule. If the result is "better", I memoize it and try again, until the new result is no better than the previous one. Eventually I memoize the best result.
The tricky part is tracking which memoizations depend on the hypothesis, so as to know that they must be discarded each time the hypothesis changes.
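The "grow the hypothesis" idea can be sketched like so (my own code, simplified to one rule and no exception machinery) for r ::= r '+' 'n' | 'n'. We memoize a failure as the initial hypothesis, reparse the rule body against the memo, and keep retrying while each result consumes more input than the last:

```java
import java.util.HashMap;
import java.util.Map;

public class GrowSeed {
    static final int FAIL = -1;
    final String input;
    final Map<Integer, Integer> memo = new HashMap<>();

    GrowSeed(String input) { this.input = input; }

    int r(int pos) {
        Integer cached = memo.get(pos);
        if (cached != null) return cached;   // left recursion hits the current hypothesis
        memo.put(pos, FAIL);                 // hypothesis: r fails here
        while (true) {
            int attempt = body(pos);
            int best = memo.get(pos);
            if (attempt == FAIL || (best != FAIL && attempt <= best))
                return best;                 // no improvement: keep the best result
            memo.put(pos, attempt);          // better result: grow the hypothesis, retry
        }
    }

    // One parse of the rule body against the current memo contents.
    int body(int pos) {
        int afterR = memo.get(pos);          // the left-recursive reference
        if (afterR != FAIL
                && afterR + 1 < input.length()
                && input.charAt(afterR) == '+' && input.charAt(afterR + 1) == 'n')
            return afterR + 2;               // alternative: r '+' 'n'
        if (pos < input.length() && input.charAt(pos) == 'n')
            return pos + 1;                  // alternative: 'n'
        return FAIL;
    }

    public static void main(String[] args) {
        System.out.println(new GrowSeed("n+n+n").r(0));  // prints 5
        System.out.println(new GrowSeed("n").r(0));      // prints 1
    }
}
```

The dependency tracking the comment mentions is the hard part a real implementation adds: any memo entry computed while the hypothesis was in force must be invalidated when the hypothesis changes.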

### But if [fail] is executed,

But if [fail] is executed, you won't have to go back to r1...because r2 would have failed to be matched.
...
So if parsing fails due to [fail], the parser returns to [start] and not to [rule b]. [Rule b] is executed only if [rule a] fails.

Therefore I do not see why you have to "rebuild the stack" in any way.

This is a limited form of backtracking, and is the behaviour we have grown to expect from recursive-descent parsers. However, this isn't backtracking as it is normally understood in constraint programming. Instead, (for a depth-first search), failure branches to the latest backtracking point, rebuilding the whole stack if necessary. That is why, for example, the authors of High-Level Nondeterministic Abstractions in C++ have to implement real continuations via stack copying. Doing otherwise would require programmers to nearly CPS their code by hand.

In the trampolined version, all those push/pop/stack adjustment instructions do not exist; the stack is used in a much better way. In the classic RD approach, most of the allocated stack is never used.

This is getting even more off-topic than the rest, but here goes... Any performance gain you might get from using less memory (not a lot, since it doesn't affect locality) will be largely offset by the fact that every call in the trampoline is indirect. Indirect calls are usually predicted to point to the same place as they did at the last execution. Nearly *every* call in the trampoline will be mispredicted, costing dozens of cycles. Compare that to adjusting the stack followed by a well-predicted return. Again, trampolines are very slow.

So the question remains: what do you do after you detect a left recursion? You may branch to the non-left recursive part, but then you have to loop over the part after the left recursive reference.

Currently, I simply trigger a failure and try the next alternative. In this case, that behaviour is less than ideal (it only matches 'c', and nothing else), but it is already better than blindly running around in circles.

### Stack overflow not a stack overflow?

Stack overflow becomes impossible because user-defined stacks that can grow automatically are used; the practical limit is the process's memory limit. The user-defined stacks do not grow very large, because the main parsing loop continuously pops items from them, and new entries are pushed onto the stack only as needed.

This doesn't really solve the stack overflow "problem", does it? It just moves it from one place to another - perhaps not even that. In practice, when are you likely to encounter a stack overflow in an RD parser anyway?


### It depends. Normally, the

It depends. Normally, the number of parsers pushed onto the stack should equal the number of parsers popped from it, even for loops, so normally the stack will never overflow. But if you write a parser that pushes more parsers onto the stack for each one popped, then there would be trouble.

### stack overflow risk and handling

Christopher Campbell: This doesn't really solve the stack overflow "problem", does it? It just moves it from one place to another - perhaps not even that. In practice, when are you likely to encounter a stack overflow in an RD parser anyway?

Stack overflow isn't much of a risk unless stack space is limited. But it's considered very bad (engineering) style to write code that can crash hard on the input being parsed. Early Mac systems I used often had 8K stack limits by default, which were really easy to overrun. I took to asking folks a standard question in interviews (the very short form of which is: how can you tell how deep the stack is?).

Typically Unix stacks in C can get quite big, and you'd have trouble overflowing stack unless you encountered pathological code (like a denial of service attack on your parser). But if you're using a pthread library for threads (pretty standard in C++ servers), typically you'd choose a stack size for a thread that's not very big, and you'd definitely want to protect yourself against non-pathological input syntax as well.

I like to solve stack overflow problems all the same way, which I've taught other folks who got stuck deciding what to do with parsing errors in general. The solution is: reify everything and let the AST consumer sort it out. In other words, turn every error into explicit nodes in the AST and keep going. Error reporting and handling can wait until someone walks the AST later. (As an optimization, a count of error nodes in the AST will stop futile efforts like compilation that can't succeed.)
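The "reify everything" advice can be sketched as follows (my own toy example: a comma-separated list of numbers). A bad item becomes an explicit Error node in the result instead of an exception, and parsing continues; the consumer can count Error nodes later to abandon futile work:

```java
import java.util.ArrayList;
import java.util.List;

public class ReifiedErrors {
    static List<String> parse(String input) {
        List<String> nodes = new ArrayList<>();
        for (String item : input.split(",")) {
            String t = item.trim();
            if (t.matches("\\d+")) nodes.add("Num(" + t + ")");
            else nodes.add("Error(" + t + ")");   // reified: an AST node, not an exception
        }
        return nodes;
    }

    public static void main(String[] args) {
        System.out.println(parse("1, x2, 3"));
        // prints [Num(1), Error(x2), Num(3)]
    }
}
```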

You can make a "stack overflow" judgment call in several practical, if arbitrary, ways (e.g. measuring distance in stack bytes or in call-nesting depth), with the effect of limiting the depth of nesting that's permitted to code writers. Your docs might include, for example, a warning that you only permit a maximum depth of, say, 256 nested block scopes in one method. Since I always write parsers by hand, I have no idea how you'd map this onto parser generators, but it can't be hard.

### Angst ridden thoughts on parser generators.

After years of writing recursive descent parsers by hand, I found the ultimate solution to my problem... don't :).

In all seriousness, I've never had a problem with stack overflow, even in C, that wasn't due to a bug in my code. Also, I love efficiency, really, but I think the standard results in parsing general context-free grammars are "good enough" for everyday use (now, if you can figure out how to parse all context-free languages in linear time, you *are* the man, and should promptly write this down and get published...). Most of the problems I see in parser generators are more "software engineering" oriented.

If you're going to write your parsers by hand, you should probably at least look at parsing combinators. If you're using Java, and you're more interested in designing a language (semantics and syntax), you might want to save yourself many of the headaches of parsing and just use a parser generator like JavaCC (note: it's not the best in the world, but it gets the job done; coupled with Java Tree Builder (JTB), you won't be writing semantic actions just to build a stupid parse tree, and it even generates visitor classes you can extend to walk the tree).

Okay, since this is LtU, I want to mention some thoughts I've been bottling up for a while about everything that's wrong with most "industrial" parser generators. In that vein, I am also interested in limiting my critiques to context-free languages (while I would _love_ to be able to write down a syntax + semantics and have a reasonably efficient interpreter pop out, I'm not going to hold my breath).

Most of these are "software engineering" oriented, but I think a really good, scope-limited system for dealing with all aspects of parsing CFGs could really get a lot of traction (and if not, I'll use it...). Oh yeah, if what I'm describing already exists, I beg your pardon for making a fool of myself :).

• Grammar Reuse - I find myself cutting and pasting rules for expressions and other common idioms all the time. Without going beyond CFGs in expressive power, the ability to write "generic" rules à la parsing combinators and reuse them would be a wonderful gift to the world. (Two-level grammars are also interesting here; Van Wijngaarden grammars provide some interesting ideas for expressing grammatical abstraction.)
• Unicode and binary files exist - While there have been past threads on "data description languages" (see this thread), within bounds, a CFG parser generator could do wonders to help in dealing with oddball binary formats; in a similar vein, Unicode exists and I want to parse it too.
• Scannerless parsing would be nice - Seriously, sometimes a scanner is a nice concept, but scanners are the main reason I have problems parsing files with fixed-width fields (which need tokens like "match any 10 characters" alongside the "maximal munch" rule... you do the math...). Besides, I long ago decided that naming my "whitespace" rule "_" would make it look nice in the grammar (and make explicit where said whitespace could occur).
• Error handling - JavaCC gives some nice hooks and nice error messages when your input files go wrong. Good "default" error handling, plus "easily extensible" error handling, would be wonderful in a parser generator for common use.
• Parse trees vs. semantic actions - First off, I really don't want to write semantic actions; I just want a tree. I realize that some languages can be parsed and interpreted at the same time, and you can do it all within a semantic action, but most of the time that would never work with what I'm doing, so I never do it. A way to say "I just want a parse tree, and oh yeah, throw away the following tokens" would be cool.

### (while I would _love_ to be

(while I would _love_ to be able to write down a syntax + semantics and have a reasonably efficient interpreter pop out, I'm not going to hold my breath)

Would a reasonable set of parsing combinators plus a semantics defined as a monadic fold over the AST type count? If so, we're arguably there and it's certainly looking doable.

### Have you tried Parsec?

Thank you for sharing your pain with us, Matt. We are here to help. :-)

If Philippa will forgive me for stealing her thunder, I think she is hinting that you should try Haskell's Parsec library. I couldn't agree more; it seems to be exactly what you are looking for!

The following code is intended to show that Parsec has all the features that you have requested. The only thing I have not demonstrated is Parsec's ability to handle Unicode or binary file formats.

Parsec and a literate compiler are bundled with most current distributions of Haskell, so if you copy and paste the code into your favourite text-editor, save it as "parsec-example.lhs" and then show it to your favourite Haskell implementation, it should Just Work(TM). Here is a sample session, using Hugs. Notice that the parser ignores the comments:

~/code/haskell \$ runhaskell parsec-example.lhs
> 1+2 * 3
Add (Nat 1) (Mul (Nat 2) (Nat 3))
> (1+2) * 3 -- This end-of-line comment is ignored
Mul (Add (Nat 1) (Nat 2)) (Nat 3)
> {- Some {- nested -} comments. -} 0001
Nat 1
> 0xA + 0o12 + 10 -- Recognises hex and octal numbers
> 1 + -- trigger a customised error message
"typed in expression" (line 1, column 32):
unexpected end of input
expecting simple expression
>
Done


(If you prefer to use ghc, you can compile the code by invoking something like ghc --make parsec-example.lhs -o parsec-example-executable. You might encounter a slight bug in the REPL, causing it to print the ">" in the wrong place. I can't be bothered to fix it right now.)

Without further ado, I present the demonstration code. Happy parsing!

A demonstration of the mighty powers of Parsec.  Based on an
example from the Parsec Users' Guide.

> import Text.ParserCombinators.Parsec
> import Text.ParserCombinators.Parsec.Token
> import Text.ParserCombinators.Parsec.Language (haskellDef)
> import Text.ParserCombinators.Parsec.Expr

Define the datatype for the abstract syntax tree that we want the
parser to produce.

> data AST = Nat Integer
>          | Add AST AST
>          | Sub AST AST
>          | Mul AST AST
>          | Div AST AST
>            deriving Show

There is, in principle, no difference between a scanner and a parser,
and the Parsec library is flexible enough to accommodate both.

Here is the specification for the scanner.  Rather than define one
from scratch (which is not difficult), we will adapt the stock Haskell
scanner definition, haskellDef:

> scanner :: TokenParser ()
> scanner  = makeTokenParser
>             (haskellDef {reservedOpNames = ["*","/","+","-"]})

Note that "scanner" is *not* the scanner.  Rather, it is the
specification for a library of scanners.  For instance, invoking
"natural scanner" returns a scanner that understands Haskell's syntax
for natural numbers.  We will also use "reservedOp scanner",
"whiteSpace scanner" and the parser combinator "parens scanner".

"factor" and "expr" define the core of our parser.

> factor =   parens scanner expr
>        <|> (natural scanner >>= return . Nat)
>        <?> "simple expression"

"table" contains the specifications for the operators, their
precedences, associativities, and the associated semantic actions.
(The semantic actions are simply the constructors for the Abstract
Syntax Tree.)

> expr    :: Parser AST
> expr    =   buildExpressionParser table factor
>         <?> "expression"
>     where
>
>       table = [[op "*" Mul AssocLeft, op "/" Div AssocLeft]
>               ,[op "+" Add AssocLeft, op "-" Sub AssocLeft]]
>
>       op s f assoc = Infix (reservedOp scanner s >> return f <?> "operator") assoc

Finishing touches for the parser.

> parser = do whiteSpace scanner -- Skip past leading whitespace
>             x <- expr       -- Run the parser "expr"
>             eof                -- Stop at the end of the input
>             return x           -- Return the result of "expr"

Here is how we invoke the parser and convert its output into a
printable string.

> run input = case (parse parser "typed in expression" input) of
>               (Left err)     -> show err
>               (Right result) -> show result

> main = do putStr "> "
>           input <- getLine
>           if input == ""
>             then putStrLn "Done" >> return ()
>             else putStrLn (run input) >> main


### I'd deliberately not named

I'd deliberately not named Parsec, though it's certainly a good intro to the genre - I figured the value of "good" should be left to the reader, as there's still some work to be done before we can start claiming even a limited perfection. Also, I think the meat of my point was about the monadic folds for semantics - it's less well-understood than the existence of good "just-a-parser" libs (if anyone wants me to expand, just yell?). YMMV :-)

### Ten fingers, but only two eyes

Sorry, Philippa. I should have taken the time to RTFP before jumping in.

if anyone wants me to expand, just yell

I'm yelling!

When you say "monadic folds for semantics" are you referring to things such as Monad Transformers and Modular Interpreters? This paper looks really exciting to me, but I've still got some way to go before I really understand Monad Transformers. Until I get there, maybe the good people of LtU could answer some questions for me?

Like I said, I don't fully grok this yet, but from reading that paper, it seems as though we are almost there. (Which is what Philippa just said.) Are my impressions correct? Is it just a matter of developing these proof-of-concept implementations into a set of industrial-grade libraries? If not, what is still missing?

### I'm not directly referring

I'm not directly referring to monad transformers, although the paper also effectively gives plenty of examples of what I'm talking about in the process (the modularity aspect is effectively taking it a step further, although often at some run-time cost). It's also pretty simple to implement monads as interpreters as well if you've got GADTs available - you build an AST with a constructor for each of the monadic operations (including the ones specific to the monad in question) and run with it. If you'd rather a more traditional monad implementation you can get it with a little refactoring, with the non-AST parameters to the interpreter becoming the monad type.

The fold part's perhaps more interesting - you can use the monad to supply an abstract machine (so at a bare minimum it probably includes an environment), and then the rest of the interpreter becomes a simple traversal of the AST - one expressible as a fold computation within the monad.

### It's just in a different guise

Your parser is a first-order, trampolined implementation of the usual backtracking parser using success and failure continuations:

1. Start with a recursive-descent backtracking parser using success and failure continuations (note that already all calls are tail calls)
2. Trampoline the parser
3. Defunctionalize the continuations. Depending on various implementation details, a defunctionalized success continuation will be essentially a stack of parsers ("sequence" parsers). A defunctionalized failure continuation will be a stack of parsers each along with whatever other goodies you need to backtrack (things like saved states of the input stream).
4. Split the defunctionalized failure continuation into two exactly parallel stacks, one of parsers ("choice" parsers) and one of the other stuff ("contexts")
5. Voila
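A compact sketch of where steps 3 and 4 land (my own code, under the assumptions above): the defunctionalized success continuation becomes a stack of pending parsers, and the failure continuation is split into two parallel stacks of choice parsers and contexts. Here a context saves the input position and a copy of the success stack, which is what makes the backtracking full rather than local:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class Defunc {
    interface Parser { boolean step(Defunc d); }

    static class Context {                       // the "other goodies" for backtracking
        final int pos;
        final ArrayDeque<Parser> savedSuccess;
        Context(int pos, ArrayDeque<Parser> s) { this.pos = pos; this.savedSuccess = s; }
    }

    final String input;
    int pos = 0;
    ArrayDeque<Parser> success = new ArrayDeque<>();    // defunctionalized success continuation
    final Deque<Parser> choices = new ArrayDeque<>();   // defunctionalized failure continuation...
    final Deque<Context> contexts = new ArrayDeque<>(); // ...split into two parallel stacks

    Defunc(String input) { this.input = input; }

    static Parser term(char c) {
        return d -> d.pos < d.input.length() && d.input.charAt(d.pos++) == c;
    }
    static Parser seq(Parser a, Parser b) {
        return d -> { d.success.push(b); d.success.push(a); return true; };
    }
    static Parser choice(Parser a, Parser b) {
        return d -> {
            d.choices.push(b);                   // the alternative to try on failure
            d.contexts.push(new Context(d.pos, new ArrayDeque<>(d.success)));
            d.success.push(a);
            return true;
        };
    }

    boolean run(Parser start) {
        success.push(start);
        while (!success.isEmpty()) {
            if (!success.pop().step(this)) {           // a parser failed...
                if (choices.isEmpty()) return false;   // ...with nothing left to try
                Context ctx = contexts.pop();          // pop both stacks together
                pos = ctx.pos;                         // restore the input position
                success = new ArrayDeque<>(ctx.savedSuccess);
                success.push(choices.pop());           // resume with the alternative
            }
        }
        return pos == input.length();
    }

    public static void main(String[] args) {
        // r1 ::= 'a' | 'a' 'b';  r2 ::= r1 'c' -- full backtracking succeeds on "abc"
        Parser r1 = choice(term('a'), seq(term('a'), term('b')));
        Parser r2 = seq(r1, term('c'));
        System.out.println(new Defunc("abc").run(r2));  // prints true
        System.out.println(new Defunc("ac").run(r2));   // prints true
    }
}
```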

### A paper on trampolining.

I've found it here (on LtU of course!).

### It's even in the research

It's even in the research papers page. It's an old favorite.

### another place this trick is

another place this trick is useful is writing javascript code for browsers. the only way (i know of) to get a screen update is for the javascript code to terminate. if you are doing some major processing that's a problem, because you'd like regular screen updates. by transforming your code in this way you can break between iterations (restart from the next "continuation" using the timer - plain ol javascript doesn't have continuations, but does have closures, so it's pretty easy).

i've written code that draws "spirograph" patterns (javascript constructing svg in the dom) and "typesets" documents (javascript constructing html dom) using this (only works on firefox, afaik).