C++ LL(1) Parser Combinator Library

I have put a C++ LL(1) parser combinator library on GitHub https://github.com/keean/Parser-Combinators. There is an example to show how it is used, and there is a simple recursive decent parser for comparison. The parser-combinators benchmark 20% faster than the simple parser. I would appreciate comments on the code, the choice of combinators (I want to keep the number to a minimum without compromising usefulness), and the comparative readability of code using the combinators vs the simple parser, or any other thoughts.

There are two basic string recognisers ('accept', 'expect'), two nullary parsers ('succ', 'fail'), and two parser-constructors ('all', 'any') that take a user supplied functor and a variadic argument list of parsers or recognisers. The functor for the parser constructors is variadic, so it is passed one result argument for each parser, 'all' only calls the functor if all the parsers succeed and provides all arguments, 'any' calls the functor as soon as the first parser from the left succeeds with an index number indicating which argument has the valid result (the remaining arguments are initialised with the default constructor). Finally there are four combinators that can be used on both recognisers and parsers '&&', '||' which behave as they do in boolean logic (with short-circuit evaluation), 'many' and 'discard'.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Vs. Boost::Spirit?

Sorry for the easy comment, but how does your library compare with Boost::Spirit? Is the only difference being LL(1)?

Even if you know about Spirit, you need to discuss the difference somewhere.

Small and Independent

The quick answer is that it is a small library (~500 lines) and does not depend on boost. It was designed from a clean sheet, and tries to implement a minimal set of combinators. Spirit tries to fit operators to parsing so that they obey the C++ operator precedence (including overloading *), this only uses operators where their semantics obviously fit (&&, ||) and uses names elsewhere in line with the Concepts (as in the Stepanov definiton) underlying these operators.

In more detail, I think its a cleaner implementation, it doesn't use the curiously recurring template pattern (which I don't really understand the need for as you don't need a base class for static polymorphism). Also all parsers are treated as const values, as are any user-functors. You can of course pass a mutable object into your user-functor as a constructor argument for state (as shown in the example_expression parser). Parsers do not return a match object, rather you pass in a data-structure to store the results as an optional argument, which if null means the results are never stored. There are no parser-contexts, this is achieved through the user-functors and 'any' and 'all' combinators. There may be more differences I haven't noticed yet.

A further difference is that the purpose of posting it for discussion is to get feedback that may alter and improve the design, something that is harder to do with an established library even if you fork it (due to code size, complexity and dependencies).

Genius

Man, I think you're a genius in C++ but I just don't believe you got it right; or it is severely restricted. One of the nicer things of functional parser combinators is that you can toss analytical information in and I don't think you can get that right in this manner. (I also doubt you can do that in Boost.Spirit.) Maybe you shouldn't care about that since almost all compiler tools only allow for synthesized information.

The other thing is that parser combinators allow, with a bunch of higher-order operators, to be combined into new parser combinators and I don't see that either. You seem to stick parsers together explicitly in code with for and while loops which, somewhere, totally goes against the spirit of the whole exercise.

It's interesting, I am in awe of your skills, but I don't see it. (But I'll admit I have a hard time reading your code.)

which one are you looking at?

Test_simple is the example of using a simple non-combinator parser. Test_combinators is the combinator example, and you can see two functors used in a higher order combinator (all) to produce a parser that returns ints, from comma separated values on a single line, and one an order higher that combines that into a multi-line parser. If I understand you right, you do get the analytical bit, for example you can pass a functor to calculate the sum, rather than return the result in an array. Perhaps an expression evaluator would be a better example?)

Here is the csv parser example:

struct parse_int {
    parse_int() {}
    void operator() (vector<int> *ts, string &num, string &sep) const {
        ts->push_back(stoi(num));
    }
} const parse_int;

struct parse_line {
    parse_line() {}
    void operator() (vector<vector<int>> *ts, vector &line, string &line_sep) const {
        ts->push_back(move(line));
    }
} const parse_line;

auto const recognise_number = some(accept(is_digit));
auto const recognise_space = some(accept(is_space));
auto const recognise_separator = option(accept(is_char(',')) && discard(option(recognise_space)));
auto const parse_csv = some(all(parse_line, some(all(parse_int, recognise_number, recognise_separator)), option(recognise_space)));

You can see we define three string recognisers using combinators, and two user defined 'aggregation' functors. 'all' is a higher-order-function the first argument of which is a map function, the first example parse_int, is used here all(parse_int, recognise_number, recognise_separator) it is passed the result of recognisers recognise_number and recognise_separator as function arguments (only if both succeed), as you can see here void operator() (vector<int> *ts, string &num, string &sep) const

I guess you'll need to show me

Yah. I don't really know what parse_line does though I can take an educated guess at its semantics. (Ah, it's a functor.)

I would like to see this evaluated with your combinators:

(ADD 3 (MUL 5 (DEC 7))

Okay. I think I get it. The confusing thing was the naming. "parse_int" isn't the actual parser but the epsilon rule/attribute evaluation. I would have preferred it if you would have called the parser expression which starts with "all" "parse_int"; and people likely would find it more readable if the epsilon rule would come as the last argument; or you just wrap it around whatever is done in all, it looks too specialized somehow.

(Then the other thing is that the parser combinators I once wrote are probably around a hundred lines in total. I don't get why you wouldn't be able to trivialize it further.)

Makes Sense

Yes, that makes sense. I could call it "epsilon_int" and define "parse_int" as suggested to make it more readable. I'm afraid the variadic syntax needs the variable arguments to come last, so it has to be first. If you think that all arguments after the first are user-defined it kind of makes sense. Also it follows the pattern of a fold function where the accumulator is the first argument.

Fold

Yah. We discussed that and I told you I don't believe fold will buy you much in defining parsers. Attributes just have different types, "SUM" is a piece of text and "1" is an integer. (Though you also do type tricks in C++ I didn't think were possible.) And though lists of things emerge sometimes, they aren't that important usually that you would like to specialize in that direction.

The trick to observe is that if you can't do the attribute evaluation directly in the second level of the CFG part, or maybe don't want to, that you then push the attribute evaluation into epsilon rules which you then trickle around in your grammar. (Bison is an affix grammar with the epsilon rules last.) That would also hopefully keep the (intermediate) types of your parsers/production rules sane.

I didn't think it much trough but a specialized combinator probably would take care of that.

Example Expression Evaluator

I got rid of the 'fold' (which was polymorphic) as the variadic form is much neater (the 'all' and 'any' combinators), you don't need multiple template definitions for each parser type, as single functor receives all the parsers as the appropriately typed argument. You can see this in "return_exp2", the first argument is the result, the rest are the parser results which are of type 'enum op' and 'int' respectively. These types must match the types returned by the parsers in those positions of the 'all' combinator, and this is enforced by the type system.

Here's an expression evaluator, no operator precedence however. These are the static definitions (const objects and types), a nice point to note is that no templates are needed in the user code, thats all contained in the library. The complete example is now on GitHub.

enum op {add = 0, sub = 1, mul = 2, div = 3};

struct return_int {
    return_int() {}
    void operator() (int *res, string &num) const {
        *res = stoi(num);
    }
} const return_int;

struct return_op {
    return_op() {}
    void operator() (enum op *res, int choice, string &add, string &sub, string &mul, string &div) const {
        *res = static_cast<enum op>(choice); // this relies on the order of parsers being the same as the enum.
    }
} const return_op;

struct state {
    int acc;
    state() {};
};

struct return_exp1 {
    state &s;
    return_exp1(state &s) : s(s) {}
    void operator() (int *res, int &left) const {
        s.acc = left;
        *res = s.acc;
    }
};

struct return_exp2 {
    state &s;
    return_exp2(state &s) : s(s) {}
    void operator() (int *res, enum op &opr, int &right) const {
        switch (opr) {
            case op::add:
                s.acc += right;
                break;
            case op::sub:
                s.acc -= right;
                break;
            case op::mul:
                s.acc *= right;
                break;
            case op::div:
                s.acc /= right;
                break;
        }
        *res = s.acc;
    }
};

auto const recognise_number = some(accept(is_digit));
auto const recognise_space = many(accept(is_space));
auto const parse_operand = all(return_int, recognise_number) && discard(recognise_space);
auto const parse_operator = any(return_op, accept(is_char('+')), accept(is_char('-')),
    accept(is_char('*')), accept(is_char('/'))) && discard(recognise_space);

Here is the stateful bit called from inside a function, there's no global state so it should be thread-safe:

state my_state;
return_exp1 e1(my_state);
return_exp2 e2(my_state);
auto const parse = all(e1, parse_operand) && many(all(e2, parse_operator, parse_operand));
decltype(parse)::result_type a;
b = parse(in, &a);

This correctly evaluates 3 * 2 + 1 as 7. There is one slight 'trick' used here, because the result type is a plain int (parsed by nullable-reference, aka pointer) it gets repeatedly overwritten as the parser progresses, as return_exp1, and return_exp2 return the current accumulator state. You could easily replace this with a vector<int> and push each update into this, which would give a history/timeline of the accumulator as the expression evaluation progressed.

Nice but not good enough for me

Well. That took me a while. You're an interesting exercise in getting my own arguments right, as well as upgrading my C++.

Passing information between parsers under water with a hidden state looks like a pretty bad idea to me. This works on your trivial grammar without operator precedence rules which evaluates left to right but I don't see it scale. Why don't you just wrap the parse definition in another "all" rule? (If you can't, then I'ld say that hints at an expressive weakness in your library.)

I understand that stuff automagically going out of scope and turning into garbage is a problem in C++ but I don't think this is the manner to handle that.

Ah well. You rely on deterministic LL(1) over character streams, which simplifies a number of problems, whereas I want (backtracking) LL(k) over a search space which introduces its own quagmires. It's just not general enough for me.

as many as you like

You can wrap with as many 'all' combinators as you like.

Passing information in hidden state is the same way monad based combinators work. I can make it so there is no hidden state, but why would I want to do that. Are there any actually reasons not to do this, or is it just a vague personal preference. I would need to be convinced the alternative is better.

I think you should look at Boost Spirit and Phoenix, as it does LL(k) and every combinator can have "attributes" using the Phoenix functional library.

Complex interpreters

If you want to do more than just evaluate tiny arithmetical expressions it quickly becomes obvious that you don't want to have a state shared between different expression parsers.

Of course I look at Boost Spirit, I am figuring that out at the moment. But I have higher-order backtracking LL(k) attribute grammar specifications. And I use everything of that. Fifty lines of code.

I doubt Spirit can express analytical information and I am almost completely certain it doesn't have higher-order functionality. But still reading.

It's just an academic exercise to keep me busy though. I don't care much about it.

(Ah well. They call analytical information "inherited" these days. Lousy. Oh. It's something different.)

Passing information between

Passing information between parsers under water with a hidden state looks like a pretty bad idea to me.

I think it could be done pretty well with (for example) a concatenative programming metaphor. Parse this, parse that, pop two parsed items off the stack then combine them...

He already does that with map

But in your vocabulary, he doesn't "pop" explicitly between parsing. It just lingers somewhere in the global void. Like a half-baked product.

Not sure what you mean

I am not sure what you mean? Which value is lingering in the global void? Each parser _returns_ its result (as a pass by reference parameter).

What is the simplest example that shows (requires) the higher order behaviour you want?

State

state my_state;

If state would be a symbol table it would make sense. But this? For a piece of synthesized information, an integer? And now it's halfway in between synthesized and global information anyway. Lousy.

(Ah well. I guess you wanted to show that you can thread something like a symbol table. But then make a language where that makes sense like evaluate "1 + 2 * myaccount" where "myaccount" is in a thread scoped context.)

Version without "state"

Here is a version that communicates only using the return values of the parsers without any threaded state:

enum op {add = 0, sub = 1, mul = 2, div = 3};

struct return_int {
    return_int() {}
    void operator() (int *res, string const& num) const {
        *res = stoi(num);
    }
} const return_int;

struct return_op {
    return_op() {}
    void operator() (
        enum op *res, int choice, string const& add,
        string const& sub, string const& mul, string const& div
    ) const {
        *res = static_cast<enum op>(choice);
    }
} const return_op;

struct return_left {
    return_left() {}
    void operator() (int *res, int left) const {
        *res = left;
    }
} const return_left;

struct return_right {
    return_right() {}
    void operator() (int *res, enum op opr, int right) const {
        switch (opr) {
            case op::add:
                *res += right;
                break;
            case op::sub:
                *res -= right;
                break;
            case op::mul:
                *res *= right;
                break;
            case op::div:
                *res /= right;
                break;
        }
    }
} const return_right;

auto const recognise_number = some(accept(is_digit));
auto const recognise_space = many(accept(is_space));
auto const parse_operand = discard(recognise_space) && all(return_int, recognise_number);
auto const parse_operator = discard(recognise_space) && any(return_op, accept(is_char('+')), accept(is_char('-')),
    accept(is_char('*')), accept(is_char('/')));
auto const parse = all(return_left, parse_operand) && many(all(return_right, parse_operator, parse_operand));

Now the parser is entirely const, so all the definition is in the const type section, with just a simple call in the function:

        decltype(parse)::result_type a;
        b = parse(in, &a);

Is that better?

Oh

It was already quite good; and the C++ is excellent as far as I can judge (I can't, really.) You would need to test it against a number of grammars but it convinces me one can write reasonable LL(1) interpreters over character streams with your solution.

You're still stuck with a bit of an academic exercise, sorry to inform you. Some people will still prefer pointer bumping for lexing, even if this solves their problem more declarative, and some lexemes are sometimes complex even when written as regular expressions (which will be a pain to express LL(1)).

If I were you I would put it into a blog post, explain some of the ideas behind it, and explain how it's used. Then forget about it. Unless you think they're doing it wrong in Boost Spirit; if you can show people to do it better, that's always interesting.

Meanwhile, I'll think over my own problems and see if I can steal some of your code to fix my own problems. It's a nice manner of learning C++ and (I really don't know yet) you have some more modern ideas on type abuse in C++.

Dogfood

Actually I have a whole heap of things that use parsers that I intend to use it for (including the language I am working on). I am more concerned with the academic exersize than how many people actually use it. That's why I am looking for feedback here. I am looking to contribute towards an ongoing discussion (in code) of the definitive definitions of such things. I really like Stepanov's concept of algorithms as a shared mathematical dialog. I still don't think there is a place for publishing such things which is a shame.

My conclusion is that Boost Spirit is doing it more or less right, and that Phoenix is interesting. I think if I were to take the next steps to actually using Phoenix then I would be better using Spirit too, although I prefer the simpler way I handled return values polymorphically, leave threading state to the user, and don't try and overload every operator breaking the algebraic axioms of many operators. For example unary * normally is a pointer dereference, using it as a prefix kleene star makes things hard to read at best.

What is the simplest example for the higher order behavior?

Parsing a parenthesized term. With the functional approach you can write a user-defined function which takes a parser and returns a parser which parses the production rule enclosed by parentheses. Looks like your parsers are functions, so it should be trivial.

(It's higher-order since you can write functions over parsers and impossible to express in, for example, Bison or that Moony Early thing here. It's a nice to have.)

That'll be nice to showcase.

polymorphic recursion.

I have been working on a sort of fix point combinator, and I have got it working so that you can add runtime polymorphism at certain points in the parser. Because the parsers all have different types, and its recursive, this is a kind of polymorphic-recursion.

As expected we have to add some runtime polymorphism which is achieved in c++ with a base class with a pure virtual method, and a teplated subclass to hold the parser of any type. We can't directly instanciate the base class so we need a handle class to hold a reference or pointer to the base-class. We are almost there, but we need to handle copying and assignment of the handle. If we do this in a const way we need to cope with the self-reference being uninitialised in the definition, and as the parser objects copy their arguments we need something const they can copy that can point to the as yet uninitialised self reference. If we do this dynamically we need something slightly different, where we have a shared pointer in the handle so it can be copied when assigning the self referential parser to the polymorphic handle.

I just need to either get these two methods to work together, or if I can't, chose the one I prefer. This how the static version looks:

parser_handle<int> p = ... && ref(p);

And the dynamic:

parser_handle<int> p;
...
p = ... && p;

I moved

Well. All the type theory question and language philosophy got me wondering about module systems, the role of impure constructs, and how they solved that in ocaml. Did you notice they already solved the question of applicative vs generative in Ocaml? I.e.: If a class is a reinstantiatable abstract module declaration, then they solved the question.

Since I now believe I want fundamental hard-core impurity over functional methods in my language I wonder how they solved that in ocaml. And how to design my language that way.

I don't know why you do fixed points now, but the question is a bit passe for me at the moment. Sorry. Bit of a butterfly mind.

please elaborate

"fundamental hard-core impurity over functional methods"?

Feelings

I want OO in the sense of reinstantiable module declaration since the reinstantiable implies it'll buy me more than any other module system; at least, in terms of software engineering in the large concepts.

So, a module systems where values, or a value, is part of the interface declaration and the rest is mostly pure, but sometimes impure, so impure anyway. (If default OO is OO over an imperative language I want OO over an impure functional language. So, Ocaml.)

Think: reinstantiable signatures with values as first-class citizens of the language.

(It'll be done somewhere else. So I am looking at Ocaml, and I'll have a look at OHaskell sometime. Possibly I don't need much more than the interfaces I have in my language anyway. But I am just goofing around with my spare time; no idea whether I'll do it. Maybe I'll just decide that if type theoretical questions can't be solved I'll just not solve them and go untyped anyway.)

(Then again. Without types I'll end up implementing Lisp/Javascript so that doesn't add up too.)

You mean First-class modules/virtual classes?

I'm trying to understand if what you mean in the end is some variation of "first-class modules" (in the functional programming community) and/or some variation of virtual classes — including Scala itself, path-dependent types, and lots more I'm not expert on (including the Beta language, dependent classes). If so, Ocaml is not appropriate IIUC because it does not have first-class modules. Viceversa, Scala is designed to subsume ML modules — see the intro of "A nominal theory of objects with dependent types".
Moreover, those names might help you in your search.

If you go this road, you'll end up facing a different question (which often is mixed up with the above): should you just make ML modules first-class, or should you switch to mixin modules (as in Scala or in Derek Dreyer's work)?
IMO, mixins face similar tradeoffs with inheritance — they're really convenient (because they remove explicit wiring you'd often need with delegation) but they don't enforce modular reasoning (similarly to aspects — aspects and before/after advices come from CLOS mixins, so the similarity is rather deep). It is somewhat clear by now that aspects are not the right direction, but the field is still open. (Some work simply generates the wiring as needed, such as "Forsaking Inheritance: Supercharged Delegation in DelphJ", but I don't know the details).

Final warning: it's probably not obvious that the different things I'm talking about have anything to do with each other. IMHO studying the literature makes it somewhat clear (see "A nominal theory of objects with dependent types"'s intro, again), and that's probably known to the experts in the field, but I don't know if anybody's bothered to write it down. If not, such a write-up could IMHO be published (I'd be interested in working on this, in my copious free time).

First Class Modules, Meta Classes

This is close to where I am going, which is first class modules (hence the type system with principle typings). Classes and other similar things will be implemented as meta language constructs (predicates on types).

Scala does not subsume ML modules

Nit: Scala does not subsume ML modules, see e.g. my recent answer on SO. Also, the νObj calculus is quite different from actual Scala: more expressive but less decidable.

More feelings

No, as I said, and you indicated to understand with the word "road," it's a general idea of the direction I want to take a typechecker once I get interested.

The idea is to conflate, possibly, higher-rank type impure type constructors as a first citizen within the language.

And since I want neither MLs value restriction as well as neither co-/contravariance that implies a number of things.

Don't think I will solve it but, yeah, interested.

Static Polymorphism

So the parser combinators build a parser using static polymorphism. This is like compile time generative programming. The result of executing the template program at compile time is a specialised parser for the grammar. That makes it fast and efficient. Rather than using runtime polymorphism everywhere and paying an order of magnitude performance penalty we invoke runtime polymorphism explicitly where needed. This turns out to be precisely any point of recursion (hence reference to fix-point). I have uploaded the code to github as I am mostly happy with it. There are currently two types of reference I would like to combine, and I need to update the example.

F-ing modules

You probably already know about this paper, but if not, maybe it can help you.

Yeah I scanned it

Not interested. Or not interested, yet. Unsure. (I am reading the Scala language definition which is far more appropiate for the simple things I want.)

Anyway, the general informal ideas I have are:

I want to be able to type

f = \x. (1, f)

I would like to type, without higher ranks,

g = \f x y. (f x, f y)

And I want to conflate data type, mutable object, and module. And avoid covariance and contravariance.

And that all in some bash replacement; uh, no REPL though. I.e., implemented in C, and reasonably fast. Uh, and the other stuff I had in my language I guess, like exceptions and type cases and stuff.

Updated Expression Parser Example

I have updated the example_expression to parse nested binary expressions using the polymorphic recursion combinators I added to the parser combinator library. I also added a logging parser to help debug these recursive parsers. You can see the example in full on GitHub, but here is what the actutal recursive parser looks like in the two available methods (minus all the simple parser definitions). Method 1:

parser_reference<int> expression_parser1;
auto const expression1 = discard(recognise_start) && all(return_exp,
        log("left", expression_parser1 || parse_operand),
        log("op", parse_operator),
        log("right", expression_parser1 || parse_operand)
    ) && discard(recognise_end);

Those are const global, you need to close the recursion dynamically in code before using like this:

expression_parser1 = expression1;

Method 2:

Has the advantage of describing the recursion entirely in the const top-level definition, at the cost of having to wrap every self-reference with a "reference" parser combinator.

parser_handle<int> expression_parser2 = discard(recognise_start) && all(return_exp,
        log("left", reference(expression_parser2) || parse_operand),
        log("op", parse_operator),
        log("right", reference(expression_parser2) || parse_operand)
    ) && discard(recognise_end);

As the methods are quite similar, I may remove one after writing a few parsers to see which one is easier to use in practice.

Open recursion

If I correctly understand that the first allows open recursion, then you may want to keep it for modularity reasons (despite the fact that most often you don't actually need the extra flexibility and the second form is just fine). Here is the use-case I have in mind:

  additive_expr(expr1) = (expr1 '+' expr1) | (expr1 '-' expr1)
  mutilplicative_expr(expr1) = (expr1 '*' expr1) | (expr1 '/' expr1)

  expr = additive_expr(expr) | multiplicative_expr(expr)

You are able to separately define the additive and the multiplicative operation, and later tie the knot to get expressions that may have either form of operators at any recursion level. Note the difference with:

  expr1 = (expr1 '+' expr1) | (expr1 '-' expr1)
  expr2 = (expr2 '*' expr2) | (expr2 '/' expr2)
  expr = expr1 | expr2

Backtracking

That would make sense if the parser was LL(K), it seems of limited use with LL(1)... but might still be useful. Maybe I will just have to implement more backtracking.

In another discussion I was

In another discussion I was pointed to uu-parsinglib. From what I've seen so far it is really amazing. See also this paper: http://www.staff.science.uu.nl/~swier101/Papers/2003/p224-swierstra.pdf

They have unlimited lookahead while still being incremental by doing breadth first instead of depth first backtracking. Another amazing thing is that they can handle ambiguous grammars in polynomial time with an amb combinator: amb : Parser<a> -> Parser<List<a>>. It's beautiful how this integrates controlled nondeterminism in otherwise deterministic parsers so you don't have to buy into nondeterminsm everywhere.

So if you do want to implement backtracking at some point, it's probably worth looking at how they do that.

where is the number parser?

Question about the above? Where is the parser for plain numbers? Would it be:

expr = additive_expr(expr) | multiplicative_expr(expr) | parse_int

Also is there a left recursion problem here?

Added simple backtracking

I have added simple backtracking, the performance hit is not too bad, and its still faster than the simple parser with no backtracking (using the same LL(1) grammars).

I have updated the expression example to the one given by gasche above, modified so that it works (brackets for sub expressions, otherwise the whole thing left-recurses on "expr") and a number parser if expression parsers fail. As this is a combined lexer and parser, prefix spaces are discarded. The complete parser looks like this:

struct return_int {
    return_int() {}
    void operator() (int *res, string &num) const {
        *res = stoi(num);
    }
} const return_int;

struct return_add {
    return_add() {}
    void operator() (int *res, int left, string&, int right) const {
        *res = left + right;
    }
} const return_add;

struct return_sub {
    return_sub() {}
    void operator() (int *res, int left, string&, int right) const {
        *res = left - right;
    }
} const return_sub;

struct return_mul {
    return_mul() {}
    void operator() (int *res, int left, string&, int right) const {
        *res = left * right;
    }
} const return_mul;

struct return_div {
    return_div() {}
    void operator() (int *res, int left, string&, int right) const {
        *res = left / right;
    }
} const return_div;

auto const recognise_space = many(accept(is_space));
auto const recognise_number = discard(recognise_space) && some(accept(is_digit));
auto const recognise_start = discard(recognise_space) && accept(is_char('('));
auto const recognise_end = discard(recognise_space) && accept(is_char(')'));

parser_handle<int> const additive_expr(parser_handle<int> const e) {
    return log("+", all(return_add, e, discard(recognise_space) && accept(is_char('+')), e))
        || log("-", all(return_sub, e, discard(recognise_space) && accept(is_char('-')), e));
}

parser_handle<int> const multiplicative_expr(parser_handle<int> const e) {
    return log("*", all(return_mul, e, discard(recognise_space) && accept(is_char('*')), e))
        || log("/", all(return_div, e, discard(recognise_space) && accept(is_char('/')), e));
}

parser_handle<int> const expression = (
        discard(recognise_start) && (
            additive_expr(reference(expression)) || multiplicative_expr(reference(expression))
        ) && discard(recognise_end)
    ) || all(return_int, recognise_number);

What concerns me is that the backtracking stack can get pretty big even with simple parsers. It really needs a 'cut' operator like Prolog (although I am not keen on cut in Prolog)... I think I could apply the same solution as my logic interpreter, with nested meta-interpreters (parsers).

My other concern is that the left hand parser result is thrown away if the operator does not match.

I still have to read the paper on Polish parsers, so maybe that will give me some ideas as well.

An idea that occurs to me is that we could build a grammar tree from the parser combinators (so that it appears to support full backtracking), but then interpret that tree in an efficient way.

Backtracking at the hand of the language designer

In my experience the amount of backtracking relies on the language designer. It's a reason why I don't like your parser works on character stream. Informally, not tokenizing blows up the amount of backtracking with a rather large constant.

But backtracking LL(k) isn't a good solution on highly ambiguous grammars out of your control, I agree.

Tokenizing makes no difference

Tokenizing would make no difference to the stack size, as we simply store the stream position on the backtrack stack and then seek to it. The number of choice points and hence the backtrack stack size is the same tokenized or non-tokenized. The stream-buf buffer size is fixed, so there is no difference there, and its up to the operating system how it uses its page cache to satisfy seek requests. Userspace buffering seems to make very little performance difference.

The problem has little to do with the ambiguity of the grammar. Take a CSV file where each number is separated by a comma or a newline. before parsing each number we have to create a backtrack checkpoint in case something fails later, so with a 1000 by 1000 CSV you get 1 million checkpoints on the backtrack stack. So its each choice point (in this case comma or return) that causes the problem, you really need a cut operator after such choices to indicate that later failures cannot undo earlier choices. What I really need is a backtrack stack of deferred parsers, rather than stream positions. That way you can perform a cut by just popping the top parser off the stack...

The problem with tokenisers is sequences like "name123" and "name 123". You have to have a global definition of what a name looks like, and then there are literal strings etc...

Too bored to answer

Tokenizing would make no difference to the stack size, as we simply store the stream position on the backtrack stack and then seek to it.

Man. Too bored to answer and read your code. The above statement can't be true unless you're reinventing a form of bottom-up parsing. Maybe you are, no idea.

Does anyone still use tokenizers?

I didn't think anyone still used tokenizers due to the problems with string literals etc. You end up with a horrible mess trying to feed information back to the tokenizer from the parser...

You're a bit peculiar?

Uhm. Everybody uses tokenizers? And all attempts to do without have failed? Unless you know more than I do, of course.

Holy grail of LL(1)

As you note it also lists the disadvantages of LL(1). Ever seen a TeX error? Though I guess there were good reasons to design TeX the manner in which it was designed and nobody has done a better job so far.

Knuth is one of the few computer scientists I take seriously.

As I stated, backtracking LL(k) is a 95% solution for PLs. The there mentioned author is interested in 100% solutions. Unfortunately, backtracking higher-order attribute grammar parsers probably go beyond what is solvable in his system. But, yeah, research.

Anecdotal Evidence

I have never used a tokenizer outside of academic requirements. All the real world parsers I have implemented have been scannerless. I have a lot of respect for Knuth, and am proud to be an account holder at the bank of San Serriffe, although I could do with saving a bit more :-)

In any case, maybe I could arrange it so you could use an optional tokenizer, I just wanted to avoid another template parameter, or the cost of virtual functions.

I am now seriously considering building an AST of the grammar from the combinators, allowing grammar composition, followed by a tree rewrite into a character at a time parser with no backtracking. What do you think?

Don't think a lot

Well, if you can pull it off with your language you should. I am not stopping you.

You might try a grammar work bench

An idea I have. I actually used a grammar work bench a long time ago to check whether a certain grammar met certain conditions. Maybe that's an idea? You'll be able to test whether your language meets LL(1) reasonably fast. Not sure they're still in existence though.

Hah. I found your bank account!

I am green with envy.

-

Double post

LR Parser

I'm not happy with the backtracking implementation, and I really like the Polish parsers. I am sure these can be implemented in C++ with object-oriented approach, and lots of runtime polymorphism, and I might have a go, but this would best be done from a clean sheet. So I will branch the simple backtracking and revert the head to the LL1 parser which although limited, I am happy with. Complex parsing can still be done by building an LR shift/reduce parser using the user functors.

My concern with the above is that the grammar is static, so in theory you can statically transform the combinator description into an LR parser, after all that's what tools like yacc and bison do. Neither Haskell nor C++ would seem to be able to do this (The polish parsers paper relies on polymorphic recursion). This is where my language comes in, defining the transform from combinators to shift/reduce in the logic meta-language and the components of the LR parser in the language itself. Of course this is something I can use to implement its own parser later in the bootstrapping, so I may not want to spend too long on the C++ parser.

Backtracking

I have realised I was wrong about the backtracking memory usage. The backtracking stack only grows in the recursive case, and is limited to the choices within one expression. Having started to update a few projects to use the new combinators its clear the backtracking version is easier to use, so I have promoted it back to the master branch in the GitHub repository...

Lazy Tokenisaition

I have been thinking about the comments regarding tokenisers, and have realised there is a mapping between a certain kind of recogniser and a tokeniser. If all recognisers have the format:

tokenise(recogniser) = discard(recognise_space) && recogniser

When you build a parser out of "tokenised" recognisers you get lazy tokenisation, so you don't need a separate tokenisation pass.

In other words you can think of a functor that maps tokenisation as a function on characters, to a function on recognisers. In C++ this functor would be:

template <typename R> auto tokenise (R const& r) -> decltype(discard(many(accept(is_space))) && r) {
    return discard(many(accept(is_space))) && r;
}

Improved Error Reporting

The parsers now have automatically generated EBNF names, which can be overridden with user-defined names, along with the ability to mark parsers as 'strict' which promotes failure to an exception. For strict parsers and for alternatives in an 'or' the beginning and end of the characters consumed by a failed parser are recorded and used in the error report. This results in error reports that look like this:

error parsing let: line 1, column 4.
let x = a b £ c in let y = a b c in y 
   ^--------^
<name - "in" - "let", '=', {expression}-, "in", {expression}->