C++ Parser Combinator Library

I have put a C++ parser combinator library on GitHub: https://github.com/keean/Parser-Combinators. There is an example to show how it is used, and there is a simple recursive descent parser for comparison; the parser-combinator version benchmarks 20% faster than the simple parser. I would appreciate comments on the code, the choice of combinators (I want to keep the number to a minimum without compromising usefulness), the comparative readability of code using the combinators versus the simple parser, or any other thoughts.

There are two basic string recognisers ('accept', 'expect'), two nullary parsers ('succ', 'fail'), and two parser constructors ('all', 'any') that take a user-supplied functor and a variadic list of parsers or recognisers. The functor for a parser constructor is itself variadic, so it is passed one result argument for each parser. 'all' calls the functor only if all the parsers succeed, supplying all the arguments; 'any' calls the functor as soon as the first parser from the left succeeds, with an index number indicating which argument holds the valid result (the remaining arguments are default-constructed). Finally there are four combinators that can be used on both recognisers and parsers: '&&' and '||', which behave as they do in boolean logic (with short-circuit evaluation), plus 'many' and 'discard'.
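
For example, a parser for a decimal number can be built like this (a minimal sketch; 'to_int' is just an illustrative user functor, and 'some' is the one-or-more form of 'many' used in the examples further down):

struct to_int {
    to_int() {}
    void operator() (int *res, string &digits) const {
        *res = stoi(digits); // convert the matched characters to an int
    }
} const to_int;

auto const recognise_digits = some(accept(is_digit));    // recogniser: one or more digits
auto const parse_number = all(to_int, recognise_digits); // parser: passes the matched string to to_int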

Edit: The latest version of this library supports full backtracking parser combinators, and is different from other C++ parser combinator libraries because it separates static and dynamic polymorphism. By limiting dynamic polymorphism to only where actually needed, the compiler can inline the combinators (which are function-objects) resulting in performance better than the non-combinator hand-written recursive descent parsers.

Vs. Boost::Spirit?

Sorry for the easy comment, but how does your library compare with Boost::Spirit? Is the only difference that it is LL(1)?

Even if you know about Spirit, you need to discuss the difference somewhere.

Small and Independent

The quick answer is that it is a small library (~500 lines) and does not depend on Boost. It was designed from a clean sheet, and tries to implement a minimal set of combinators. Spirit tries to fit operators to parsing so that they obey the C++ operator precedence (including overloading *); this library only uses operators where their semantics obviously fit (&&, ||) and uses names elsewhere, in line with the Concepts (in the Stepanov definition) underlying these operators.

In more detail, I think it's a cleaner implementation: it doesn't use the curiously recurring template pattern (I don't really see the need for it, as you don't need a base class for static polymorphism). Also, all parsers are treated as const values, as are any user functors. You can of course pass a mutable object into your user functor as a constructor argument for state (as shown in the example_expression parser). Parsers do not return a match object; instead you pass in a data structure to store the results as an optional argument, and if it is null the results are never stored. There are no parser contexts; the same effect is achieved through the user functors and the 'any' and 'all' combinators. There may be more differences I haven't noticed yet.

A further difference is that the purpose of posting it for discussion is to get feedback that may alter and improve the design, something that is harder to do with an established library even if you fork it (due to code size, complexity and dependencies).

Edit: It appears the principal difference is in the handling of polymorphism. My library handles nearly all the polymorphism statically, allowing inlined 'compiled' parsers to be generated and specialised for each grammar fragment. Dynamic (runtime) polymorphism is only used where explicitly requested by the programmer by use of a 'parser_handle'.

Genius

Man, I think you're a genius in C++ but I just don't believe you got it right; or it is severely restricted. One of the nicer things about functional parser combinators is that you can toss analytical information in, and I don't think you can get that right in this manner. (I also doubt you can do that in Boost.Spirit.) Maybe you shouldn't care about that, since almost all compiler tools only allow for synthesized information.

The other thing is that parser combinators can, with a bunch of higher-order operators, be combined into new parser combinators, and I don't see that either. You seem to stick parsers together explicitly in code with for and while loops, which somewhere goes totally against the spirit of the whole exercise.

It's interesting, I am in awe of your skills, but I don't see it. (But I'll admit I have a hard time reading your code.)

which one are you looking at?

Test_simple is the example of a simple non-combinator parser, and test_combinators is the combinator example. In the latter you can see two functors used in a higher-order combinator ('all') to produce a parser that returns ints from comma-separated values on a single line, and one an order higher that combines that into a multi-line parser. If I understand you right, you do get the analytical bit: for example, you can pass a functor that calculates the sum rather than returning the result in an array. Perhaps an expression evaluator would be a better example?

Here is the csv parser example:

struct parse_int {
    parse_int() {}
    void operator() (vector<int> *ts, string &num, string &sep) const {
        ts->push_back(stoi(num));
    }
} const parse_int;

struct parse_line {
    parse_line() {}
    void operator() (vector<vector<int>> *ts, vector<int> &line, string &line_sep) const {
        ts->push_back(move(line));
    }
} const parse_line;

auto const recognise_number = some(accept(is_digit));
auto const recognise_space = some(accept(is_space));
auto const recognise_separator = option(accept(is_char(',')) && discard(option(recognise_space)));
auto const parse_csv = some(all(parse_line, some(all(parse_int, recognise_number, recognise_separator)), option(recognise_space)));

You can see we define three string recognisers using combinators, and two user-defined 'aggregation' functors. 'all' is a higher-order function whose first argument is a map function; the first example, parse_int, is used here as all(parse_int, recognise_number, recognise_separator). It is passed the results of the recognisers recognise_number and recognise_separator as function arguments (only if both succeed), as you can see from its signature: void operator() (vector<int> *ts, string &num, string &sep) const.

I guess you'll need to show me

Yah. I don't really know what parse_line does though I can take an educated guess at its semantics. (Ah, it's a functor.)

I would like to see this evaluated with your combinators:

(ADD 3 (MUL 5 (DEC 7)))

Okay. I think I get it. The confusing thing was the naming: "parse_int" isn't the actual parser but the epsilon rule/attribute evaluation. I would have preferred it if you had called the parser expression which starts with "all" "parse_int"; and people would likely find it more readable if the epsilon rule came as the last argument, or if you just wrapped it around whatever is done in 'all'; it looks too specialized somehow.

(Then the other thing is that the parser combinators I once wrote are probably around a hundred lines in total. I don't get why you wouldn't be able to trivialize it further.)

Makes Sense

Yes, that makes sense. I could call it "epsilon_int" and define "parse_int" as suggested to make it more readable. I'm afraid the variadic syntax needs the variable arguments to come last, so the functor has to come first. If you think of all arguments after the first as user-defined it kind of makes sense. Also it follows the pattern of a fold function, where the accumulator is the first argument.
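
For illustration, the renaming would look like this (a sketch reusing the definitions from the CSV example above):

struct epsilon_int {
    epsilon_int() {}
    void operator() (vector<int> *ts, string &num, string &sep) const {
        ts->push_back(stoi(num)); // the attribute evaluation, separated from the parser
    }
} const epsilon_int;

auto const parse_int = all(epsilon_int, recognise_number, recognise_separator);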

Fold

Yah. We discussed that and I told you I don't believe fold will buy you much in defining parsers. Attributes just have different types: "SUM" is a piece of text and "1" is an integer. (Though you also do type tricks in C++ I didn't think were possible.) And though lists of things emerge sometimes, they usually aren't so important that you would want to specialize in that direction.

The trick to observe is that if you can't do the attribute evaluation directly in the second level of the CFG part, or maybe don't want to, you then push the attribute evaluation into epsilon rules which you trickle around in your grammar. (Bison is an affix grammar with the epsilon rules last.) That would also hopefully keep the (intermediate) types of your parsers/production rules sane.

I didn't think it through much, but a specialized combinator would probably take care of that.

Example Expression Evaluator

I got rid of the 'fold' (which was polymorphic) as the variadic form (the 'all' and 'any' combinators) is much neater: you don't need multiple template definitions for each parser type, as a single functor receives all the parser results as appropriately typed arguments. You can see this in "return_exp2": the first argument is the result, and the rest are the parser results, which are of type 'enum op' and 'int' respectively. These types must match the types returned by the parsers in those positions of the 'all' combinator, and this is enforced by the type system.

Here's an expression evaluator, with no operator precedence however. These are the static definitions (const objects and types); a nice point to note is that no templates are needed in the user code, that's all contained in the library. The complete example is now on GitHub.

enum op {add = 0, sub = 1, mul = 2, div = 3};

struct return_int {
    return_int() {}
    void operator() (int *res, string &num) const {
        *res = stoi(num);
    }
} const return_int;

struct return_op {
    return_op() {}
    void operator() (enum op *res, int choice, string &add, string &sub, string &mul, string &div) const {
        *res = static_cast<enum op>(choice); // this relies on the order of parsers being the same as the enum.
    }
} const return_op;

struct state {
    int acc;
    state() {};
};

struct return_exp1 {
    state &s;
    return_exp1(state &s) : s(s) {}
    void operator() (int *res, int &left) const {
        s.acc = left;
        *res = s.acc;
    }
};

struct return_exp2 {
    state &s;
    return_exp2(state &s) : s(s) {}
    void operator() (int *res, enum op &opr, int &right) const {
        switch (opr) {
            case op::add:
                s.acc += right;
                break;
            case op::sub:
                s.acc -= right;
                break;
            case op::mul:
                s.acc *= right;
                break;
            case op::div:
                s.acc /= right;
                break;
        }
        *res = s.acc;
    }
};

auto const recognise_number = some(accept(is_digit));
auto const recognise_space = many(accept(is_space));
auto const parse_operand = all(return_int, recognise_number) && discard(recognise_space);
auto const parse_operator = any(return_op, accept(is_char('+')), accept(is_char('-')),
    accept(is_char('*')), accept(is_char('/'))) && discard(recognise_space);

Here is the stateful bit, called from inside a function; there's no global state, so it should be thread-safe:

state my_state;
return_exp1 e1(my_state);
return_exp2 e2(my_state);
auto const parse = all(e1, parse_operand) && many(all(e2, parse_operator, parse_operand));
decltype(parse)::result_type a;
b = parse(in, &a);

This correctly evaluates 3 * 2 + 1 as 7. There is one slight 'trick' used here: because the result type is a plain int (passed by nullable reference, aka pointer) it gets repeatedly overwritten as the parser progresses, as return_exp1 and return_exp2 return the current accumulator state. You could easily replace this with a vector<int> and push each update into it, which would give a history/timeline of the accumulator as the expression evaluation progressed.
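
The history variant would look roughly like this (a sketch; return_exp1 would change in the same way, so that the combined parser's result type is vector<int>):

struct return_exp2_history {
    state &s;
    return_exp2_history(state &s) : s(s) {}
    void operator() (vector<int> *res, enum op &opr, int &right) const {
        switch (opr) {
            case op::add: s.acc += right; break;
            case op::sub: s.acc -= right; break;
            case op::mul: s.acc *= right; break;
            case op::div: s.acc /= right; break;
        }
        res->push_back(s.acc); // record the accumulator after each step
    }
};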

Nice but not good enough for me

Well. That took me a while. You're an interesting exercise in getting my own arguments right, as well as upgrading my C++.

Passing information between parsers under water with a hidden state looks like a pretty bad idea to me. This works on your trivial grammar without operator precedence rules, which evaluates left to right, but I don't see it scale. Why don't you just wrap the parse definition in another "all" rule? (If you can't, then I'd say that hints at an expressive weakness in your library.)

I understand that stuff automagically going out of scope and turning into garbage is a problem in C++ but I don't think this is the manner to handle that.

Ah well. You rely on deterministic LL(1) over character streams, which simplifies a number of problems, whereas I want (backtracking) LL(k) over a search space which introduces its own quagmires. It's just not general enough for me.

as many as you like

You can wrap with as many 'all' combinators as you like.

Passing information in hidden state is the same way monad-based combinators work. I can make it so there is no hidden state, but why would I want to do that? Are there any actual reasons not to do this, or is it just a vague personal preference? I would need to be convinced the alternative is better.

I think you should look at Boost Spirit and Phoenix, as it does LL(k) and every combinator can have "attributes" using the Phoenix functional library.

Complex interpreters

If you want to do more than just evaluate tiny arithmetical expressions it quickly becomes obvious that you don't want to have a state shared between different expression parsers.

Of course I am looking at Boost Spirit; I am figuring it out at the moment. But I have higher-order backtracking LL(k) attribute grammar specifications, and I use all of that. Fifty lines of code.

I doubt Spirit can express analytical information and I am almost completely certain it doesn't have higher-order functionality. But still reading.

It's just an academic exercise to keep me busy though. I don't care much about it.

(Ah well. They call analytical information "inherited" these days. Lousy. Oh. It's something different.)

Passing information between

Passing information between parsers under water with a hidden state looks like a pretty bad idea to me.

I think it could be done pretty well with (for example) a concatenative programming metaphor. Parse this, parse that, pop two parsed items off the stack then combine them...

He already does that with map

But in your vocabulary, he doesn't "pop" explicitly between parsing. It just lingers somewhere in the global void. Like a half-baked product.

Not sure what you mean

I am not sure what you mean. Which value is lingering in the global void? Each parser _returns_ its result (as a pass-by-reference parameter).

What is the simplest example that shows (requires) the higher order behaviour you want?

State

state my_state;

If state were a symbol table it would make sense. But this? For a piece of synthesized information, an integer? And now it's halfway between synthesized and global information anyway. Lousy.

(Ah well. I guess you wanted to show that you can thread something like a symbol table. But then make a language where that makes sense like evaluate "1 + 2 * myaccount" where "myaccount" is in a thread scoped context.)

Version without "state"

Here is a version that communicates only using the return values of the parsers without any threaded state:

enum op {add = 0, sub = 1, mul = 2, div = 3};

struct return_int {
    return_int() {}
    void operator() (int *res, string const& num) const {
        *res = stoi(num);
    }
} const return_int;

struct return_op {
    return_op() {}
    void operator() (
        enum op *res, int choice, string const& add,
        string const& sub, string const& mul, string const& div
    ) const {
        *res = static_cast<enum op>(choice);
    }
} const return_op;

struct return_left {
    return_left() {}
    void operator() (int *res, int left) const {
        *res = left;
    }
} const return_left;

struct return_right {
    return_right() {}
    void operator() (int *res, enum op opr, int right) const {
        switch (opr) {
            case op::add:
                *res += right;
                break;
            case op::sub:
                *res -= right;
                break;
            case op::mul:
                *res *= right;
                break;
            case op::div:
                *res /= right;
                break;
        }
    }
} const return_right;

auto const recognise_number = some(accept(is_digit));
auto const recognise_space = many(accept(is_space));
auto const parse_operand = discard(recognise_space) && all(return_int, recognise_number);
auto const parse_operator = discard(recognise_space) && any(return_op, accept(is_char('+')), accept(is_char('-')),
    accept(is_char('*')), accept(is_char('/')));
auto const parse = all(return_left, parse_operand) && many(all(return_right, parse_operator, parse_operand));

Now the parser is entirely const, so all of the definition is in the const declarations, with just a simple call in the function:

        decltype(parse)::result_type a;
        b = parse(in, &a);

Is that better?

Oh

It was already quite good, and the C++ is excellent as far as I can judge (I can't, really). You would need to test it against a number of grammars, but it convinces me one can write reasonable LL(1) interpreters over character streams with your solution.

You're still stuck with a bit of an academic exercise, sorry to inform you. Some people will still prefer pointer bumping for lexing, even if this solves their problem more declaratively, and some lexemes are complex even when written as regular expressions (and will be a pain to express in LL(1)).

If I were you I would put it into a blog post, explain some of the ideas behind it, and explain how it's used. Then forget about it. Unless you think they're doing it wrong in Boost Spirit; if you can show people how to do it better, that's always interesting.

Meanwhile, I'll think over my own problems and see if I can steal some of your code to fix my own problems. It's a nice manner of learning C++ and (I really don't know yet) you have some more modern ideas on type abuse in C++.

Dogfood

Actually I have a whole heap of things that use parsers that I intend to use it for (including the language I am working on). I am more concerned with the academic exercise than with how many people actually use it. That's why I am looking for feedback here: I am looking to contribute towards an ongoing discussion (in code) of the definitive definitions of such things. I really like Stepanov's concept of algorithms as a shared mathematical dialogue. I still don't think there is a place for publishing such things, which is a shame.

My conclusion is that Boost Spirit is doing it more or less right, and that Phoenix is interesting. I think if I were to take the next step and actually use Phoenix then I would be better off using Spirit too, although I prefer the simpler way I handle return values polymorphically, leave threading state to the user, and don't try to overload every operator, breaking the algebraic axioms of many operators. For example, unary * is normally a pointer dereference; using it as a prefix Kleene star makes things hard to read at best.

What is the simplest example for the higher order behavior?

Parsing a parenthesized term. With the functional approach you can write a user-defined function which takes a parser and returns a parser which parses the production rule enclosed by parentheses. Looks like your parsers are functions, so it should be trivial.

(It's higher-order since you can write functions over parsers, which is impossible to express in, for example, Bison or that Moony Early thing here. It's a nice-to-have.)

That'll be nice to showcase.
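
Roughly, something like this should be expressible with the combinators already shown (a sketch; 'parenthesised' is just an illustrative name, not part of the library):

template <typename P>
auto parenthesised(P const& p)
    -> decltype(discard(accept(is_char('('))) && p && discard(accept(is_char(')')))) {
    return discard(accept(is_char('('))) && p && discard(accept(is_char(')'))); // wrap any parser in brackets
}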

polymorphic recursion.

I have been working on a sort of fixed-point combinator, and I have got it working so that you can add runtime polymorphism at certain points in the parser. Because the parsers all have different types, and it's recursive, this is a kind of polymorphic recursion.

As expected we have to add some runtime polymorphism, which is achieved in C++ with a base class that has a pure virtual method, and a templated subclass to hold a parser of any type. We can't directly instantiate the base class, so we need a handle class to hold a reference or pointer to the base class. We are almost there, but we need to handle copying and assignment of the handle. If we do this in a const way we need to cope with the self-reference being uninitialised in the definition, and as the parser objects copy their arguments we need something const they can copy that can point to the as-yet-uninitialised self-reference. If we do this dynamically we need something slightly different, where we have a shared pointer in the handle so it can be copied when assigning the self-referential parser to the polymorphic handle.

I just need either to get these two methods to work together, or, if I can't, choose the one I prefer. This is how the static version looks:

parser_handle<int> p = ... && ref(p);

And the dynamic:

parser_handle<int> p;
...
p = ... && p;
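
In outline, the runtime-polymorphic machinery has roughly this shape (a simplified sketch rather than the actual code, assuming an istream-style input and the usual headers):

// abstract base: the only place a virtual call is needed
template <typename Result>
struct parser_base {
    virtual ~parser_base() {}
    virtual bool parse(istream &in, Result *result) const = 0;
};

// templated subclass that erases the concrete parser type
template <typename Result, typename Parser>
struct parser_wrapper : public parser_base<Result> {
    Parser const p;
    explicit parser_wrapper(Parser const& q) : p(q) {}
    virtual bool parse(istream &in, Result *result) const override {
        return p(in, result); // parsers are function objects
    }
};

// handle holding the base class by shared pointer (the dynamic variant above)
template <typename Result>
struct handle {
    shared_ptr<parser_base<Result>> impl;
    handle() {}
    template <typename Parser>
    handle& operator= (Parser const& q) {
        impl = make_shared<parser_wrapper<Result, Parser>>(q);
        return *this;
    }
    bool operator() (istream &in, Result *result) const {
        return impl->parse(in, result);
    }
};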

I moved

Well. All the type theory questions and language philosophy got me wondering about module systems, the role of impure constructs, and how they solved that in OCaml. Did you notice they already solved the question of applicative vs generative in OCaml? I.e., if a class is a reinstantiatable abstract module declaration, then they solved the question.

Since I now believe I want fundamental hard-core impurity over functional methods in my language, I wonder how they solved that in OCaml, and how to design my language that way.

I don't know why you do fixed points now, but the question is a bit passe for me at the moment. Sorry. Bit of a butterfly mind.

please elaborate

"fundamental hard-core impurity over functional methods"?

Feelings

I want OO in the sense of reinstantiable module declarations, since reinstantiability implies it'll buy me more than any other module system; at least, in terms of software-engineering-in-the-large concepts.

So, a module system where values, or a value, are part of the interface declaration and the rest is mostly pure, but sometimes impure, so impure anyway. (If default OO is OO over an imperative language, I want OO over an impure functional language. So, OCaml.)

Think: reinstantiable signatures with values as first-class citizens of the language.

(It'll be done somewhere else. So I am looking at Ocaml, and I'll have a look at OHaskell sometime. Possibly I don't need much more than the interfaces I have in my language anyway. But I am just goofing around with my spare time; no idea whether I'll do it. Maybe I'll just decide that if type theoretical questions can't be solved I'll just not solve them and go untyped anyway.)

(Then again, without types I'll end up implementing Lisp/JavaScript, so that doesn't add up either.)

You mean First-class modules/virtual classes?

I'm trying to understand whether what you mean in the end is some variation of "first-class modules" (in the functional programming community) and/or some variation of virtual classes, including Scala itself, path-dependent types, and lots more I'm not expert on (including the Beta language and dependent classes). If so, OCaml is not appropriate IIUC because it does not have first-class modules. Vice versa, Scala is designed to subsume ML modules; see the intro of "A nominal theory of objects with dependent types". Moreover, those names might help you in your search.

If you go down this road, you'll end up facing a different question (which often gets mixed up with the above): should you just make ML modules first-class, or should you switch to mixin modules (as in Scala or in Derek Dreyer's work)?
IMO, mixins face similar tradeoffs to inheritance: they're really convenient (because they remove the explicit wiring you'd often need with delegation) but they don't enforce modular reasoning (similarly to aspects; aspects and before/after advice come from CLOS mixins, so the similarity is rather deep). It is somewhat clear by now that aspects are not the right direction, but the field is still open. (Some work simply generates the wiring as needed, such as "Forsaking Inheritance: Supercharged Delegation in DelphJ", but I don't know the details.)

Final warning: it's probably not obvious that the different things I'm talking about have anything to do with each other. IMHO studying the literature makes it somewhat clear (see "A nominal theory of objects with dependent types"'s intro, again), and that's probably known to the experts in the field, but I don't know if anybody's bothered to write it down. If not, such a write-up could IMHO be published (I'd be interested in working on this, in my copious free time).

First Class Modules, Meta Classes

This is close to where I am going, which is first-class modules (hence the type system with principal typings). Classes and other similar things will be implemented as meta-language constructs (predicates on types).

Scala does not subsume ML modules

Nit: Scala does not subsume ML modules, see e.g. my recent answer on SO. Also, the νObj calculus is quite different from actual Scala: more expressive but less decidable.

More feelings

No, as I said, and as you indicated you understand with the word "road", it's a general idea of the direction I want to take a typechecker once I get interested.

The idea is, possibly, to conflate higher-ranked impure type constructors as first-class citizens within the language.

And since I want neither ML's value restriction nor co-/contravariance, that implies a number of things.

Don't think I will solve it but, yeah, interested.

Static Polymorphism

So the parser combinators build a parser using static polymorphism. This is like compile-time generative programming: the result of executing the template program at compile time is a specialised parser for the grammar, which makes it fast and efficient. Rather than using runtime polymorphism everywhere and paying an order-of-magnitude performance penalty, we invoke runtime polymorphism explicitly where needed. This turns out to be precisely any point of recursion (hence the reference to fixed points). I have uploaded the code to GitHub as I am mostly happy with it. There are currently two types of reference I would like to combine, and I need to update the example.
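
To illustrate (the type name in the comment is hypothetical, not the library's actual one):

auto const cell = some(accept(is_digit)) && discard(accept(is_char(',')));
// decltype(cell) is a concrete nested combinator type, something like
// and_t<some_t<accept_t<...>>, discard_t<accept_t<...>>>, known entirely at
// compile time, so the whole fragment can be inlined; a virtual call only
// appears where a parser_handle or reference is introduced at a recursion point.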

F-ing modules

You probably already know about this paper, but if not, maybe it can help you.

Yeah I scanned it

Not interested. Or not interested yet. Unsure. (I am reading the Scala language definition, which is far more appropriate for the simple things I want.)

Anyway, the general informal ideas I have are:

I want to be able to type

f = \x. (1, f)

I would like to type, without higher ranks,

g = \f x y. (f x, f y)

And I want to conflate data type, mutable object, and module. And avoid covariance and contravariance.

And that all in some bash replacement; uh, no REPL though. I.e., implemented in C, and reasonably fast. Uh, and the other stuff I had in my language I guess, like exceptions and type cases and stuff.

Updated Expression Parser Example

I have updated example_expression to parse nested binary expressions using the polymorphic recursion combinators I added to the parser combinator library. I also added a logging parser to help debug these recursive parsers. You can see the example in full on GitHub, but here is what the actual recursive parser looks like with the two available methods (minus all the simple parser definitions). Method 1:

parser_reference<int> expression_parser1;
auto const expression1 = discard(recognise_start) && all(return_exp,
        log("left", expression_parser1 || parse_operand),
        log("op", parse_operator),
        log("right", expression_parser1 || parse_operand)
    ) && discard(recognise_end);

Those are const globals; you need to close the recursion dynamically in code before use, like this:

expression_parser1 = expression1;

Method 2:

This has the advantage of describing the recursion entirely in the const top-level definition, at the cost of having to wrap every self-reference in a "reference" parser combinator.

parser_handle<int> expression_parser2 = discard(recognise_start) && all(return_exp,
        log("left", reference(expression_parser2) || parse_operand),
        log("op", parse_operator),
        log("right", reference(expression_parser2) || parse_operand)
    ) && discard(recognise_end);

As the methods are quite similar, I may remove one after writing a few parsers to see which one is easier to use in practice.

Open recursion

If I correctly understand that the first allows open recursion, then you may want to keep it for modularity reasons (despite the fact that most often you don't actually need the extra flexibility and the second form is just fine). Here is the use-case I have in mind:

  additive_expr(expr1) = (expr1 '+' expr1) | (expr1 '-' expr1)
  multiplicative_expr(expr1) = (expr1 '*' expr1) | (expr1 '/' expr1)

  expr = additive_expr(expr) | multiplicative_expr(expr)

You are able to separately define the additive and the multiplicative operation, and later tie the knot to get expressions that may have either form of operators at any recursion level. Note the difference with:

  expr1 = (expr1 '+' expr1) | (expr1 '-' expr1)
  expr2 = (expr2 '*' expr2) | (expr2 '/' expr2)
  expr = expr1 | expr2

Backtracking

That would make sense if the parser were LL(k); it seems of limited use with LL(1)... but might still be useful. Maybe I will just have to implement more backtracking.

In another discussion I was

In another discussion I was pointed to uu-parsinglib. From what I've seen so far it is really amazing. See also this paper: http://www.staff.science.uu.nl/~swier101/Papers/2003/p224-swierstra.pdf

They have unlimited lookahead while still being incremental, by doing breadth-first instead of depth-first backtracking. Another amazing thing is that they can handle ambiguous grammars in polynomial time with an amb combinator: amb : Parser<a> -> Parser<List<a>>. It's beautiful how this integrates controlled nondeterminism into otherwise deterministic parsers, so you don't have to buy into nondeterminism everywhere.

So if you do want to implement backtracking at some point, it's probably worth looking at how they do that.

where is the number parser?

A question about the above: where is the parser for plain numbers? Would it be:

expr = additive_expr(expr) | multiplicative_expr(expr) | parse_int

Also is there a left recursion problem here?

Added simple backtracking

I have added simple backtracking; the performance hit is not too bad, and it's still faster than the simple parser with no backtracking (using the same LL(1) grammars).

I have updated the expression example to the one given by gasche above, modified so that it works (brackets for sub-expressions, otherwise the whole thing left-recurses on "expr"), with a number parser if the expression parsers fail. As this is a combined lexer and parser, prefix spaces are discarded. The complete parser looks like this:

struct return_int {
    return_int() {}
    void operator() (int *res, string &num) const {
        *res = stoi(num);
    }
} const return_int;

struct return_add {
    return_add() {}
    void operator() (int *res, int left, string&, int right) const {
        *res = left + right;
    }
} const return_add;

struct return_sub {
    return_sub() {}
    void operator() (int *res, int left, string&, int right) const {
        *res = left - right;
    }
} const return_sub;

struct return_mul {
    return_mul() {}
    void operator() (int *res, int left, string&, int right) const {
        *res = left * right;
    }
} const return_mul;

struct return_div {
    return_div() {}
    void operator() (int *res, int left, string&, int right) const {
        *res = left / right;
    }
} const return_div;

auto const recognise_space = many(accept(is_space));
auto const recognise_number = discard(recognise_space) && some(accept(is_digit));
auto const recognise_start = discard(recognise_space) && accept(is_char('('));
auto const recognise_end = discard(recognise_space) && accept(is_char(')'));

parser_handle<int> const additive_expr(parser_handle<int> const e) {
    return log("+", all(return_add, e, discard(recognise_space) && accept(is_char('+')), e))
        || log("-", all(return_sub, e, discard(recognise_space) && accept(is_char('-')), e));
}

parser_handle<int> const multiplicative_expr(parser_handle<int> const e) {
    return log("*", all(return_mul, e, discard(recognise_space) && accept(is_char('*')), e))
        || log("/", all(return_div, e, discard(recognise_space) && accept(is_char('/')), e));
}

parser_handle<int> const expression = (
        discard(recognise_start) && (
            additive_expr(reference(expression)) || multiplicative_expr(reference(expression))
        ) && discard(recognise_end)
    ) || all(return_int, recognise_number);

What concerns me is that the backtracking stack can get pretty big even with simple parsers. It really needs a 'cut' operator like Prolog's (although I am not keen on cut in Prolog)... I think I could apply the same solution as in my logic interpreter, with nested meta-interpreters (parsers).

My other concern is that the left hand parser result is thrown away if the operator does not match.

I still have to read the paper on Polish parsers, so maybe that will give me some ideas as well.

An idea that occurs to me is that we could build a grammar tree from the parser combinators (so that it appears to support full backtracking), but then interpret that tree in an efficient way.

Backtracking at the hand of the language designer

In my experience the amount of backtracking depends on the language designer. It's a reason why I don't like that your parser works on a character stream. Informally, not tokenizing blows up the amount of backtracking by a rather large constant.

But backtracking LL(k) isn't a good solution on highly ambiguous grammars out of your control, I agree.

Tokenizing makes no difference

Tokenizing would make no difference to the stack size, as we simply store the stream position on the backtrack stack and then seek to it. The number of choice points, and hence the backtrack stack size, is the same tokenized or non-tokenized. The streambuf buffer size is fixed, so there is no difference there, and it's up to the operating system how it uses its page cache to satisfy seek requests. Userspace buffering seems to make very little performance difference.

The problem has little to do with the ambiguity of the grammar. Take a CSV file where each number is separated by a comma or a newline. Before parsing each number we have to create a backtrack checkpoint in case something fails later, so with a 1000 by 1000 CSV you get 1 million checkpoints on the backtrack stack. So it's each choice point (in this case comma or newline) that causes the problem; you really need a cut operator after such choices to indicate that later failures cannot undo earlier choices. What I really need is a backtrack stack of deferred parsers, rather than stream positions. That way you can perform a cut by just popping the top parser off the stack...
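
The checkpoint mechanism itself is roughly this (a simplified sketch rather than the library code, assuming an istream-style input):

template <typename A, typename B, typename Result>
bool ordered_choice(istream &in, Result *result, A const& a, B const& b) {
    streampos const checkpoint = in.tellg(); // the choice point is just a stream position
    if (a(in, result)) {
        return true;
    }
    in.clear();           // clear any failure flags
    in.seekg(checkpoint); // backtrack by seeking to the saved position
    return b(in, result);
}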

The problem with tokenisers is sequences like "name123" and "name 123". You have to have a global definition of what a name looks like, and then there are literal strings etc...

Too bored to answer

Tokenizing would make no difference to the stack size, as we simply store the stream position on the backtrack stack and then seek to it.

Man. Too bored to answer and read your code. The above statement can't be true unless you're reinventing a form of bottom-up parsing. Maybe you are, no idea.

Does anyone still use tokenizers?

I didn't think anyone still used tokenizers due to the problems with string literals etc. You end up with a horrible mess trying to feed information back to the tokenizer from the parser...

You're a bit peculiar?

Uhm. Everybody uses tokenizers? And all attempts to do without have failed? Unless you know more than I do, of course.

Holy grail of LL(1)

As you note it also lists the disadvantages of LL(1). Ever seen a TeX error? Though I guess there were good reasons to design TeX the manner in which it was designed and nobody has done a better job so far.

Knuth is one of the few computer scientists I take seriously.

As I stated, backtracking LL(k) is a 95% solution for PLs. The author mentioned there is interested in 100% solutions. Unfortunately, backtracking higher-order attribute grammar parsers probably go beyond what is solvable in his system. But, yeah, research.

Anecdotal Evidence

I have never used a tokenizer outside of academic requirements; all the real-world parsers I have implemented have been scannerless. I have a lot of respect for Knuth, and am proud to be an account holder at the Bank of San Serriffe, although I could do with saving a bit more :-)

In any case, maybe I could arrange it so you could use an optional tokenizer, I just wanted to avoid another template parameter, or the cost of virtual functions.

I am now seriously considering building an AST of the grammar from the combinators, allowing grammar composition, followed by a tree rewrite into a character at a time parser with no backtracking. What do you think?

Don't think a lot

Well, if you can pull it off with your language you should. I am not stopping you.

You might try a grammar work bench

An idea I have: I actually used a grammar workbench a long time ago to check whether a certain grammar met certain conditions. Maybe that's an idea? You would be able to test whether your language is LL(1) reasonably fast. Not sure they're still in existence though.

Hah. I found your bank account!

I am green with envy.

LR Parser

I'm not happy with the backtracking implementation, and I really like the Polish parsers. I am sure these can be implemented in C++ with an object-oriented approach and lots of runtime polymorphism, and I might have a go, but that would best be done from a clean sheet. So I will branch the simple backtracking and revert the head to the LL(1) parser, which, although limited, I am happy with. Complex parsing can still be done by building an LR shift/reduce parser using the user functors.

My concern with the above is that the grammar is static, so in theory you can statically transform the combinator description into an LR parser; after all, that's what tools like yacc and bison do. Neither Haskell nor C++ would seem to be able to do this (the Polish parsers paper relies on polymorphic recursion). This is where my language comes in: defining the transform from combinators to shift/reduce in the logic meta-language, and the components of the LR parser in the language itself. Of course this is something I can use to implement its own parser later in the bootstrapping, so I may not want to spend too long on the C++ parser.

Backtracking

I have realised I was wrong about the backtracking memory usage. The backtracking stack only grows in the recursive case, and is limited to the choices within one expression. Having started to update a few projects to use the new combinators, it's clear the backtracking version is easier to use, so I have promoted it back to the master branch in the GitHub repository...

Lazy Tokenisation

I have been thinking about the comments regarding tokenisers, and have realised there is a mapping between a certain kind of recogniser and a tokeniser. If all recognisers have the form:

tokenise(recogniser) = discard(recognise_space) && recogniser

then when you build a parser out of "tokenised" recognisers you get lazy tokenisation, so you don't need a separate tokenisation pass.

In other words, you can think of tokenisation as a functor that maps a function on characters to a function on recognisers. In C++ this functor would be:

template <typename R> auto tokenise (R const& r) -> decltype(discard(many(accept(is_space))) && r) {
    return discard(many(accept(is_space))) && r;
}
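
For example (the names here are illustrative), lazily tokenised recognisers are then just:

auto const tok_number = tokenise(some(accept(is_digit)));
auto const tok_comma  = tokenise(accept(is_char(',')));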

Improved Error Reporting

The parsers now have automatically generated EBNF names, which can be overridden with user-defined names, along with the ability to mark parsers as 'strict', which promotes failure to an exception. For strict parsers, and for alternatives in an 'or', the beginning and end of the characters consumed by a failed parser are recorded and used in the error report. This results in error reports that look like this:

error parsing let: line 1, column 4.
let x = a b £ c in let y = a b c in y 
   ^--------^
<name - "in" - "let", '=', {expression}-, "in", {expression}->

Improved language design for better parsing diagnostics

Improved language design can result in much better parsing diagnostics.

Please see the following:
Language Design Issues

Regards,
Carl

C++ Parser Combinators Performance Improvement

The parsers have now been refactored to take an iterator and a range. This allowed some optimisations that improved performance by 25% to 40% (the more backtracking, the bigger the improvement). By substituting the iterators based on the C++ streambuf with ones based on an mmap-ed "file_vector", a further 2x performance improvement is possible, and this is now included as a working example for comparison:

https://github.com/keean/Parser-Combinators/blob/master/example_expression.cpp

C++ Parser Combinators and Attribute Grammars

I have added a threaded state to the parser combinators, which is similar to inherited attributes in attribute grammars (the original parser result corresponds to synthesised attributes). I have added an example parser for Prolog-like logic languages, which includes the construction of an abstract-syntax representation as the output of the parser. See:

https://github.com/keean/Parser-Combinators/blob/master/prolog.cpp

Memoization?

Are you memoizing results in case a backtrack leads to retrying some of the same parse?

In BNF

 S ::= A B C 
     | A B D

In some code for a parser combinator library I haven't posted, I had a database of parse results indexed by the pair (character-stream-offset, parser-id). It was highly influenced by a parser combinator system I found somewhere written in JavaScript. I did not evaluate this for performance.
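
In C++ terms the idea is roughly this (a from-memory sketch, not the original code; it only caches success and end offset, not the semantic value):

struct memo_table {
    // (character-stream-offset, parser-id) -> (did the parser succeed, offset where it ended)
    map<pair<size_t, int>, pair<bool, size_t>> cache;

    bool lookup(size_t offset, int parser_id, pair<bool, size_t> &out) const {
        auto const it = cache.find(make_pair(offset, parser_id));
        if (it == cache.end()) return false;
        out = it->second;
        return true;
    }

    void store(size_t offset, int parser_id, bool ok, size_t end) {
        cache[make_pair(offset, parser_id)] = make_pair(ok, end);
    }
};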

Vs. PEGTL?

Just stumbled over this post and was wondering how it compares to the PEGTL (Parsing Expression Grammar Template Library) (disclaimer: of which I am an author). Both libraries seem to solve the same problem, but also seem to have a very different look-and-feel.

For example the PEGTL has many more combinators and rules, which your library doesn't seem to require, and I haven't quite grokked yet whether that's because of the different design, or simply because it includes fewer technically-redundant convenience combinators...

Grammar validation

One of my thoughts/complaints about parser combinators is that they facilitate unknowingly ambiguous languages. This is also an objection to PEG grammars: conflicts are suppressed by strong rule ordering, rather than checked to ensure that no errors are present. Sometimes that's the right answer or just the best practically feasible answer, but for new language designs it seems to me that ambiguities are not something we want to accept. Certainly not unknowingly.

As I've considered a combinator library for BitC, one of the things I'd like to be able to do is flip a mode switch somewhere so that the same source-language program using combinators, instead of implementing a parser, produces a suitable data structure for checking for shift/reduce errors and the like according to some specified method, whether LL(1), LL(k), or what have you.

Is such an alternative interpretation of the combinators possible in your approach with a suitably designed alternative header?

Forcing no backtracking

In the current version of these parser combinators you can force no backtracking by deleting the copy constructor on the inherited attribute (the threaded state). This works because backtracking has to undo the side effects of failed parsers on this state, and that requires copies of the state to be made. For an example see the sample Prolog parser:
https://github.com/keean/Parser-Combinators/blob/master/prolog.cpp
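
A sketch of the idea, with a hypothetical user-defined inherited-attribute type (not taken from the repository):

struct my_attributes {
    int indent_level;
    my_attributes() : indent_level(0) {}
    my_attributes(my_attributes const&) = delete;             // no checkpoint copies can be made,
    my_attributes& operator= (my_attributes const&) = delete; // so a backtracking grammar fails to compile
};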

Prolog parsers

What makes Prolog's parser interesting is that one can define new operators inline in the code being parsed, specifying left-, right- or non-associative infix operators, plus prefix and postfix operators, with arbitrary precedence.

Prolog also has a built-in parsing technology, definite clause grammars (DCGs), which are more powerful than the parsers generally in use but are surprisingly useless for implementing the preceding inline operator definition feature.

I don't know if Parser-Combinators has either power; I doubt it, since DCGs can even be somewhat context-sensitive.

Does your prolog parser implement the custom operator feature?

Can't respond for OP, but if

Can't respond for OP, but if I understand the question correctly I'd say that it should be possible in the sense that the PEG/parser combinator approach does not prevent it...

For the PEGTL we have examples with the operators, their precedence, and their associativities being part of the structure of the grammar (the Lua 5.3 example), and with a run-time data structure that can handle arbitrarily many binary operators of arbitrary precedence and associativity and could be changed at any time (the calculator example), including from a semantic action in the middle of parsing something.

This is probably not textbook PEG, but in practice it's easy to use additional data structures, see also how we handle the Lua long/raw string literals (which start with [=[ and end with ]=] where the number of '=' is arbitrary but must match).

I think you'll find that doesn't work

if you analyse it with low-precedence operators, you'll find that each operator has to go into the grammar at an arbitrary number of points, depending on the other operators.

Using any kind of grammar that doesn't explicitly have operators has that problem.

Right Associative Operator Support in Prolog Example Parser

I have just added right-associative operator support to the example Prolog parser (left will follow once I have decided how to declare the difference). You can see the updated source for the example parser at: https://github.com/keean/Parser-Combinators/blob/master/prolog.cpp

The combinators can be named, and can generate an EBNF-like printout that is used in the error reports, which look like this (if you don't catch any exceptions from the combinators):

example.cl
terminate called after throwing an instance of 'parse_error'
  what():  failed parser consumed input at line: 46 column: 84
    expr(Fun, C1, arrow(A, B)), a(X) -> b(X) -> c(X), a(X -> Y -> Z), X -> Y -> Z, Z     expr(Arg, C2, A),
                                                                                   ^----^
expecting: variable, operator, op-list
where:
	atom = lowercase, {alphanumeric | '_'};
	op-list = term, [operator, op-list];
	operator = {punctuation - ('_' | '(' | ')' | ',')}- - "." - ":-";
	struct = atom, ['(', op-list, {',', op-list}, ')'];
	term = variable | struct;
	variable = (uppercase | '_'), {alphanumeric | '_'};

Aborted

I think that's pretty readable, and it's auto-generated from the combinator declarations. I am interested in any suggestions for improving it.

I also think the combinators currently require quite a lot of boilerplate user-functors when combining parsers with different result types, and I am interested in ways of reducing this.

But what about validation?

Disabling the backtracking is useful to be able to do, but what about extracting things like LR items and first/follow sets so that I can run a check for parse ambiguities? Not during a production run, but perhaps during a distinguished test or validation compile.

Ambiguities

I guess I am not seeing where the ambiguities come from. If you allow backtracking the grammar can be ambiguous, but without it I don't see how you can build an ambiguous parser. To me the parser would be correct by construction and not need a validation pass, but maybe I am not seeing how the ambiguity can get in?

Lookahead == Backtracking

I haven't looked at your implementation, so perhaps you have done something clever that I don't know about. Many combinator packages implement recursive descent naively, relying on backtracking to deal with things that (in a traditional static parser) would be disambiguated by lookahead.

What I think I'm saying is that it's very easy to write grammars in any system that would have shift/shift or shift/reduce conflicts without realizing that one has done so. This can easily result in matching a language other than the one intended, and that is still true if the recursive descent mechanism is non-backtracking. The question is how to have confidence that what I wrote is what I actually intended.

So: I'd really like a way to take a combinator-written parser, extract the kinds of tables used by a conventional statically compiled parser, and run the algorithms necessary to determine that such conflicts and potential ambiguities do not exist.

Merely defining them away by enforcing determinism certainly eliminates the conflicts, but it doesn't address whether I'm matching what I intend to match.

Lookahead is finite, backtracking isn't

Also it's easy for a backtracking parser to include some context-sensitive elements, or to do semantic tests and processing while it parses; doing those with lookahead would require new parsing constructs and is probably less powerful.

For instance, here are a couple of examples of the well-known context-sensitive grammar a^n b^n c^n in Prolog:

s --> a(N), b(N), c(N).
a(0) --> [].
a(M) --> [a], a(N), {M is N + 1}.
b(0) --> [].
b(M) --> [b], b(N), {M is N + 1}.
c(0) --> [].
c(M) --> [c], c(N), {M is N + 1}.

or

s --> symbols(Sem,a), symbols(Sem,b), symbols(Sem,c).
symbols(end,_) --> [].
symbols(s(Sem),S) --> [S], symbols(Sem,S).

Generating Tables.

Parser combinators like these normally allow no lookahead or backtracking, and all combinators are ordered (which parser is tried first is clear from the syntax). In this case it is an error for a parser that has consumed input to fail, so the combinator (A || B) will throw an exception if A fails after consuming input. So grammars have to be LL(1) and have no ambiguity.

You have to introduce ambiguity intentionally by using the 'attempt' combinator, as in (attempt(A) || B), which saves the parser state before trying A and restores it (backtracking both the input position and the state of the inherited attributes) before trying B if A fails.

Parsers are tried left to right, so there is really no ambiguity in the grammar, because we know A will be tried first and then B (hence the choice of the short-circuit 'and' and 'or' operators for the combinators in C++).

As there is no ambiguity, would generating parser tables still be useful?

Talking past each other

shap's point is not that the described grammar may be ambiguous (it isn't, by a property of PEG parsers), but that the *intended* grammar may be ambiguous, and that creating a path of least resistance where users will never be told about the ambiguity in their *intent* (because the translation into the tool is lossy by default) is harmful in some cases.

Intended Ambiguity

If I want ambiguity I need to indicate it explicitly. Therefore if your intent is ambiguous you will have explicit "attempt" parsers, so you are told about it. What am I missing?

The point

This point is made in much detail in this blog post.

Another strategy is to abandon context-free grammars completely and use a formalism like parsing expression grammars that is unambiguous by definition. Parsing expression grammars avoid ambiguity by forcing all grammar rules to be defined in terms of prioritized choice, so in cases where multiple grammar rules match the input the first one is correct by definition. [...]

Prioritized choice is a great tool for resolving some ambiguities; it works perfectly for the solving the dangling else problem. But while this has given us a tool for resolving ambiguity, it hasn't solved the problem of finding ambiguities. Every rule in PEGs is required to be defined in terms of prioritized choice, which means that every PEG rule could be hiding a "conceptual" ambiguity [...]

// Is this PEG rule equivalent to a <- c / b ?
a <- b / c
 
// We can't know (it's undecidable in general),
// so every rule could be hiding an ambiguity we don't know about.

I call this a "conceptual" ambiguity because even though a PEG-based tool does not consider this ambiguous, it still looks ambiguous to a user. Another way of thinking about this is that you have resolved the ambiguity without ever being aware the ambiguity existed, thus denying you the opportunity to think about it and make a conscious choice about how it should be resolved. Prioritized choice doesn't make the dangling else problem go away, it just hides it. Users still see a language construct that could plausibly be interpreted in two different ways, and users still need to be informed which option the parser will choose. [...]

And unlike GLR, Packrat Parsing (the linear-time algorithm for parsing PEGs) doesn't tell you even at parse time if the input string is ambiguous. So with a Packrat-Parsing-based strategy, you are really flying blind about whether there are "conceptual" ambiguities in your grammar. It's also possible for entire alternatives or rules of a PEG to be unreachable (see here for a bit more discussion of this). The net result is that with PEGs you know very little about the properties of your grammar.

So far none of the options we've discussed can actually help us find ambiguities up-front. Surely there must be a way of analyzing our grammar ahead of time and proving that it's not ambiguous, as long as the grammar isn't too crazy? Indeed there is, but the answer brings us back to where we started: LL and LR parsers.

Prioritised Choice

I think I understand. You are saying that even though there is no ambiguity because of prioritised choice, it is useful to see where there would have been ambiguity if we ignored the prioritised choice?

I am not sure why this would be the case. For example, consider expressions containing only '1' and '+':

1 + 1 + 1

There are two ways to parse this ignoring associativity, which means there are two valid parse trees:

(1 + 1) + 1
1 + (1 + 1)

but if the parser only outputs one, even if it is arbitrary, why does it matter?

It would probably matter

It would probably matter more with an example where the two parse trees have different semantics. Change those pluses to minuses.

But I think the even more important example is where you have two unrelated expression forms that interact ambiguously. For modularity reasons, it's nice to be able to view the various expression forms as an unambiguous choice rather than a prioritized choice; with prioritized choice you have to think about all pairs of expressions and figure out which production should come first. It's much easier if you can just throw all of the productions at the compiler and have it warn you that this pair here might conflict. Bonus points if it can also give you an example of an ambiguous parse. Then you can insert an explicit prioritization to resolve the ambiguity, if that's the right solution.

You can't (easily) do that

My original intent with the Prolog parser was to construct obvious parsers for things like expressions and then join them together. The problem is that backtracking gets out of hand rapidly, and we have to cope with undoing changes to the already-parsed abstract syntax when it backtracks.

In practice it seemed easier to build the prolog parser mostly LL(1), with controlled backtracking only where necessary.

I am not sure the idea of combining separate parsers actually works well; for example:

exp = var | exp, +, exp | exp, *, exp

it seems easier to do

exp = var, option((+ | *), exp)

Validating powerful grammars

What is the best way to validate powerful intuitive grammars, e.g.,
the one at the following location:
Intuitive Grammar

For some definition of "best".

Best is not a very good choice of word here. Do you mean fastest, or perhaps easiest for the programmer to compose and read?

Criteria for technology validating powerful grammars

Excellent question!

We are looking for intuitive ways to validate grammars that produce understandable diagnostics.

Is it good enough.

Parser combinators seem a good way to express the parser part, provided you have a good minimal set of primitives. The parsers are declarative and much easier to change than hand-written parsers.

Errors can be reasonably good, because you can track the choice points and where things go wrong (with backtracking), and you can generate something approximating EBNF for what the parser was expecting.

Synthesised attributes seem easy, as the parser should be able to report errors during synthesis. Inherited attributes seem harder, and you might want to keep a reference to the parser that generated the attribute.

I think a compiler from a formal attribute grammar to code (or a language with built-in support for attribute grammars and backtracking) would be more in line with your goals.

The goals here are maximum understandability and good error reporting without compromising performance, embedded in the host language (C++). In my project this will eventually be replaced by a self-hosted parser, which should be a lot nicer, but there is a long way to go to get there.

Parser combinators are not good enough

Unfortunately, parser combinators are not good enough to make a competitive IDE for a sophisticated language like ActorScript.

Because the syntax for ActorScript is very tight, it is possible to have very good interactive diagnostics. However, it does not seem to be feasible to express the diagnostic process using parser combinators.

Interesting

Can you describe the problem in more detail? How does the diagnostic process work, and why does it have anything to do with parser combinators?

How to produce good diagnostics as a program is typed in an IDE

The problem is to produce good diagnostics as a program is typed into an IDE.

Diagnostics

I am not sure what you mean by diagnostics, but I don't see a problem with the parser combinator approach. The parser combinators can compose a data structure that represents the parse, which you can then interpret however you want, including giving interactive feedback.

My research topic! But on

My research topic!

But on the way, I found actors to be incredibly unsuited to the incremental computations needed to solve such problems.
