## [ANN] YARD 1.0: C++ Template Metaprogramming Parsing Framework

Three years after the initial announcement of the YARD parsing framework here at LtU, I've finally posted version 1.0 of YARD. Despite being self-promotional (I figure it is of forum quality, if not quite meritorious of front-page status), this post should still be relevant to LtU readers because of its overlap with other LtU topics:

Of course many readers of LtU are language implementers and may find a new parsing framework useful or inspirational.

Below is the official announcement of YARD:

Version 1.0 of the YARD parsing framework for C++ is now available for download at http://sourceforge.net/project/showfiles.php?group_id=126822. The YARD project now also has a new tutorial, written with the help of Max Lybbert, that provides an introduction to language parsing and parsing expression grammars: http://yard-parser.sourceforge.net/cgi-bin/index.cgi?dest=Documents&doc=tutorial.

The YARD framework uses a novel template meta-programming technique to construct efficient recursive-descent parsers at compile-time. YARD parsers are constructed as parsing expression grammars expressed using templates in a form resembling EBNF. Parsers constructed using YARD combine lexing and parsing phases, and can automatically generate abstract syntax trees, without requiring a separate code-generation phase.

The YARD framework has been under development for three years and has spawned other related projects (e.g. the Biscuit parsing library, http://p-stade.sourceforge.net/biscuit/index.html ). YARD has been used in commercial tools (e.g. http://www.tmg.de/admin.local/lib/antenna/linux/ReadMe.txt) and various open-source projects (e.g. http://www.cat-language.com).

The YARD library is in the public domain ( http://creativecommons.org/licenses/publicdomain/ ), but those who require a release under a specific open-source license can request one on the discussion group ( http://sourceforge.net/forum/forum.php?forum_id=432769 ).

### YARD vs Spirit?

What's the advantage of YARD over Spirit? Taking the Cat language as an example, it seems to me that expressing its grammar with YARD is not as elegant as with Spirit, although YARD may be a little faster due to its static nature.

### YARD vs Spirit

I've posted a brief, biased comparison between the two at http://cdiggins.com/2007/07/10/c-parsing-frameworks-yard-vs-spirit/. I don't want to go deep into a YARD vs Spirit comparison, but LtU readers may be interested to know that both YARD and Spirit use expression templates to represent parsers. In YARD these expression templates are defined directly by the programmer, whereas in Spirit the expression templates are generated implicitly through operator overloading.

The other major difference is that Spirit separates the scanning and parsing phases, whereas YARD takes the PEG approach of combining scanning with parsing.

### YARD vs. Spirit

Your comment on Spirit is wrong. There is no special scanning process in Spirit, though you can use Spirit-generated parsers with a separately built scanner (lexer).

### Correction

I'm sorry about the mistake. This wasn't very clear from the documentation at
http://spirit.sourceforge.net/distrib/spirit_1_8_3/libs/spirit/doc/scanner.html.

However, to support what you say, I did find the following quote:

> Typical parsers regard the processing of characters (symbols that form words or lexemes) and phrases (words that form sentences) as separate domains. Entities such as reserved words, operators, literal strings, numerical constants, etc., which constitute the terminals of a grammar are usually extracted first in a separate lexical analysis stage.
>
> At this point, as evident in the examples we have so far, it is important to note that, contrary to standard practice, the Spirit framework handles parsing tasks at both the character level as well as the phrase level. One may consider that a lexical analyzer is seamlessly integrated in the Spirit framework.
>
> Although the Spirit parser library does not need a separate lexical analyzer, there is no reason why we cannot have one. One can always have as many parser layers as needed. In theory, one may create a preprocessor, a lexical analyzer and a parser proper, all using the same framework.

However, in the documentation there is much reference to a distinction between "character level parsing" and "phrase level parsing", which indicated to me a distinction between tokenization and parsing. Furthermore, the notion of a "skip parser" is precisely a tokenizer. I am unfortunately unclear at this point whether the full range of parsing operators in Spirit applies equally to both "character level parsing" and "phrase level parsing".

### B =>* epsilon

In your YARD documentation you write:

> Zero-width rules are a feature of parsing expression grammars that are absent in context free grammars, and can be used to optimize a parser.

If I understand the use of your zero-width rules correctly, these correspond to CFG rules that derive epsilon (the empty string). It seems that where you use such rules to trim the search space, this corresponds to computing lookahead sets (SLR, LALR, etc) and reducing non-terminals (without consuming input) only for those inputs.

My understanding of PEGs is not that they subsume CFGs, but rather that they cover the subset of CFGs that can be parsed by the recursive descent method (and take directives to minimize backtracking).

### Thanks

Thanks Kalani. I just reviewed http://pdos.csail.mit.edu/papers/packrat-parsing:ford-ms.pdf and I see that a PEG expresses a TDPL, which is not a superset of CFGs. Sorry about this mistake.

However, with regard to zero-width rules: in a parsing expression grammar these can be of arbitrary length and sophistication, as opposed to single-token or even N-token look-ahead.

From Bryan's paper:

> Another important part of this expressiveness is provided by the & and ! operators demonstrated above. These operators represent syntactic predicates [16], which allow arbitrary syntactic patterns to be used to direct decisions without actually consuming any input text.

### I prefer my grammar to be readable like EBNF.

The advantage of Spirit is that the grammar is more readable than YARD's, and I think this advantage far outweighs any advantage of YARD over Spirit. In other words, it's more important to be able to read the parser as in EBNF than to be a little faster or to have no big dependencies, especially in the context of grammars being changed frequently.

What I don't like in both approaches is the way semantic actions are handled. I would prefer actions to be declared separately from the grammar.

### Not so different

The readability is not so different:

Spirit Example (from the docs):

    struct calculator : public grammar<calculator>
    {
        template <typename ScannerT>
        struct definition
        {
            definition(calculator const& self)
            {
                group       = '(' >> expression >> ')';
                factor      = integer | group;
                term        = factor >> *(('*' >> factor) | ('/' >> factor));
                expression  = term >> *(('+' >> term) | ('-' >> term));
            }

            rule<ScannerT> expression, term, factor, group;

            rule<ScannerT> const&
            start() const { return expression; }
        };
    };

YARD:

    namespace calculator
    {
        struct Expression;

        struct Group :
            Seq<Token<'('>, Expression, Token<')'> > { };

        struct Factor :
            Or<Integer, Group> { };

        struct Term :
            Seq<Factor, Star<
                Or<
                    Seq<Token<'*'>, Factor>,
                    Seq<Token<'/'>, Factor>
                > > > { };

        struct Expression :
            Seq<Term, Star<
                Or<
                    Seq<Token<'+'>, Term>,
                    Seq<Token<'-'>, Term>
                > > > { };
    }

Notice that YARD doesn't have a separate scanner; scanning is integrated into the grammar. This is another big difference between the two.

As for the speed difference, it is more significant than you give YARD credit for.

### Here is another take on it.

Your example, simplified:

    rule<> group      = '(' >> expression >> ')';
    rule<> factor     = integer | group;
    rule<> term       = factor >> *(('*' >> factor) | ('/' >> factor));
    rule<> expression = term >> *(('+' >> term) | ('-' >> term));


Much more readable than YARD in my opinion.

> As for the speed difference, it is more significant than you give YARD credit for.

You will be hard-pressed to find significant differences in performance. Modern CPUs execute indirectly dispatched calls (through a function pointer or a table index) at the same speed as static calls, provided that the dispatch target does not change much, which is the case in parsers.

### Some other questions

Does YARD handle the left recursion problem?

How does YARD handle backtracking? Spirit throws an exception, methinks.

How does YARD handle the stack overflow problem? In recursive-descent parsers, it's quite easy to overflow the stack, especially in grammars with recursively nested definitions.

May I remind you of an old discussion on these subjects:

http://lambda-the-ultimate.org/node/1599

I have done an implementation of the idea expressed above using two stacks: one for the 'continuation', i.e. what to do next, and one for the backtracking. When a rule fails to be matched, the parser picks a continuation from the backtracking stack.

### Questions answered

The YARD-Spirit specific questions I've answered on my blog.

As for the discussion at http://lambda-the-ultimate.org/node/1599, I wasn't aware of it, thanks for bringing it up.

I have some issues with the basic premises of the original post:

> 1) the stack can easily overflow, especially when parsing deeply-nested rules.

This is not a problem I have ever encountered in a recursive-descent parser, and I use them exclusively for all of my parsing work, and you know how much that is :-)

> 2) some of the results may be thrown away, making parsing slow.

How slow depends on how many results are thrown away. Interestingly, one solution is the "memoization" used in packrat parsing. In general, speed has never been an issue for me with R-D parsers.

> 3) left-recursive grammars can not be handled.

This is not a limitation of R-D parsing, but of LL grammars.

> 4) backtracking has a cost.

Not a significant cost: just a few return instructions, and pointer ... sorry I meant input iterator ... assignments.

> I have done an implementation of the idea expressed ...

Can you share it?

### stack overflow AST nodes

> cdiggins: This is not a problem I have ever encountered in a recursive-descent parser, and I use them exclusively for all of my parsing work, and you know how much that is :-)

It's a good idea to check for stack overflow when a parser is deployed where a malicious bit of source code might take down a server that runs a recursive-descent parser on client requests. This is especially true in servers using threads to parse requests. (Single-threaded Unix apps might assume the stack can grow arbitrarily large, and spiral into death by page thrashing before death by overflow.)

For example, if you're using pthreads and the default stack size is 1MB, and this seems too large for the many threads you might want to have in a pool, you might choose a smaller default size (like the 256K I saw recently). Then it's suddenly not hard to construct a request, not especially big, that can overflow this stack in a parser.

What I've done in the past is reify stack overflows as an explicit AST node. (The node says "you overflowed the stack here blah blah blah", so this propagates to the server response.) A lot of error conditions can be turned into parse-tree nodes to defer deciding what should be done about them, so the parser needn't do it. [cf]

### Very good point.

This is a very good point. Thanks for bringing it up.

### This is not a problem I

> This is not a problem I have ever encountered in a recursive-descent parser, and I use them exclusively for all of my parsing work, and you know how much that is :-)

It's very easy to overflow the stack. I encountered the problem when I coded a Java-like grammar with many operator levels. Expressions like foo * bar(3, 2) + (new Item({1, 2, 3})).getSum() can easily create an overflow, especially if they are deeply nested (expressions in statements in methods of anonymous classes declared deep inside expressions of methods of classes).

> How slow depends on how many results are thrown away. Interestingly one solution is "memoization" used in packrat parsing. In general speed has never been an issue for me R-D parsers.

And for me, speed has never been an issue, since I did it only for testing. But it might be an issue in real-world scenarios.

> This is not a limitation of R-D parsing, but of LL grammars.

But it can be handled easily, especially if you use stacks.

> Not a significant cost: just a few return instructions, and pointer ... sorry I meant input iterator ... assignments.

Very significant cost, especially if the error is many levels deep. For example, if parsing fails in a complex expression in a statement in a method in a class, returning from so many functions is incredibly significant.

> Can you share it?

If only I had it... but if I have time, I will try to make something. It's not too difficult to try it yourself, though. Parser classes should have a method that accepts a parser context, where the parser context has two stacks: a stack of continuations and a stack of backtracking entries. Sequence parsers place the next element of the sequence on the continuation stack, whereas choice parsers place the alternative path on the backtracking stack, along with the parser status (i.e. an index of the current token at the point of backtracking).

The parser loop pops a parser from the continuation stack and invokes its 'parse' method. If that method returns failure, then the parse context is popped from the backtracking stack, and execution continues from there. Parsing succeeds when all tokens are consumed and the continuation stack is empty.

With a little imagination, you can handle left recursion as well.

My next attempt at parsers is to make a table-driven parser constructed lazily out of a Spirit-like grammar: each visit to a parser node will create a table of tokens; the next visit will not ask the parser node to parse anything but will do a table lookup instead. Each table entry will contain a token id and a pointer to the table that represents the next parser state. Gradually, as input is parsed, a table-driven parser will be built up, allowing merging of identical tokens at the same positions, resolution of the left-recursion problem, and identification of shift-reduce and reduce-reduce conflicts.

EDIT:

After an hour of furious coding, here it is:

parser1.zip