LALR grammar of C++

Recently I had a look at Roskind's C++ grammar, which can be fed to Yacc to generate a parser for C++. But neither GCC nor Eclipse CDT uses a Yacc-generated parser. I tried to find out why an LALR grammar for C++ is not a popular choice for these tools. Somewhere I read that it is difficult to build the symbol table with LALR parsing of C++. Is that correct? Why is it so?


C++ parsing is *hard*

See this overview, which compares and contrasts several different attempts.

It all boils down to the fact that parsing C++ needs semantic information as early as the lexer. Thus any attempt to organise the parser cleanly (and therefore have some confidence that it is correct) is always going to be severely hindered.
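A minimal sketch of the classic case (the names here are purely illustrative): the very same token sequence is a declaration or an expression depending on whether a name is known to be a type, which is exactly the semantic information the lexer and parser need up front.

    namespace as_type {
        typedef int x;
        void f() {
            x * y;        // declaration: y is a pointer to int
            (void)y;      // silence the unused-variable warning
        }
    }

    namespace as_variable {
        int x = 2, y = 3;
        void f() {
            x * y;        // expression: multiply x by y and discard the result
        }
    }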

C++ Parser

If you need a C++ parser, consider the "elsa" component of Oink.

it's harder than you think

Roskind's grammar is somewhat dated.
IIRC, C++ can require unbounded lookahead.
GCC used a Bison parser for years but switched to recursive descent for reliability, error handling, and standards compliance. There is a PhD thesis called Fog that discusses using LALR and then resolving the ambiguity in a separate pass.
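The lookahead point can be illustrated with the statement-level declaration/expression ambiguity. A hedged sketch in the spirit of the standard's own [stmt.ambig] examples (the class and names are just illustrative): statements that begin with the same tokens parse completely differently, and the token that decides can sit arbitrarily far from the start.

    struct T {
        T(int = 0) {}
        T operator++(int) { return *this; }   // let postfix ++ work on a temporary
    };

    void demo(int x) {
        T(x)++;        // expression-statement: functional cast T(x), then postfix ++
        T(*p)(int);    // declaration: p is a pointer to a function taking int, returning T
        T(a)[5];       // declaration: a is an array of five T
        (void)p; (void)a;
        // With nesting, the deciding token moves arbitrarily far away:
        // "T((((x))))++;" is an expression, while "T((((y))));" is a
        // declaration of y, and the difference only shows up at the very end.
    }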

Fog thesis link

CDT moving to LALR

Actually, the Eclipse CDT community is starting to experiment with using various parser generators to build new parsers for C and C++. So far we have looked at LPG (an LALR parser generator that's part of the IBM research SAFARI project) and ANTLR. The problem we are trying to solve is to allow new languages and new dialects of C/C++ to be added to CDT. A parser generator like LPG helps a lot because you can take an existing grammar file, add the new rules for your language extension, and generate a whole new parser.

I myself have been using LPG to create a parser for C99 with extensions for Unified Parallel C. There have been two huge headaches so far...

1) The preprocessor. It seems that these parser generators assume you have a single lexer and a single parser. Getting a token-based implementation of the C preprocessor to live between the lexer and the parser in LPG has been significant work. I've had to write an entire preprocessor by hand because I really don't think it's possible to generate a preprocessor using a parser generator (see the first sketch after this list). Maybe someone here can shed some light on this.

2) Ambiguities in the grammar. The grammar for C99 isn't as complicated as C++'s, but it still suffers from ambiguities. Traditionally these are resolved by feeding type information back to the lexer; the lexer can then emit either an 'identifier' or a 'typedefname' token based on that information (the second sketch below shows the idea). Unfortunately we can't use this trick because in CDT land we have a policy of not collecting any semantic information during the parse. That's right, we're parsing C/C++ with no symbol table. It improves performance because you don't need to parse #included files (macros can be retrieved from the index), but it inevitably leads to inaccuracies in the parse. So far we have been able to live with this (CDT isn't a compiler, after all).
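On the first point, a small, hedged illustration (the macro names are made up) of why macro expansion resists being generated from a grammar: expansion rewrites the token stream itself, and the output of one expansion (here, token pasting) can trigger another round of expansion.

    #define CAT(a, b) a ## b
    #define VALUE 42
    #define GET(suffix) CAT(VAL, suffix)

    int n = GET(UE);   // GET(UE) -> CAT(VAL, UE) -> VALUE -> 42

On the second point, the traditional feedback trick described above (often called the "lexer hack") amounts to something like the following sketch. This is not CDT's or LPG's actual code, just the shape of the idea: the parser registers typedef names in a table, and the lexer consults that table when deciding which token kind to emit.

    #include <string>
    #include <unordered_set>

    enum class TokenKind { Identifier, TypedefName };

    struct SymbolTable {
        std::unordered_set<std::string> typeNames;   // filled in as the parser sees typedefs
    };

    // The lexer asks the parser-maintained symbol table which token kind to emit.
    TokenKind classify(const SymbolTable& syms, const std::string& spelling) {
        return syms.typeNames.count(spelling) ? TokenKind::TypedefName
                                              : TokenKind::Identifier;
    }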

If our work with LPG is successful then we will be looking at using it for C++ going forward. I have to admit I'm a bit apprehensive about this because it is a difficult problem. If anyone has any pointers I would appreciate it.

C++ parsing is always going to involve evaluation

Before the moaning:

  1. Have you looked at Wave for a preprocessor and lexer?
  2. I use CDT whenever possible, and it helps immensely with non-template code, but...

The biggest deficiency in CDT and Visual C++ is the lack of help with the hardest part of getting modern C++ correct: templates, or at least the use of them. No one gets by without using the STL and/or Boost, and these libraries are, by design, template-heavy. The core idea is to push evaluation into compile time so that the runtime is as efficient as, if not more efficient than, a hand-written C version. This unfortunately means that tools must actually understand the C++, at least as far as parsing it goes. Perhaps there should be a push for these libraries to be marked up (in a standard documentation format) so that some automated typedef-filling is available for at least the STL?
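For a concrete (if toy) sense of what "pushing evaluation into compile time" means for tools: in the sketch below, the value is produced entirely by template instantiation, so a tool that parses but never instantiates templates cannot know the result, or even which specialization gets chosen.

    template <unsigned N>
    struct Factorial {
        static const unsigned value = N * Factorial<N - 1>::value;
    };

    template <>                          // base case selected by specialization
    struct Factorial<0> {
        static const unsigned value = 1;
    };

    static const unsigned f5 = Factorial<5>::value;   // 120, computed by the compiler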

C preprocessor & C parser together

Many years ago I used an LALR(1) parser generator to create a "front end" that did both preprocessing and parsing of C simultaneously. It was a two-parser system and worked fine, although it was just an experiment. I'm quite sure my latest LR(*) parser generator could handle this situation. I would be interested in helping with this project. See LRSTAR.ORG for the download.