What to do about comments?

This question has been bugging me for a while: what's the standard practice for handling comments during parsing? Ignore them, or annotate the parse tree? My first thought is to simply let the lexer discard comments. But if the goal is to reproduce the source of the program via a pretty printer or some such, don't the comments need to come along for the ride?

What leads me to this line of inquiry is: if I'm a client purchasing a parser, should I expect my $20,000 C++ parser (for example; I've read C++ parsers are expensive) to preserve comments?

If you are buying a C++

If you are buying a C++ parser, you are really just paying for the support needed to fix the inevitable problems that will come up when the parser can't parse code you care about (C++ is that evil). In this case, comments are secondary (though I'm sure if you pay enough support money, your C++ parser vendor can preserve them).

Depends on your compiler. At least javadoc-ish comments on methods/classes/fields should be preserved in some documentation modes. You could also save comments in a map and then in your pretty printer match them back up to the trees they are related to. Alternatively, you could have a comment field in your parse tree data structure. I guess there are lots of options.
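The "save comments in a map and match them back up" option can be sketched as follows. This is only an illustration, assuming Python source as the input language so that the standard `tokenize` and `ast` modules can stand in for the lexer and parser; the helper names `collect_comments` and `attach_comments` are made up for this example, and the matching rule (a comment belongs to the statement directly below it) is just one possible heuristic.

```python
import ast
import io
import tokenize


def collect_comments(source):
    """Return {line_number: text} for comments that sit on a line of their own."""
    comments = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # tok.line is the physical line; skip comments trailing other code.
        if tok.type == tokenize.COMMENT and tok.line.lstrip().startswith("#"):
            comments[tok.start[0]] = tok.string
    return comments


def attach_comments(source):
    """Pair each top-level statement with the comment lines directly above it."""
    comments = collect_comments(source)
    attached = []
    for node in ast.parse(source).body:
        leading = []
        line = node.lineno - 1
        while line in comments:          # walk upwards over contiguous comment lines
            leading.insert(0, comments[line])
            line -= 1
        attached.append((type(node).__name__, leading))
    return attached


SOURCE = "# set up\n# the counter\nx = 0\n\ny = 1  # trailing\n"
RESULT = attach_comments(SOURCE)
```

Here the two comment lines end up attached to the first assignment, while the trailing comment on the last line is deliberately dropped by this sketch; a real tool would need a rule for those as well.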

I concur

If you haven't dealt with heavyweight industrial parsing before (of which C++ is the standard candle), reread the above comment. I'm going to embrace and extend the thoughts there, but you should get the main point in Sean's concision before moving on to my verbosity.

The underlying theme is that "industrial parsing is hard (expensive), so know what you need". If your output is destined solely for machines, you don't care about comment handling. If your output is for humans (such as javadoc or a translation), losing comments is intolerable. As mentioned, if you're paying then anything is possible, but overapproximating your feature needs is a dangerous financial approach. This answers your first question about standard practice: it's whatever most effectively gets the job at hand done without too much extra effort. Comment retention is rare in the literature and in off-the-shelf machinery because those areas are dominated by parsers aimed at compilers.

Sean's other vastly important point is that when shopping for such technology, extensibility will be as important as any particular feature. Comment preservation will mean little if your source doesn't parse right in the first place. As mentioned, parsing is a seriously non-trivial problem for C++, especially in the context of the C preprocessor. Chances are you'll then be in the market for name resolution, and things get plenty more interesting from there on. To again wantonly abstract the premise, one has to think about fundamental limitations.* You can wall yourself off from practical extensibility by "baking in" systemic assumptions. Comment handling might be layered on to an existing parse system without undue pain, but you also have to consider, for example, whether the parser will scale to your codebase, whether it can handle multiple input/output languages simultaneously, etc. Some of these features are hard to come by if they weren't there from the get-go.

As for having a comment field in the parse data structure, this is exactly the solution I've used in practice. Shap beat me to the punch on that, so I'll post details there.

* proper credit for these abstractions belongs to Ira Baxter, architect of DMS and my boss at Semantic Designs. We've apparently only been bitten by bad assumptions a couple times in fifteen years (10 million lines of code is big enough for anyone, right?)

Preprocessing. Just say no.

Comments are typically handled by the C preprocessor. And while comments are a little annoying to deal with, the preprocessor's other features cause the most trouble. C/C++ compilers normally run the preprocessor as a separate step -- the compiler proper only reads in fully preprocessed source code.

Handling the original source in a single step is difficult. You'll need an AST structure much more complicated than that of a compiler (and even then, I don't think it's possible to handle every corner case). You can have an open brace in one file, followed by a #include that brings the close brace in from another file. #ifdefs can also cut across your AST in unstructured ways.

There's also the well-known C problem of the parser needing to know all the type names. So before you can finish parsing a file, you have to descend into all the files it #includes.

I have a tiny bit of experience with the Edison Design Group's commercial C++ parser. While their software seemed to me to be of really high quality, their target market is compiler writers. Their AST doesn't preserve preprocessing directives (though they might have the option of preserving comments).

Try looking into open source software that needs to analyze code structure as well as re-display source code, like Doxygen or LXR.

how hard can it be, really?

CPP emits "#" directives to report filenames and line numbers properly. C compilers already deal with that and already decorate the AST with that information. They typically keep track of columns, as well.

To fully handle preserving comments and other preprocessor information, all that a typical parser needs to do is build an index that maps file/line/column positions to a number of items: the extents of the previous, following, and enclosing comments (if any); the "include stack"; and the "macro expansion stack". A "position" as a compiler might emit it in an error message is then "file X, line Y, column Z, as expanded by the macro definition at file A, line B, column C, as expanded by [....], as included from [....]".

In other words, ASTs *already* contain (as node attributes) file / line / column information -- and that is completely sufficient to index a record of "what CPP did". It shouldn't "hair up" parsers or lexers much at all.
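A rough sketch of such a position index, to make the idea concrete. Everything here is hypothetical naming (`Pos`, `Provenance`, `describe`); it only shows that the side table can be keyed by the file/line/column triple the AST already carries, not how any particular compiler stores it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Pos:
    """A file/line/column triple; frozen so it can be used as a dict key."""
    file: str
    line: int
    column: int

    def __str__(self) -> str:
        return f"file {self.file}, line {self.line}, column {self.column}"


@dataclass
class Provenance:
    """What the preprocessor did to produce the token at some position."""
    previous_comment: Optional[str] = None
    following_comment: Optional[str] = None
    enclosing_comment: Optional[str] = None
    macro_stack: List[Pos] = field(default_factory=list)    # innermost expansion first
    include_stack: List[Pos] = field(default_factory=list)  # innermost #include first


def describe(pos: Pos, index: Dict[Pos, Provenance]) -> str:
    """Render a position the way an error message might."""
    prov = index.get(pos, Provenance())
    parts = [str(pos)]
    parts += [f"as expanded by the macro definition at {p}" for p in prov.macro_stack]
    parts += [f"as included from {p}" for p in prov.include_stack]
    return ", ".join(parts)


# Hypothetical example data: one token that came out of a macro in an included header.
TOKEN_POS = Pos("x.c", 10, 5)
INDEX = {
    TOKEN_POS: Provenance(
        previous_comment="/* scale factor */",
        macro_stack=[Pos("m.h", 3, 9)],
        include_stack=[Pos("main.c", 1, 1)],
    )
}
MESSAGE = describe(TOKEN_POS, INDEX)
```

The AST nodes themselves stay unchanged; a tool that needs comments or macro history looks them up through the node's position.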

A separate question is the idea of defining new languages in which comments are constrained as to their location and are explicitly attached to a specific AST node (e.g., a comment attached to a function definition or attached to the declaration of one of the parameters to that function).


Representing all information, not just a slice

Yes, there are ways to preserve line-number information for error reporting. But I was talking about an AST that represents all the information in the original source code (as opposed to just a single preprocessing of it).

For example:

int f
#ifdef TEST
  = 2;
#else
  (int param) { return param * 2; }
#endif

What would the AST for this code look like?

comment status in new (academic) languages?

This brings some interesting questions about comments (in programming languages).

  • comments seem necessary; every language provides them
  • comments are often structured by a convention external to the language specification, for example the doxygen (or javadoc) comment syntax.
  • most languages treat comments at the lexical stage; a comment is essentially a space or a token delimiter.
  • comments are designed to be skipped by the parser. It is usually not expected (and not easy) to keep the comments in the AST.

    IIRC some old versions of Lisp had a (comment "string") expression which meant the same as () i.e. nil, but was a comment kept in the AST.
  • comments are embedded in the source code, not the other way around. But see the Literate Programming idea about that.
  • still today, most programming languages view comments as unstructured. But it could be otherwise (e.g. a comment being an HTML or LaTeX block, with a LaTeX or XHTML compatible syntax)

I also think that comment status is related to the status of source files. Still today, every programming language (with probably an exception for old smalltalk or lisp machines) seems organized around the idea of a source file edited with an external editor. Few languages are designed with some other ideas of interaction with the programmer: e.g. languages with hypertextual source code, or languages with a web wiki like interface to their developer.


Source-to-source transformation

If I'm going to use a pre-existing parser as a front end for a source-to-source transformation, whether a pretty printer or a cross-language translator, then I'd probably want the comments to be preserved.

IIRC some old versions of

IIRC some old versions of Lisp had a (comment "string") expression which meant the same as () ie nil but was a comment kept in the AST.

Well, Emacs Lisp and Python have documentation strings, e.g.:

From the GNU Emacs Lisp Reference Manual:

  (defun capitalize-backwards ()
            "Upcase the last letter of a word."
            (backward-word 1)
            (forward-word 1)
            (backward-char 1)
            (capitalize-word 1))
               => capitalize-backwards

You can then access the documentation string at runtime by using (documentation 'function) in Emacs Lisp or print function.__doc__ in Python.
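For the Python side, here is a small sketch showing that the docstring is both part of the parsed tree and available at runtime. The function is a made-up Python analogue of the Emacs Lisp example (it operates on a string argument rather than the buffer); `ast.get_docstring` and `__doc__` are standard Python.

```python
import ast

SOURCE = '''
def capitalize_backwards(word):
    """Upcase the last letter of a word."""
    return word[:-1] + word[-1:].upper()
'''

# The docstring survives parsing: it is an ordinary node in the tree.
tree = ast.parse(SOURCE)
doc_from_tree = ast.get_docstring(tree.body[0])

# It also survives compilation and is attached to the function object.
namespace = {}
exec(compile(tree, "<example>", "exec"), namespace)
doc_at_runtime = namespace["capitalize_backwards"].__doc__

print(doc_from_tree)
print(doc_at_runtime)
```

A `#` comment placed in the same spot would be gone by the time `ast.parse` returns, which is the distinction this subthread is about.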

Comments, annotations and pragmas

I think that in-AST comments, such as (comment "...") or Python docstrings, are the way to go. If you want to take comments into account when working on the program structure, they need to be part of the structure. Existing heuristics for attaching free-floating comments to nodes are fragile, and we could design something more solid.

Comments are related to other constructs that are not directly part of the core language: annotations and pragmas. Annotations have traditionally been attached to specific AST nodes, and we may handle comments as a special case of "meaningless" annotations.

I think there is a continuum between those features, which may or may not impact the execution, static semantics, or tooling of the program. Annotations that drive code generation (such as Python decorators) have strong semantics. Some compiler pragmas may change the compilation optimizations, but should not change the observable semantics of the program. Others (@deprecated, @nowarn("foo")) change the compiler's behavior at compile time, and finally some are only useful for non-semantic tools such as documentation generators. In-code comments are the least structured of all.

Are those different instances of a generic design? Should they be handled uniformly? If not, where should the frontier stand?

did you consider clang?

The LLVM project has been working pretty hard on their C/C++/Objective-C frontend called clang. It's designed to be used both for simple code generation and for hooking into an IDE to do static analysis and refactoring. They've got a business-friendly BSD license too. They haven't finished the C++ parser yet, though, but it's getting pretty far along.

First-class comments

I'm working on a project where we maintain a Javascript translator. Among other things there are backends for pretty-printing and documentation generation. Both depend on handling comments. Currently, comments are added as special nodes "dangling" from a proper AST node (and guessing which node a comment actually belongs to can lead to funny results at times). I've been thinking about changing that and making comment nodes "first-class", adding them to the current parent node wherever they appear, basically to retain as much as possible of the sequencing in the original source code. While this is fine for comments between statements in a block, it gets trickier for comments in, say, the control expression of a 'for(;;)' loop. You have to make the parser ignore those comment nodes when checking phrase structure, and your usual 'first' and 'second' branches for binary operators won't do anymore.

Comments are non-syntactic

The core problem is that comments can appear anyplace that white space can appear. For this reason, they are generally deleted by the lexer. Retaining them isn't a problem per se, but deciding what AST to attach them to is a problem. In

x /* intervening comment  */ + y

should the comment attach to the "x", to the "+", or to the expression AST? What about when multiple comments appear? Is white space between them significant? Generalize this and you will soon conclude that the lexer basically needs to handle comments as quasi-syntactic elements in order to sensibly preserve them. If the only goal is to be able to spew them back out, fine, but the minute you start doing AST transforms the question will become: how should comments be handled while re-writing?

Things like documentation comments work because they are quasi-syntactic.

In the LISP family of languages, documentation comments have traditionally been handled with strings to avoid this whole mess.


The question would be how close can we get. In your example, the '+' operator would usually be parsed as the root node of the expression tree. The parser could incarnate this as something like

  child[0] = parse('x')
  child[1] = parse('/* intervening comment  */')
  child[2] = parse('y')

That would still be ambiguous, since you would lose the information whether the comment was before or after the '+' in infix. But you retain the information that the comment was between operands 'x' and 'y'. Further comments could come in between, others might be added before 'x' and after 'y', always trying to keep them as far down the parse tree as possible. The child list would retain their relative order. That appears quite neutral to me and might be close enough, even for rewriting purposes, because comments are kept nearby and further decisions are delayed as much as possible.
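A minimal sketch of that child-list idea, with invented class names (`Leaf`, `Comment`, `BinOp`). The point it tries to show is that the 'first' and 'second' branches can still be offered as accessors that skip over comment children, while the raw child list keeps the source order.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Comment:
    text: str


@dataclass
class Leaf:
    name: str


@dataclass
class BinOp:
    op: str
    # Operands and comments share one list so their relative order is retained.
    children: List[Union["BinOp", Leaf, Comment]] = field(default_factory=list)

    def operands(self):
        """Children that matter for phrase structure, i.e. everything but comments."""
        return [c for c in self.children if not isinstance(c, Comment)]

    @property
    def left(self):
        return self.operands()[0]

    @property
    def right(self):
        return self.operands()[1]

    def comments(self):
        return [c for c in self.children if isinstance(c, Comment)]


# x /* intervening comment  */ + y
EXPR = BinOp("+", [Leaf("x"), Comment("/* intervening comment  */"), Leaf("y")])
```

As noted above, this representation does not record whether the comment came before or after the '+' itself; storing the operator token's index in the child list would be one way to keep that too.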

This is doable

In fact, if you install the CDT package for Eclipse and hand it input involving missing macro definitions, you will find that it produces a quasi-AST using very much this sort of idea.

My personal opinion is that it is enough to attach the comment to some plausible AST node (hopefully in a well-defined way) along with some information about the relative position between the comment and the AST node. Most of the reason to preserve comments is for the sake of transformation, and this would probably be enough. Perhaps more to the point, I'm not clear how to do better in principle.

Just as one possible example, every serious compiler defines a notion of "location" for every AST node. If one simply adds a linked list of comments to the AST node, each of which has a location, I think that might do fairly well.
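As a sketch of that last suggestion (the names `Node`, `Comment` and `Location` are invented here, and a Python list stands in for the linked list): because both the node and each comment carry a location, the relative position falls out of a comparison and does not need to be stored separately.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Location = Tuple[int, int]  # (line, column); tuples compare in source order


@dataclass
class Comment:
    text: str
    loc: Location


@dataclass
class Node:
    kind: str
    loc: Location
    comments: List[Comment] = field(default_factory=list)

    def comments_before(self):
        """Comments that start earlier in the source than this node."""
        return [c for c in self.comments if c.loc < self.loc]

    def comments_after(self):
        """Comments that start at or after this node's own location."""
        return [c for c in self.comments if c.loc >= self.loc]


# x /* intervening comment  */ + y, with the comment attached to the 'x' node
X_NODE = Node("ident", (1, 1), [Comment("/* intervening comment  */", (1, 3))])
```

Which node a comment gets attached to is still a policy decision; the sketch only shows that once it is attached, "before" and "after" are recoverable.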

This is indeed the

This is indeed the industrial solution that we* use; comments are attributes that decorate abstract syntax tree nodes. There are several other considerations to make in the design and implementation, though:

  • Does the comment attach to the token before or after? (we call these pre- and post-comments)
  • Where do you attach file-global (header, footer) comments?
  • Handling sequences of comments
  • Where do they go under tree compaction? (concrete to abstract tree, should be automatic)
  • Where do they go under tree manipulation? (transformation/rewrite, needs manual mangling)

We have answers to all these, though sometimes only as far as our customers' needs have pushed us.
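To illustrate why tree manipulation needs manual care, here is a toy sketch with pre- and post-comment attributes and a constant-folding rewrite. It is not how DMS does it; the `Node` layout and the policy of hoisting every surviving comment into the new node's pre-comments are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    pre_comments: List[str] = field(default_factory=list)
    post_comments: List[str] = field(default_factory=list)


def all_comments(node: Node) -> List[str]:
    """Collect a subtree's comments in source order: pre, children, post."""
    out = list(node.pre_comments)
    for child in node.children:
        out.extend(all_comments(child))
    out.extend(node.post_comments)
    return out


def fold_constants(node: Node) -> Node:
    """Rewrite `int + int` into one literal without losing the operands' comments."""
    if node.kind == "add" and all(c.kind == "int" for c in node.children):
        value = sum(int(c.text) for c in node.children)
        # Policy choice for this sketch: everything becomes a pre-comment.
        return Node("int", str(value), pre_comments=all_comments(node))
    return node


# 1 /* one */ + /* two */ 2
TREE = Node("add", children=[
    Node("int", "1", post_comments=["/* one */"]),
    Node("int", "2", pre_comments=["/* two */"]),
])
FOLDED = fold_constants(TREE)
```

Without the explicit hoisting step the rewrite would silently drop both comments along with the nodes they decorated.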

* Semantic Designs

In some sense, the source code

is meta information for the compiler. Most compilers don't seem to carry around the comments so much as they carry around pointers to the source. The main emphasis is on pointing to the line of code which does not compile, or on use in the debugging process (it's hard to debug with the raw assembly).

So I'd imagine that most compilers carry around the comments as part of the baggage of source meta information.

Retaining comments when refactoring code

"Retaining comments when refactoring code" describes an approach that collects comments separately from the AST (while keeping their positions) and then reassociates them with the AST on demand.

A final pointer (just for

A final pointer (just for the record): Douglas Crockford's Pratt parser for JavaScript keeps track of comments within the AST, as properties of nodes (and not as nodes themselves).