Syntax Desugaring Algorithm Question

Hello all!

I'm building a lisp in F# using FParsec. The language is called AML. I have a reader and an interpreter running, and I just now built a 'desugaring' function to rewrite AML code. For example, the expression a.b desugars to (access :m:b a). This is all well and good, but there is a problem. Once the desugarer outputs the transformed code text, it is passed to the reader. As mentioned, the reader uses FParsec to create the AST. When FParsec encounters an unparsable character sequence, it propagates some nice error position information, a somewhat useful error message, and a description of where the code text cannot be parsed. The issue is that this information will not be very meaningful to the user once the reader gets its input from the desugarer rather than the user's original code.
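To make the failure mode concrete, here's a rough Python sketch of the problem; the regex-based `desugar` below is my own stand-in, not AML's actual rewrite rules:

```python
import re

# Hypothetical stand-in for the textual desugarer: rewrite each "x.y"
# accessor into the prefix form "(access :m:y x)" as plain text.
def desugar(src: str) -> str:
    return re.sub(r"(\w+)\.(\w+)", r"(access :m:\2 \1)", src)

original = "(print a.b)"
desugared = desugar(original)
# The variable "a" sits at offset 7 in the original but at offset 20
# in the desugared text, so a parse error reported against the
# desugared text points at the wrong place in the user's source.
assert desugared == "(print (access :m:b a))"
```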

How do interpreters typically preserve this error information across transformed code so that the user can get error messages in terms of their original code?

Any practical or research material would be appreciated. For run-time performance and implementability, I prefer a simple algorithm to a more sophisticated one. In truth, I am a language-development newbie in many respects.

Thank you greatly!



I know that PLT has done some work on tracking errors back to their source across macro transformers and the like.

I expect you'll need to include a lot more metadata in your processing, which might mean avoiding text->text translations and instead using a more structured model after parsing (i.e. ASTs).
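For instance, the structured hand-off might look roughly like this (a Python sketch with invented names, not FParsec specifics):

```python
from dataclasses import dataclass

# Sketch: stages exchange trees, not text, and every node keeps the
# span it came from in the *original* source.
@dataclass
class Node:
    head: str
    args: tuple
    span: tuple  # (line, col) in the user's source

def desugar_access(node: Node) -> Node:
    # a.b  ==>  (access :m:b a), reusing the original node's span.
    if node.head == "dot":
        obj, field = node.args
        return Node("access", (":m:" + field, obj), node.span)
    return node

tree = Node("dot", ("a", "b"), (1, 8))
out = desugar_access(tree)
assert out.span == (1, 8)  # errors can still point at line 1, col 8
```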

Their PLDI paper about

Their PLDI paper about 'languages as libraries' advocates keeping a hashtable for each AST node. As compiler passes (macros) work on the AST, they update the hashtable. Each transformation can store whatever it wants, such as type information or raw source information (line numbers). This is very flexible and grew out of practice, but I also found it to be the least elegant part of the paper's discussion of language composition. I emphasize the practical point: as is evident from the paper's examples, it avoids a lot of the type hackery typical of Haskell DSL literature.
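As I recall it, the per-node table idea can be sketched like this (Python, names invented; see the paper for the real design):

```python
# Sketch of per-node property tables: every AST node carries an open
# dictionary that each compiler pass may read and extend.
class Node:
    def __init__(self, head, *children):
        self.head = head
        self.children = list(children)
        self.props = {}  # open-ended metadata table

def parse_stub(head, line):
    # The parser records raw source information.
    n = Node(head)
    n.props["source-line"] = line
    return n

def typecheck_stub(node):
    # A later pass adds its own metadata without disturbing the rest.
    node.props["type"] = "int"
    return node

n = typecheck_stub(parse_stub("lit", 42))
# Both kinds of metadata coexist on the same node:
assert n.props == {"source-line": 42, "type": "int"}
```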


Another relevant paper by the PLT folks: Debugging Macros.

I suppose that I could make

I suppose that I could make the desugarer produce an AST which carries positional data and other metadata, then have the reader consume that AST rather than text. That would be a lot of changes. May be necessary though if I can't just modify the reader to do it all.

BTW, thanks everyone for the reference information! It's helping me to conclude that it will be preferable to avoid a sophisticated approach, at least for now.

No erroneous output

You should make sure that your desugaring code never generates erroneous output. If there is an error in the input, it should be reported directly rather than being passed through to the next level. If the second stage sees bad syntax, that counts as a bug in the first stage.

Two prongs

I agree with John insofar as this is good hygiene for any compositional system. The problem of "original source location" persists past the parsing stage, however, once one considers runtime traceability. That is, semantic errors encountered when executing on a valid parse tree should also report information traceable to the original location.

Exposing information past a transformation raises more subtle problems when the transformation introduces significant alteration. Fortunately, when the action is a simple "desugaring", target-language errors are usually fairly evident in the source language.

Semantic checks

It's very common to delay semantic checks until after some level of desugaring. For one thing, quite apart from tracking source location, semantic checks are easier to do on a smaller language.

Maybe I should just punt here?

Looking at the feedback I'm getting, it seems I ought to discard the separate desugarer and make the existing reader able to read the same AST node in different ways. E.g., it would read the Accessor node as either (access :m:b a) or a.b. This will complect the reader code, but I don't yet see a reasonably easy solution to the problems caused by intermediate textual transformation.


In the compilers/interpreters I've written, I leave open a position for term annotation. For example:

data Term a = Const a C | Var a V | App a (Term a) (Term a) | ...

The parser adds an annotation with source information (i.e., it produces 'Term (Filename, (Lineno, Colno), (Lineno, Colno))' values). Desugaring phases (or any stage of compilation, really) just blindly copy that annotation from the source terms from which the new terms are derived.
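Transliterated into Python (a sketch of the same shape, not the commenter's actual code), that might read:

```python
from dataclasses import dataclass
from typing import Union

# Every constructor carries an annotation slot `ann`; the parser fills
# it with (filename, start, end) and later passes copy it verbatim.
@dataclass
class Var:
    ann: object
    name: str

@dataclass
class App:
    ann: object
    fn: "Term"
    arg: "Term"

Term = Union[Var, App]

def desugar(t: Term) -> Term:
    # A pass that rebuilds terms but blindly propagates annotations.
    if isinstance(t, App):
        return App(t.ann, desugar(t.fn), desugar(t.arg))
    return t

src = App(("main.aml", (1, 1), (1, 4)),
          Var(("main.aml", (1, 1), (1, 2)), "f"),
          Var(("main.aml", (1, 3), (1, 4)), "x"))
assert desugar(src).ann == ("main.aml", (1, 1), (1, 4))
```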

Position Information

The way I did it, and I have no idea whether it's the best approach, was to include position information in the AST. Your transformations, the desugaring, should then carry the position information over to the simplified AST correctly so that, whenever a semantic error occurs, the position in the original source can be reported.

In this manner, with desugaring, you lose the ability to present nice beautified AST nodes for error reporting (since those will very likely not represent the original AST well), but you keep the ability to point at the correct place in the source where the error occurred.

Functions for Annotations

Rather than extending the AST with meta-data, a possibility I was pursuing a few years ago focuses on use of functions for annotations. An example function looks like:

anno :: a -> b -> b
anno _ x = x

That is the runtime semantics, but the annotation function allows us to carry an extra `a` value in the source tree, which can be processed in a dedicated manner.

I can go many ways with this: either create dedicated annotation functions for certain types of data (like code position), or use `anno` for everything and specialize the data types based on the purpose of each annotation. I figured annotations would be used for other things too, such as comments and optimization or parallelization hints.
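The runtime half of this is trivial in any language; a Python sketch:

```python
# The annotation function: at run time it is the identity on its
# second argument, while a metadata-aware pass can instead inspect
# the first argument (here, a hypothetical source position).
def anno(metadata, x):
    return x

# Runtime semantics: the annotation is invisible to evaluation.
assert anno(("main.aml", 3, 14), 2 + 2) == 4
```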

With a few tweaks, annotations could also be modeled with monads or arrows. E.g. a monadic approach might update an abstract register with data about location in code, or treat this as generating a stream of meta-data while running the code. An arrows approach would have some advantage in supporting a more static analysis of metadata.

Monorefs not enough anyhow

When you try to track source refs across things like macro expansions, inlining, and the like, you will often find yourself asking: which of the two source references should I keep?

The general answer to this question is: both, and of course the general result is at least a list. Clearly you must do this to explain a simple chain of inline function expansions, since the compiler cannot know which call site the user thinks an error is located at: it will almost never be the very lowest level, and the top level is nonsense since it is the whole program.

I have some interesting ideas here. One is this: inline a function with parameter replacement, and you get two sets of references which are themselves far apart. First, you collapse all close-together references into a nice range, so now you have exactly two references in the expression. Then you extract the two ranges, combine them so one can distinguish caller and callee, and relabel all the terms with the combined reference.
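A tiny sketch of that relabeling step (the caller/callee pair representation below is my own, for illustration):

```python
# After inlining, each term in the expanded body trades its single
# source range for a (caller, callee) pair, so both locations survive.
def relabel(body_terms, caller_range):
    return [(term, {"caller": caller_range, "callee": rng})
            for term, rng in body_terms]

body = [("x+1", (40, 45)), ("y*2", (46, 51))]  # callee-side ranges
out = relabel(body, (10, 12))                  # call-site range
assert out[0][1] == {"caller": (10, 12), "callee": (40, 45)}
```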

Another technique I am thinking to implement is to maintain a reference stack, this is the static analogue of the backtrace a debugger gives at run time. You add "begin function f" and "end function f" at the start and end of every function, and after inline expansion you have a way, given a point in the code, to look back to find how that piece of code got where it is.

Most C compilers already do exactly this to trace #include files, so the technique isn't new.
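A minimal sketch of that scan (the marker representation is invented):

```python
# Sketch of the marker-stack idea: the expander brackets every inlined
# body with begin/end markers, and a later scan rebuilds a static
# "backtrace" for any position in the flattened code.
def static_backtrace(code, index):
    stack = []
    for i, item in enumerate(code):
        if i == index:
            return list(stack)
        kind, name = item if isinstance(item, tuple) else (item, None)
        if kind == "begin":
            stack.append(name)
        elif kind == "end":
            stack.pop()
    return list(stack)

code = [("begin", "f"), "stmt1", ("begin", "g"), "stmt2", ("end", "g"),
        ("end", "f")]
# stmt2 (index 3) got there via f, then g:
assert static_backtrace(code, 3) == ["f", "g"]
```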

Whatever method you use for reference tracking, consider that you must report locations not only for compile-time errors, but also run-time ones. It's not much use seeing "match failure" without a reference indicating which match failed!

So actually I suggest two methods: some "in AST" references which merge low-level parametric replacements with their contexts, and a higher-level representation such as the begin/end annotations, which requires some scanning to calculate. Whatever you do, tracking source references is a lot of work, hard to get right, and will cost a substantial fraction of your compile times. Too bad: most compiles fail, so generating good code is far less important than reporting on bad stuff :)

Traceback Patches

I wrote an article a while ago about patching Python tracebacks. It relies on two properties of the runtime: 1) tracebacks can be fully examined, 2) error reporting can be customized (by assigning a handler to sys.excepthook in the Python case).

The algorithm maps line information of the transformation target code Q onto the original user written code P. The central idea goes like this.

Suppose T is the failure token in TQ, which is the token sequence of Q, and V is the string value of that token. We want to determine where T comes from in TP. If T doesn't have a pre-image T' in TP because the value V doesn't exist in P, we can instead examine a neighbor of T.

Assume now that V exists in P. There may be n candidates for T' in TP. For each of those candidates we select a small token sequence in TP of length k (e.g. k = 7) with T' in the center (if possible). We do the same with the error token T in TQ. In the next step we create strings from those token sequences, beginning with the sequence associated with T, say ST. Those strings can easily be compared using the Levenshtein distance. The string ST' in P with the minimal edit distance to ST wins.

Actually it is a little more complicated, not only because there may still be many candidates for the optimal ST', but because some of those candidates fit much better with another token X in P which also has the string value V. So one instead attempts to find an optimal mapping of sets of tokens in Q having value V onto sets of tokens in P with the same value. For more details I recommend taking a look at the article.
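A compressed sketch of the matching step in Python (I compare the windows as token sequences rather than joined strings, which keeps the sketch short; the example data is invented):

```python
def levenshtein(a, b):
    # Standard dynamic-programming edit distance over two sequences.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def window(tokens, i, k=7):
    # A k-token window with position i as close to the center as possible.
    lo = max(0, i - k // 2)
    return tokens[lo:lo + k]

def best_preimage(TP, TQ, error_index, k=7):
    # Among tokens in TP sharing the error token's value, pick the one
    # whose surrounding window best matches the window around the error.
    value = TQ[error_index]
    target = window(TQ, error_index, k)
    candidates = [i for i, t in enumerate(TP) if t == value]
    return min(candidates,
               key=lambda i: levenshtein(window(TP, i, k), target))

# P: user code with two "a" tokens; Q: its desugared form.
TP = "x = a . b ; y = a . c".split()
TQ = "x = ( access :m:b a ) ; y = ( access :m:c a )".split()
# The reader fails on the second "a" in Q (index 13); the heuristic
# maps it back to the second "a" in P (index 8), i.e. the "a" of "a.c".
assert best_preimage(TP, TQ, 13) == 8
```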

The algorithm is relatively robust with respect to elisions, loss of token information, and the insertion of syntactical debris like delimiters. It is not language specific, doesn't hack symbol tables, and steers clear of atrocities like retrofitting lexical information into ASTs from which it was removed when the ASTs were created from parse trees.