Writing a Compiler Compiler...

Greetings Folks,

I've posted here in the past about a compiler framework I'm writing. I've also been *trying* to write a Compiler Compiler (with respect to tokenizer/parser automation); the issue I've run into is the area of Named Captures. When I started this project I didn't know a whole lot about parsers in general, but I've learned quite a lot in the process.

A little context might prove helpful:
The program uses a single file format to define rules:

ExternAliasDirective                       ::=
    "extern" "alias" Identifier:AliasName;   ;

and Tokens:

Identifier := 
    '@'? IdentifierOrKeyword;

UnicodeEscapeSequence :=
    "\\u" HexChar{4};

IdentifierOrKeyword :=
    IdentifierStartCharacter IdentifierPartCharacter*;

IdentifierStartCharacter :=
    LetterCharacter                         |
    '_'                                     |
    UnicodeEscapeSequence                   ;

IdentifierPartCharacter :=
    LetterCharacter                         |
    CombiningCharacter                      |
    DecimalDigitCharacter                   |
    ConnectingCharacter                     |
    FormattingCharacter                     |
    UnicodeEscapeSequence                   ;

LetterCharacter :=
    [:Lu::Ll::Lt::Lm::Lo::Nl:];

CombiningCharacter :=
    [:Mn::Mc:];

DecimalDigitCharacter :=
    [:Nd:];

ConnectingCharacter :=
    [:Pc:];

FormattingCharacter :=
    [:Cf:];

Tokens allow more flexibility in the actual patterns, such as character ranges (which can contain Unicode classes). Tokens aren't recursive: they can refer to other tokens, so long as no token in scope, directly or through another token, refers back to itself.
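
For illustration, a Unicode-class range like the one in LetterCharacter above boils down to a general-category lookup per character. A minimal sketch (Python used purely for illustration; the helper name is mine):

import unicodedata

# The Unicode general categories named by [:Lu::Ll::Lt::Lm::Lo::Nl:].
LETTER_CLASSES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def is_letter_character(ch):
    # Membership test for the LetterCharacter token above.
    return unicodedata.category(ch) in LETTER_CLASSES

print(is_letter_character("A"))  # True  ('A' is Lu)
print(is_letter_character("_"))  # False ('_' is Pc, a ConnectingCharacter)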

For brevity, if you have a large number of terminals that you want to classify under a single name (like 'keywords'), you can simply create an alternation between them, giving each a name. So far the system is smart enough to detect these groups and create proper state machines for them (they're a little different in that they contain no repetitions or character ranges and don't refer to other tokens).

LinqKeywords                               :=
             "ascending":Ascending;         |
                    "by":By;                |
            "descending":Descending;        |
                  "from":From;              |
                "equals":Equals;            |
                  "into":Into;              |
                  "join":Join;              |
                   "let":Let;               |
                 "group":Group;             |
               "orderby":OrderBy;           |
                 "where":Where;             ;

If a language only ever refers to an alternation like the one above by its classification name (such as 'LinqKeywords'), only one 'symbol' is created for the language; however, if one of the named elements is referenced individually, each item in the classification is given its own symbol. Differentiating which one was actually selected is still possible, but for the parser's purposes that data is irrelevant. This is important for context awareness: if you require 'as' to be present at a given point, and 'ascending' is merely another of the possible symbols, LinqKeywords won't even enter the picture, because it isn't relevant there.
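
To make that concrete, the symbol-creation decision is roughly this (Python sketch, all names mine):

# If only the group name is referenced, the whole alternation collapses to a
# single symbol; if any named member (e.g. From) is referenced individually,
# every member gets its own symbol so the parser can tell them apart.
def symbols_for_group(group_name, members, referenced_names):
    if any(m in referenced_names for m in members):
        return ["%s.%s" % (group_name, m) for m in members]  # one symbol per keyword
    return [group_name]  # single collapsed symbol

linq = ["Ascending", "By", "Descending", "From", "Equals",
        "Into", "Join", "Let", "Group", "OrderBy", "Where"]
print(symbols_for_group("LinqKeywords", linq, {"From"}))  # eleven symbols
print(symbols_for_group("LinqKeywords", linq, set()))     # ['LinqKeywords']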

I have enough information from the file format to build these relationships and to build the state machines for the lexical analysis phase. The issue is: the lexer has no captures, so even if I were able to get the parser working, it would have no viable means of capturing the information it's validating.
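
To make the capture problem concrete: when ExternAliasDirective matches, I need to recover which span of input matched Identifier:AliasName. Something along these lines is what I mean (a rough Python sketch, all names mine; I've seen transitions annotated this way called "tagged transitions" in the literature):

# Toy pattern: 'a' (b*) 'c', where the b-run is the named capture.
# Transitions map state -> {char: (next_state, tags)}; a tag records an
# input offset (plus a small delta) when its transition fires.
TRANSITIONS = {
    0: {"a": (1, [("cap.start", 1)])},              # capture begins after 'a'
    1: {"b": (1, []), "c": (2, [("cap.end", 0)])},  # capture ends before 'c'
}
ACCEPTING = {2}

def run(text):
    state, tags = 0, {}
    for i, ch in enumerate(text):
        step = TRANSITIONS.get(state, {}).get(ch)
        if step is None:
            return None  # no transition: reject
        state, fired = step
        for name, delta in fired:
            tags[name] = i + delta  # remember where the tag fired
    if state not in ACCEPTING:
        return None
    return text[tags["cap.start"]:tags["cap.end"]]

print(run("abbc"))  # -> 'bb', the span that matched b*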

Does anyone here have any suggestions on where to go next? I'm using a deterministic state machine for the lexical analysis phase; it does the usual set comparisons necessary to decide which path to follow, and the context associated with the symbols in play is available. The issue is that I don't know how to properly represent repetitions and the like. Is my approach wrong, or is a non-deterministic approach more feasible?
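
To pin the repetition question down: a bounded repetition like HexChar{4} can at least be represented deterministically by unrolling it into a fixed chain of states (Python sketch below, names mine); it's the unbounded forms like IdentifierPartCharacter*, which loop back on themselves while captures are in play, that resist a clean encoding.

HEX = set("0123456789abcdefABCDEF")

def unroll_repetition(char_set, count, start_state=0):
    # Build transitions for char_set{count} as a chain of `count` states.
    transitions = {}
    for i in range(count):
        transitions[start_state + i] = {ch: start_state + i + 1 for ch in char_set}
    return transitions, start_state + count  # table and accepting state

table, accept = unroll_repetition(HEX, 4)
print(len(table), accept)  # 4 chained states, accepting state 4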

For further context: I used the standard programming-language means of defining a string to represent a simple terminal symbol because I'm self-taught, and when I started this project (a ... while ago) it seemed the most 'natural'. Case insensitivity is easily represented by putting '@' before the string to denote that case is unimportant. For a state machine this representation is trivial.
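
Concretely, a case-insensitive terminal like @"from" just means each state along the terminal's chain accepts both forms of the expected character (Python sketch, names mine):

def transitions_for(literal, case_insensitive):
    # One accepted character set per state along the terminal's chain.
    table = []
    for ch in literal:
        chars = {ch.lower(), ch.upper()} if case_insensitive else {ch}
        table.append(chars)
    return table

print(transitions_for("from", case_insensitive=True))
# [{'f', 'F'}, {'r', 'R'}, {'o', 'O'}, {'m', 'M'}]  (set ordering may vary)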

Edit: I guess some information about the goals would be helpful.
This project aims to do the following: generate a lexer which properly creates tokens of the kinds provided; create an object model to represent the structure of the rules; and generate an LL(*) parser which can construct that object model in the way it was defined. I've handled left recursion on my own successfully, so the main issue is figuring out a way to automate it all. A friend of mine suggested writing a parser from scratch and then building a mental map of how I could do it through automation. I thought this was a good idea, but the language I'm working on has 181 token symbols and 277 rules. Any suggestions on starting smaller (such as which language or test case would be helpful to this end)?

Edit 2: As far as automation goes, I'm not unfamiliar with the process. I've written a program to generate an ECMA-335-compliant metadata parser (minus Blob signatures, which had to be done by hand), since the format was straightforward. The variable sizes of the fields within the tables were handled through a simple state machine, created by analyzing the references within the tables and assigning each potentially variable-length reference type a significant bit in the state machine's internal state.
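
That width trick in miniature (Python sketch; the reference types and cut-off values here are illustrative stand-ins, the real coded-index rules come from the ECMA-335 spec):

# Each reference type that might need a 4-byte index instead of a 2-byte one
# gets its own bit; the combined state then gives any field's width in one test.
REF_TYPE_BIT = {"TypeDefOrRef": 0, "HasConstant": 1, "HasCustomAttribute": 2}

def compute_state(row_counts, two_byte_limits):
    # Set a bit for every reference type whose target tables have grown past
    # what a 2-byte coded index can address.
    state = 0
    for ref_type, bit in REF_TYPE_BIT.items():
        if row_counts[ref_type] > two_byte_limits[ref_type]:
            state |= 1 << bit
    return state

def field_width(state, ref_type):
    return 4 if state & (1 << REF_TYPE_BIT[ref_type]) else 2

state = compute_state(
    {"TypeDefOrRef": 70000, "HasConstant": 10, "HasCustomAttribute": 500},
    {"TypeDefOrRef": 1 << 14, "HasConstant": 1 << 14, "HasCustomAttribute": 1 << 11},
)
print(field_width(state, "TypeDefOrRef"))  # 4: too many rows for a 2-byte index
print(field_width(state, "HasConstant"))   # 2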

IDE/UIs for multiple dispatch?

Subtext was (I think) going to give us a UI for dealing with the complexities of conditional logic. Whenever I read about more-than-single-dispatch, a spreadsheet UI comes to mind. Nice to know others have thought of it, and also complained about it. From Object-Oriented Multi-Methods in Cecil [pdf], 1992:

Hebel and Johnson developed a special browser to manage the highly-stylized double dispatching code for arithmetic over numbers and matrices in Smalltalk-80 [Hebel & Johnson 90]. Their browser presents a two-dimensional spreadsheet-like view of all combinations of numeric and matrix argument types for a particular arithmetic message, with each entry reporting whether the corresponding argument type combination defines a method or merely inherits one from either the receiver inheritance chain or the argument inheritance chain. Programmers can manipulate implementations of arithmetic messages through the interface provided by the browser, and it will in turn generate many of the double dispatching functions automatically. While this browser makes double dispatching more manageable, it does not completely solve the problems with double dispatching. Users who examine numeric classes through the normal Smalltalk-80 browser still are confronted with a large number of (automatically-generated) dispatching routines. Additional browsers would be required for other kinds of messages (such as the displayOn message) to receive the same sorts of benefits. In effect, their browser partially simulates the functionality of multiple dispatching in the programming environment; we argue instead for uniform language support of multiple dispatching. Nevertheless, their interface might be useful even for a multiple-dispatching language such as Cecil to help display and organize multi-methods.
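
The quote's point in miniature (a Python sketch of my own, not from the paper): classic double dispatch needs one hand-written routine per receiver/argument type pair, so arithmetic over k numeric and matrix classes breeds on the order of k^2 methods like these, which is exactly the boilerplate that browser was built to manage.

class Integer:
    def add(self, other):        return other.add_to_integer(self)
    def add_to_integer(self, i): return "int+int"     # Integer + Integer
    def add_to_matrix(self, m):  return "matrix+int"  # Matrix + Integer

class Matrix:
    def add(self, other):        return other.add_to_matrix(self)
    def add_to_integer(self, i): return "int+matrix"  # Integer + Matrix
    def add_to_matrix(self, m):  return "matrix+matrix"

print(Integer().add(Matrix()))  # 'int+matrix'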

Is anybody these days working on some UI/Avalon panacea along these lines, perchance?