Writing a Compiler Compiler...

Greetings Folks,

I've posted here in the past about a compiler framework I'm writing. I've also been *trying* to write a Compiler Compiler (with respect to the tokenizer/parser automation); the issue I've run into is the area of Named Captures. When I started this project, I didn't know a whole lot about parsers in general, but I've learned quite a lot in the process.

A little context might prove helpful:
The program uses a single file format to define rules:

ExternAliasDirective                       ::=
    "extern" "alias" Identifier:AliasName;   ;

and Tokens:

Identifier := 
    '@'? IdentifierOrKeyword;

UnicodeEscapeSequence :=
    "\\u" HexChar{4};

IdentifierOrKeyword :=
    IdentifierStartCharacter IdentifierPartCharacter*;

IdentifierStartCharacter :=
    LetterCharacter                         |
    '_'                                     |
    UnicodeEscapeSequence                   ;

IdentifierPartCharacter :=
    LetterCharacter                         |
    CombiningCharacter                      |
    DecimalDigitCharacter                   |
    ConnectingCharacter                     |
    FormattingCharacter                     |
    UnicodeEscapeSequence                   ;

LetterCharacter :=
    [:Lu::Ll::Lt::Lm::Lo::Nl:];

CombiningCharacter :=
    [:Mn::Mc:];

DecimalDigitCharacter :=
    [:Nd:];

ConnectingCharacter :=
    [:Pc:];

FormattingCharacter :=
    [:Cf:];

Tokens allow for more flexibility in the actual patterns, such as character ranges (which can contain Unicode classes). Tokens aren't recursive: a token can refer to other tokens, so long as none of those tokens, directly or within the token's scope, refers back to the original token.

For brevity, if you have a large number of terminals that you want to classify under a single name (like 'keywords'), you can simply create an alternation between them, giving each a name. So far the tool is smart enough to detect these groups and create proper state machines for them (they're a little different in that they don't contain repetitions or character ranges, and don't refer to other tokens):

LinqKeywords                               :=
             "ascending":Ascending;         |
                    "by":By;                |
            "descending":Descending;        |
                  "from":From;              |
                "equals":Equals;            |
                  "into":Into;              |
                  "join":Join;              |
                   "let":Let;               |
                 "group":Group;             |
               "orderby":OrderBy;           |
                 "where":Where;             ;

If a language only refers to an alternation like the one above by its classification name (such as 'LinqKeywords'), only one 'symbol' is created for the language; however, if one of the named elements is referenced on its own, each of the items from the classification is given its own symbol. Differentiating which one was actually selected is still possible, but for the sake of the parser that data is irrelevant. This is important for context awareness: if you require 'as' to be present at a given point, and 'ascending' is merely another of the possible symbols, LinqKeywords won't even enter the picture because it's not relevant there.
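
To make the context awareness concrete, here's a rough sketch of what I mean (the type and member names are made up for illustration; this isn't the generated code): the parser hands the lexer the set of symbols that are valid at the current point, and only the machines for those symbols are even considered.

using System;

[Flags]
enum Terminal
{
    None      = 0,
    As        = 1 << 0,
    Ascending = 1 << 1,
    // ... one flag per symbol the grammar can request.
}

static class ContextAwareLexer
{
    // Only the machines for symbols in 'validHere' are run; 'ascending' is
    // never tried when the parser only allows 'as' at this point.
    public static Terminal Next(string text, ref int position, Terminal validHere)
    {
        if ((validHere & Terminal.Ascending) != 0 &&
            text.AsSpan(position).StartsWith("ascending"))
        {
            position += "ascending".Length;
            return Terminal.Ascending;
        }
        if ((validHere & Terminal.As) != 0 &&
            text.AsSpan(position).StartsWith("as"))
        {
            position += 2;
            return Terminal.As;
        }
        return Terminal.None;
    }
}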

I have enough information from the file format to make these relationships and build the state machines for the lexical analysis phase. The issue is that the lexer has no captures, and thus, even if I were able to get the parser working, I'd have no viable means to capture the information it's validating.

Does anyone here have any suggestions on where to go next? I'm using a deterministic state machine for the lexical analysis phase; it does the usual set comparisons necessary to differentiate which path to follow, and the context associated with the symbols in play is available, but I don't know how to properly represent repetitions and the like. Is my approach wrong, or is a non-deterministic approach more feasible?

For further context: I used the standard programming-language means of defining a string to represent a simple terminal symbol because I'm self-taught, and when I started this project (a ... while ago) it seemed the most 'natural'. Case insensitivity is easily represented by placing '@' before the string to denote that case is unimportant. For a state machine this representation is trivial.
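
For illustration, a minimal sketch of why the '@' form is trivial for a state machine (the transition-table shape here is just an example, not the generator's actual representation): a case-insensitive literal simply contributes both cases of each character as edges to the same next state.

using System.Collections.Generic;

static class CaseInsensitiveEdges
{
    // For a case-sensitive literal, one edge per character is added; for an
    // '@'-prefixed literal, both the upper- and lower-case form of each
    // character lead to the same next state.
    public static void AddLiteral(
        Dictionary<(int state, char input), int> transitions,
        int startState, string literal, bool caseInsensitive)
    {
        int state = startState;
        foreach (char c in literal)
        {
            int next = state + 1;                      // illustrative state numbering
            transitions[(state, c)] = next;
            if (caseInsensitive)
            {
                transitions[(state, char.ToUpperInvariant(c))] = next;
                transitions[(state, char.ToLowerInvariant(c))] = next;
            }
            state = next;
        }
    }
}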

Edit: I guess information about goals is helpful.
This project aims to do the following:
Generate a lexer which properly creates tokens of the kinds provided, create an object model to represent the structure of rules, and generate an LL(*) parser which can construct the object model in the way it was defined. I've handled left recursion on my own successfully, so the main issue is figuring out a way to automate it all. A friend of mine suggested writing a parser from scratch and then building a mental map of how I could do it through automation. I think this is a good idea, but the language I'm working on has 181 token symbols and 277 rules. Any suggestions on starting smaller (such as which language, or what tests, would be helpful to this end)?

Edit 2: As far as automation goes, I'm not unfamiliar with the process. I've written a program to generate an ECMA-335-compliant metadata parser (minus Blob signatures, which had to be done by hand) because the format was straightforward; the variable sizes of the fields within the tables were handled through a simple state machine, created by analyzing the references within the tables and assigning each potentially variable-length reference type a significant bit in the state machine's internal state.
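
Roughly, the bit-per-reference idea looks like this (the flag names and layout below are illustrative only, not the generator's actual encoding): in ECMA-335 metadata an index is stored in 2 bytes unless the referenced table is large enough to need 4, so each reference kind that can change size gets its own bit, and every field read picks its width from that bit.

using System;

[Flags]
enum IndexWidths
{
    None          = 0,
    WideTypeDef   = 1 << 0,
    WideField     = 1 << 1,
    WideMethodDef = 1 << 2,
    // ... one bit per reference kind whose size depends on table row counts.
}

static class MetadataSizes
{
    // The state is computed once from the row counts; afterwards every field
    // read just asks: is this reference 2 or 4 bytes wide?
    public static int IndexSize(IndexWidths state, IndexWidths flag)
        => (state & flag) != 0 ? 4 : 2;
}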


Little unclear

It seems to be a little unclear what you are trying to do - I'm fairly sure that your question is in this sentence, which I'm struggling to parse:

The issue is that the lexer has no captures, and thus, even if I were able to get the parser working, I'd have no viable means to capture the information it's validating.

When you say that you want an object model to represent the structure of your rules - is this a tree? The example grammar looks like a variant of EBNF, so it's not clear what you mean by how to handle repetition - your grammar already seems to encode it. If you mean how to represent it in the output, then could you not simply have multiple children of a node with the same non-terminal type?

Given that you are interested in LL(*) parsers and in automatically generating the output format as well as the parser - have you looked at the changes in v4 of ANTLR, as the main motivation there is the same?

Re: Structure

The representation of the rules and tokens is in a simpler form than a tree. On parse, they're placed in a (Rule/Token)ExpressionSeries for the whole token/rule, with an Expression for each alternation of the rule/token within the ExpressionSeries. Each Expression represents the concatenation of the elements defined within the grammar. Tokens which reference other tokens are 'inlined' into the parent token once the references all check out.

Once the structure is determined in full, I construct a non-deterministic state machine to represent the token/rule (which is later reduced into a deterministic automaton), marking the source information on each node as Initial (for starting out), Intermediate (for internal nodes of a given item), or Final (for the edges of those entities). The major issue I'm having is repetitions, and how to handle ambiguities in the event that the edge state of a given source element continues further, potentially overlapping part of the next element in a repetition (example of a repetition: ([A-Z]+'>'[A-Z]+)+). Granted, the total number of sub-states needed there is one, and the longest match would win out. On more complex expressions there's the chance of more overlap if the rank of the repetition is higher: (([A-Z]+'.'[A-Z]+) ('.' [A-Z]+)?)+. Granted, that's a bit unrealistic, but I'm not sure how I should handle such situations.
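
For reference, this is roughly how a one-or-more repetition gets wired into the non-deterministic machine (NfaState and the null-as-epsilon convention below are simplified stand-ins, not my actual structures): the sub-expression's final state gets an epsilon edge back to its start, and that loop is exactly where the overlap appears, because the deterministic construction then has to consider 'finish this element' and 'start the next iteration' in the same state set.

using System.Collections.Generic;

sealed class NfaState
{
    // input == null represents an epsilon (no-input) transition.
    public readonly List<(char? input, NfaState target)> Edges = new();
    public bool IsFinal;
}

static class NfaBuilder
{
    // One-or-more repetition: loop from the sub-expression's end back to its
    // start. In ([A-Z]+'>'[A-Z]+)+ the ambiguity arises because "continue the
    // current [A-Z]+" and "start the next iteration" land in the same
    // deterministic state set.
    public static (NfaState start, NfaState end) OneOrMore(NfaState subStart, NfaState subEnd)
    {
        subEnd.Edges.Add((null, subStart));
        return (subStart, subEnd);
    }
}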

These considerations are only relevant to tokens which capture information about the lexical pass over the text. Transducers (not the formal variation), which are tokens that represent a series of symbols by name under a common classification (such as keywords), don't require such measures because there's usually only one valid result. There are cases where your grammar needs both possible results, such as '>>' and '>' overlapping; these ambiguities need to be defined explicitly.

The third type is recognizers: these aren't concerned with capturing anything other than the match itself, which means they can perform a few more state optimizations on the right-hand side, since how you reached your goal is less important, so long as you don't change the function of the automaton.

An example would be "state" | "ablate": once the 'st' in state or the 'abl' in ablate is observed, the automaton unifies, because the only possible match at that point is 'ate'. Doing this doesn't change the function of the automaton, but it does obscure which path you followed to get there. That's not ideal if you're trying to capture which keyword you're matching (in the form of a symbol index, versus the whole keyword's characters), but it's useful for reducing complex automata further when you're only concerned with what characters were within the match.
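
To make that concrete, here's a hand-built sketch of the merged recognizer (the state numbers and table form are just for illustration): once 'st' or 'abl' has been consumed, both paths feed the same suffix states for 'ate'.

using System.Collections.Generic;

static class StateOrAblate
{
    // 0 is the start state; 8 is the accepting state. After "st" (via 1) or
    // "abl" (via 2 and 3), both paths land in state 5 and share the "ate" suffix.
    static readonly Dictionary<(int state, char input), int> Transitions = new()
    {
        [(0, 's')] = 1, [(1, 't')] = 5,
        [(0, 'a')] = 2, [(2, 'b')] = 3, [(3, 'l')] = 5,
        [(5, 'a')] = 6, [(6, 't')] = 7, [(7, 'e')] = 8,
    };

    public static bool Matches(string input)
    {
        int state = 0;
        foreach (char c in input)
            if (!Transitions.TryGetValue((state, c), out state))
                return false;
        return state == 8;
    }
}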

I think part of the reason I'm having such difficulty is the scope of what I want to do. Here's the description of a Number in the language I'm writing:

Number := 
    "0x" HexChar:WholePart;+  
    (
        @'U':Unsigned; @'L':Long;?          | 
        @'L':Long; @'U':Unsigned;?
    ):ValueType;? |
    DecimalNumber:WholePart;+ 
    (
        (
            @'U':Unsigned; @'L':Long;?      | 
            @'L':Long; @'U':Unsigned;?      | 
            @'D':Double;                    | 
            @'F':Single;                    | 
            @'M':Decimal;
        ):ValueType;                        | 
        '.' DecimalNumber:FloatingPart;+ 
        (
            @'e' 
            (
                '+':Positive;               | 
                '-':Negative;
            ):Sign;Default=Positive;? 
            DecimalNumber:Value;+
        ):Exponent;? 
        (
            @'D':Double;                    | 
            @'F':Single;                    | 
            @'M':Decimal;
        ):ValueType;Default=Double;?
    )?;

HexChar                                    :=
    [0-9A-Fa-f]                             ;

DecimalNumber                              := 
    [0-9]+                                  ;

As you might notice, there are cases where the number matches constant sets; one goal is to extract their structure and enable the resulting object model to note whether it was marked with D, U, F, L, M, or the like. Extracting this information is fairly straightforward, but linking it to the automaton is where it becomes difficult (mostly for complex object lifts; I think I'm close to working it out... but I figure someone here might have some insight).

Example

Could you expand on your example a little bit? Say I try to parse "0xfeL" - what would be the exact output: ValueType=Long, WholePart=HexChar=f, WholePart=HexChar=e perhaps?

I'm a little bit unsure what information you are trying to recover at the end of the match, as the properties/attributes that you've tagged seem to overlap.

In general though: have you considered a stack of attribute/value pairs, with annotations on some of the edges in your state machine, so that when an edge is followed its attribute/value pair is pushed onto the stack, and at the end you pop everything off to recover the merged trace?
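
Something along these lines, as a minimal sketch (all names hypothetical): edges optionally carry an attribute/value pair, the pair is pushed whenever the edge is taken, and at the end the accumulated trace is read back to assemble the result.

using System;
using System.Collections.Generic;

sealed class AnnotatedEdge
{
    public int Target;
    public (string Attribute, string Value)? Annotation; // e.g. ("ValueType", "Long")
}

static class AnnotatedRunner
{
    // 'step' looks up the edge for (state, input); a null result means no
    // transition exists and the run stops (error handling elided).
    public static List<(string Attribute, string Value)> Run(
        string input, Func<int, char, AnnotatedEdge> step)
    {
        var trace = new List<(string Attribute, string Value)>();
        int state = 0;
        foreach (char c in input)
        {
            var edge = step(state, c);
            if (edge == null)
                break;
            if (edge.Annotation.HasValue)
                trace.Add(edge.Annotation.Value);  // push the capture for this edge
            state = edge.Target;
        }
        return trace; // read back (or pop in reverse) to recover the merged trace
    }
}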

Re: Example

The end goal, if I can manage to pull it off, is a parser generator whose generated parser will parse the text and give you an object which represents the text you gave it. Similar to regular expressions, except cleaner, with expression references, recursion for rules, and an object model which maps directly to your syntax outline.

In the case of 0xfeL (or 0xfel, 0xFeL, 0xFel, 0xfEL, 0xfEl, 0xFEl, or 0xFEL, since the '@' before 'L' means 'case insensitive'), it would give you ValueType=Long, WholePart='fe', and WholePartType=HexChar. Since there's no means to inject what to do when you match something, when determining the numeric value of the result you would have to use the context it defines about the parse to determine its value. In essence you'd be somewhat reparsing it, but the context of how to do it would be handed to you.

The idea is: if you're going down two different parse paths, the 'names' in play can overlap. Things like WholePartType would only be necessary if two values of the same name have differing types but the same underlying result value (both would capture a string.)

Calculating the value of a number in a language is trivial, especially when you know the base you're working in.

Ideally, if you gave it the string 65.25e-7m, you'd get a Number with ValueType=Decimal; Exponent wouldn't be null, with Sign=Negative and Value=7; and WholePart=65, WholePartType=DecimalNumber, FloatingPart=25. That's the goal, at least.
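
In object-model terms, the shape I'm aiming for is roughly this (the class and property names below are only an illustration of the goal; nothing is generated yet):

enum NumberValueType { Unsigned, Long, Double, Single, Decimal }
enum ExponentSign { Positive, Negative }

sealed class ExponentInfo
{
    public ExponentSign Sign = ExponentSign.Positive; // Default=Positive from the grammar
    public string Value;                              // "7" for 65.25e-7m
}

sealed class NumberNode
{
    public string WholePart;           // "65" (or "fe" for 0xfeL)
    public string WholePartType;       // "DecimalNumber" or "HexChar"
    public string FloatingPart;        // "25"; null when there is no '.' part
    public ExponentInfo Exponent;      // null when no exponent was matched
    public NumberValueType? ValueType; // Decimal for the 'm' suffix, Long for 'L'
}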

Trivial?

Calculating the value of a number in a language is trivial, especially when you know the base you're working in.

Ideally, if you gave it the string 65.25e-7m, ...

What's your definition of "trivial"? For most of us mere mortals, floating-point textual representation I/O was a black art until 1990, when we were shown how to do it properly by Clinger (input) and by Steele and White (output). See "Proceedings of the 1990 ACM Conference on Programming Language Design and Implementation" for details.

Reference on 'Trivial?'

Can you point me to "Proceedings of the 1990 ACM Conference on Programming Language Design and Implementation?"

I think we're talking about different things here. I merely meant that the result would be:

// Accumulates the whole part, fractional part, and exponent digit strings into
// a double, then applies the signed exponent as a power of ten.
private static double CalcResult(string WholePart, string FloatingPart, ExponentSign ExponentSign, string ExponentValue)
{
    int state = 0;
    double result = 0;
getDoubleVal:
    string current = string.Empty;
    double curResult = 0;
    if (state == 0)         // pass 0: whole part
        current = WholePart;
    else if (state == 1)    // pass 1: fractional part
        current = FloatingPart;
    else if (state == 2)    // pass 2: exponent digits
        current = ExponentValue;
    // Sum digit * 10^position, using negative positions for the fractional part.
    for (int i = 0, c = current.Length; i < c; i++)
        curResult += ((double)(current[(c - 1) - i] - '0')) * Math.Pow(10, state == 1 ? -(c - i) : i);
    switch (state)
    {
        case 0:
        case 1:
            state++;
            result += curResult;
            goto getDoubleVal;
        case 2:
            // Scale by 10^(±exponent) according to the captured sign.
            result *= Math.Pow(10, (ExponentSign == Program.ExponentSign.Positive ? 1 : -1) * curResult);
            break;
    }
    return result;
}

Sorry for the late reply, but I had to go to work. Did we mean different things?

See session 4 of

See session 4 of http://www.informatik.uni-trier.de/~ley/db/conf/pldi/pldi90.html

Re: Session 4

Thanks for the reference.

I haven't done much work with floating-point arithmetic. Increasing the precision beyond the target data type makes sense: since each step in the example I gave is an approximation, staying at the target precision would likely yield a suboptimal result.

The main focus was that hexadecimal/decimal/standard base conversions are fairly straightforward when you know the source base and your target base (10, in this case).
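
To be concrete about what I meant, the whole-part conversion for the hexadecimal case is just a loop once you know the base (a standalone sketch, not code from the generator):

static class HexConversion
{
    // Converts the captured whole part of a hex literal, e.g. "fe" from 0xfeL.
    public static ulong WholePartValue(string wholePart)
    {
        ulong value = 0;
        foreach (char c in wholePart)
        {
            int digit =
                c >= '0' && c <= '9' ? c - '0' :
                c >= 'a' && c <= 'f' ? c - 'a' + 10 :
                                       c - 'A' + 10;
            value = value * 16 + (ulong)digit;
        }
        return value;
    }
}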

I don't know enough about the IEEE standard to say whether it's trivial or not. I'm going to guess not, due to the fact that there's a standard behind it...?

Edit: You were correct; increasing the precision of the type used for the calculation increased the precision of the result, though I'm sure there's a better way to do this, such as encoding the results directly in the format the floating-point value is stored in. I'm thinking the best method would be to serialize the bits of the floating-point value directly.

AST

It sounds like the piece you are missing is the abstract syntax tree (AST). Once you look to do anything besides just accepting the grammar, you'll want your parser to produce some kind of AST representation.

Getting a reasonable-looking AST out of the grammar definition is not quite as simple as it looks. I would suggest working backwards from the kind of AST you want to work with, and then figuring out what you need to do in your grammar to get that. For example, given the definition of Number in your language, one way to represent it in Java is something like:

class Number
{
}

class HexNumber extends Number
{
   HexNumber(String digits, ValueType valueType) {
   }
}

class DecimalNumber extends Number
{
   ...
}


class Expression
{
}

class AddExpression extends Expression
{
   AddExpression(Number a, Number b) {
   }
}

You may need to annotate rules in your grammar to get the output you want. For example, some production rules might be syntactic only, and you still want to represent those with just a string rather than a tree element.

The other suggestion I would have is to try to get your parser generator working with small, trivial test languages. That way it actually is feasible to build the parser by hand. Any problem your large language is going to have should also be exhibited by some small language.

Digraphs

Well, after some messing around, I worked on creating digraphs of the rules and tokens of a given grammar.

I decided I needed a way to visualize what I was working with, since the compiler side of things isn't far enough along to write out code yet. I looked up a program called GraphViz and added a command-line argument to generate dot grammar files from the language.

The result is the following for Number and String; the actual .gv file is located here. I might note: I've extended Number to allow for bases 2, 12, and 27.

The main reason I did this is that I needed a way, other than the text output I generate for state->state transitions, to see what the resulting structure of a given automaton was. I found a bug or two in the process. Insight is more than welcome.

Original Question

Going back to your original question: you have productions in your grammar that are being used like subroutines and you were looking for a simple way to keep track of how to recombine the sub-results.

If you pick a test-case (say for a number) that hits such a sub-routine and work out the trace for the automaton - what sequence of edges do you walk across and how does it relate to the resulting value that you want to build? If you store those edges / values in a list (basically a stack that you will only push to) then how does reading them off in reverse relate to the value that you want to construct?

Re: Original Question

The basic idea is constructing the state machine like you usually would: you have an input character, and each iteration of the machine determines the next state. In intermediate steps for a capture, you push the data into a buffer, and once the capture enters its final state you mark the subexpression as matched.
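
As a minimal sketch of that flow (CaptureSlot and the method below are illustrative, not my actual types): intermediate states belonging to a capture append the character to that capture's buffer, and hitting the capture's final state flags the subexpression as matched.

using System.Text;

sealed class CaptureSlot
{
    public readonly StringBuilder Buffer = new();
    public bool Matched;
}

static class CaptureStep
{
    // Called on each transition that belongs to a named capture.
    public static void Advance(CaptureSlot slot, char input, bool isFinalState)
    {
        slot.Buffer.Append(input);   // accumulate the capture's text
        if (isFinalState)
            slot.Matched = true;     // the subexpression completed at least once
    }
}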

Using GraphViz has greatly helped my understanding of how the machines work. In constructing these digraphs, I've realized that certain automata can be reduced when you see an inefficiency. If, for example, you're matching a large sequence of 'modifiers' for a struct that has a large set of valid ordered elements (say, 50+), and your concern is simply to capture what was found and ensure it's in the proper order, then how you get to the end state of a given set doesn't matter, so you can reduce that set much further. To help with this, I added a new modifier character for elements, '$'.

Take StructDeclaration for example. It's defined like so:

StructDeclaration                         ::=
    Attributes:Attributes;?
    (
        TupleR1<Modifiers.New, "partial", "private">  |
        TupleR1<Modifiers.New, "partial", "public">   |
        TupleR1<"internal", Modifiers.New, "partial">
    ):Modifiers;?$
    "struct"
    Identifier:Name;
    ClassOrStructHeader<No>
    '{'
        StructMemberDeclaration:Members;*
    '}'                                     ;

Previously the automaton for matching a struct was quite complex; however, after adding the modifier (and code to interpret its function), it reduced quite a lot while maintaining the same functional behavior.

As far as the production rules go: presently I've constructed a set of information that denotes each path into and out of a given rule's states. So if you're on Expression, I can look and see what follows it by checking each path that entered the rule and what occurs in that production upon accepting the target rule.

My simple idea for accepting left recursion is to analyze its structure: any valid left-recursive rule will have a variation which is not left recursive. Write the resulting parser to check those paths first, then try again with the left recursion enabled; when you first parse it, you condense the result onto the stack as a symbol, so the look-ahead stack retrieves that rule as what's 'next'.

If you find that after that symbol there's no valid entry, the parse has completed for that production. Suggestions on this are welcome -- I currently use this approach in the conditions associated with #if directives, though it lacks finesse, since that early version of the parser was written when my experience level was ... much less, so my options were limited when I wrote it in.
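
To illustrate the idea with the classic toy case (this is a standalone sketch, not the generated parser): for Expr ::= Expr '+' Digit | Digit, the non-left-recursive alternative is parsed first as the seed, then the left-recursive alternative is re-applied for as long as the input allows, folding the result so far back in as the already-parsed Expr.

static class LeftRecursionToy
{
    // Expr ::= Expr '+' Digit | Digit
    public static int ParseExpr(string input, ref int pos)
    {
        int result = ParseDigit(input, ref pos);         // non-left-recursive seed
        while (pos < input.Length && input[pos] == '+')  // left-recursive alternative still applies
        {
            pos++;
            result += ParseDigit(input, ref pos);        // Expr '+' Digit, with Expr = result so far
        }
        return result;
    }

    static int ParseDigit(string input, ref int pos) => input[pos++] - '0';
}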

I also had an idea on constructing the production rules. Often when you're writing a parser, productions might share underlying structure, and some parse code, with conditions set to determine what's valid. Given that I'm already doing path analysis, I wonder if I could allow you to specify conditions for a given path's inclusion. An example is ProductionRules and Templates in OILexer: they share an underlying structure on the individual rules; the only difference is that templates have a different set of directives that are allowed (for defining rules, adding productions to rules, throwing errors when the data presented to the template is bad, and so on).