User loginNavigation |
Writing a Compiler Compiler...Greetings Folks, I've posted here in the past about a compiler framework I'm writing. I've also been *trying* to write a Compiler Compiler (with respect to the tokenizer/parser automation), the issue I've run into is the area of Named Captures. When I started this project, I didn't know a whole lot about parsers in general, but I've learned quite a lot in the process. A little context might prove helpful: ExternAliasDirective ::= "extern" "alias" Identifier:AliasName; ; and Tokens: Identifier := '@'? IdentifierOrKeyword; UnicodeEscapeSequence := "\\u" HexChar{4}; IdentifierOrKeyword := IdentifierStartCharacter IdentifierPartCharacter*; IdentifierStartCharacter := LetterCharacter | '_' | UnicodeEscapeSequence ; IdentifierPartCharacter := LetterCharacter | CombiningCharacter | DecimalDigitCharacter | ConnectingCharacter | CombiningCharacter | FormattingCharacter | UnicodeEscapeSequence ; LetterCharacter := [:Lu::Ll::Lt::Lm::Lo::Nl:]; CombiningCharacter := [:Mn::Mc:]; DecimalDigitCharacter := [:Nd:]; ConnectingCharacter := [:Pc:]; FormattingCharacter := [:Cf:]; Tokens allow for more flexibility in the actual patterns, such as character ranges (which can contain Unicode classes). Tokens aren't recursive so they can refer to other tokens, so long as the other tokens, within the tokens scope or the token itself, doesn't refer back to itself. For brevity if you have a large number of tokens that you want to classify under a single name (like 'keywords') you can simply create an alternation between the terminals, giving each a name, and so far it's smart enough to detect these groups and create proper state machines for them (they're a little different in that they don't have repetitions, character ranges, or refer to other tokens.) LinqKeywords := "ascending":Ascending; | "by":By; | "descending":Descending; | "from":From; | "equals":Equals; | "into":Into; | "join":Join; | "let":Let; | "group":Group; | "orderby":OrderBy; | "where":Where; ; If a language only refers to an alternation, like the one above, by the classification name (such as 'LinqKeywords') only one 'symbol' for the language is created; however, if one of the named elements is referenced, each of the items from the classification are given their own symbol. Differentiation between which was actually selected is still possible, but for the sake of the parser, the data is irrelevant. This is important for context awareness, since if you require 'as' to be present at a given point, but 'ascending' is another part of the possible symbols, LinqKeywords won't even enter the picture because it's not relevant. I have enough information from the file format to make these relationships, and build the state machines for the lexical analysis phase. The issue is: the lexer has no captures, and thus, if I was even able to get the parser working, without a viable means to capture the information it's validating. Does anyone here have any suggestions on where to go next? I'm using a deterministic state machine for the lexical analysis phase, it does the usual set comparisons necessary to differentiate which path to follow, the context associated to the symbols in play is available, but the issue is I don't know how to properly represent repetitions, and the like. Is my approach wrong, or is a non-deterministic approach more feasible? For further context: I used the standard programming language means of defining a string to represent a simple terminal symbol because I'm self-taught, and when I started this project (a ... while ago,) it seemed the most 'natural'. Case sensitivity is easily represented by using '@' before the string to denote case is unimportant. For a state machine this representation is trivial. Edit: I guess information about goals is helpful. Edit 2: As far as automation goes, I'm not unfamiliar with the process, I've written a program to generate a ECMA-335 compliant metadata parser (minus Blob signatures, that had to be done by hand) because the format was straight forward, and the variable sizes of the fields within the tables was handled through a simple state machine created by analyzing the references within the tables and assigning each potentially variable-length reference type a significant bit in the state machine's internal state. By Alexander Morou at 2013-04-24 08:18 | LtU Forum | previous forum topic | next forum topic | other blogs | 5608 reads
|
Browse archives
Active forum topics |
Recent comments
27 weeks 1 day ago
27 weeks 2 days ago
27 weeks 2 days ago
49 weeks 3 days ago
1 year 1 week ago
1 year 3 weeks ago
1 year 3 weeks ago
1 year 5 weeks ago
1 year 10 weeks ago
1 year 10 weeks ago