Parser that allows syntax extensions

I'm planning to write a compiler that would allow syntax extensions (a kind of compile-time metaprogramming). I don't want a very powerful system, so I've thought about just having:

{syntax: while (condition) do code}
while (condition, code) => // actual execution

and replacing every pattern that matches the syntax with a call to the function.
However, I don't know where to start to get the lexer and parser running, because the usual tools such as Flex/Bison or ANTLR (I would like to write the compiler in C#) don't seem to allow this.
Could you give me any direction on where to go next? I've also read that Scheme or Haskell might be better languages for this task. And of course, I'm open to any suggestions about the idea itself and how to implement it.
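The pattern-replacement idea can be prototyped at the token level before building a real parser. Here is a minimal, hypothetical Python sketch of the rewriting step for the while example above (a regex stands in for proper pattern matching, so it does not handle nested constructs):

```python
import re

# Rewrite  while (<condition>) do <code>;  into a plain function call
# while(<condition>, <code>);  -- a toy stand-in for the real matcher,
# which would work on a token stream and handle nesting properly.
PATTERN = re.compile(r'while\s*\((?P<cond>[^)]*)\)\s*do\s*(?P<code>[^;]*);')

def rewrite(source):
    return PATTERN.sub(r'while(\g<cond>, \g<code>);', source)
```

For instance, `rewrite("while (x < 3) do x := x + 1;")` yields `"while(x < 3, x := x + 1);"`, which is then an ordinary call the compiler already knows how to handle.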


If You Don't Mind Writing Your Parser in C++...

...let me highly recommend Spirit and its dynamic parser support.

dynamic parser != dynamic rule composition

I've played around with a very similar idea, but still haven't found a suitably flexible parser generator.

Spirit is what comes closest, but unfortunately it does not allow you to generate parsers dynamically: parsers are always constructed statically at compile time.

Spirit's dynamic rules allow you to control rule continuation (the 'rhs' of grammar rules) according to some dynamic (runtime) condition, but the target rule still has to be preconstructed at compile time.

To be able to create arbitrary grammars at runtime with Spirit, the target grammar generator has to be modeled by hand; Spirit just provides the parsing primitives in such an application.

I'd be very interested in a parser generator that allowed parsers to be created at runtime simply by feeding it the corresponding grammar.

The best solution I have at the moment would be to recompile the grammar at runtime and then dynamically load the resulting objects.


(Shameless plug warning)

The hadrian parser library for Java allows parsers to be generated at runtime and extended during parsing. There's a demo of dynamic parser generation here.

Strategy when Syntax Extensions are in your own code

I mean the case where an extension is used in your code but its actual definition is somewhere else; the code may already have been parsed... I'd just like to know whether Spirit allows this, or whether there is any paper/document that discusses it.

Where the Extension Is

With Spirit, it's impossible for the extension to be anywhere other than "your own code," since what Spirit does is allow you to write your parser entirely in C++, employing what amounts to a (questionable, due to fixed C++ operator precedence rules) embedding of EBNF in C++ syntax. It has to be seen to be believed.

In any case, perhaps you could post a concrete example of what you'd like to be able to do, and then we can discuss which tools, whether Spirit or otherwise, will help accomplish the goal.

Some examples from the language

The syntax is not very well thought out, but here are some examples of what I have in mind:

type Natural
[symbol: +]
operator add (a : Natural, b : Natural) => // code
[syntax: a, "!"]
operator factorial (a : Natural) => // code

Then, for any symbol specified, these constructions are allowed (of course, the symbol need not be just one letter; it could be "div" or "mod"):

a + b; // would call Natural.add(a, b)
(+ a b c); // would call Natural.add(Natural.add(a, b), c)

"syntax" ables to use a set of tokens to call a function instead of the function names. "syntax:" specifies that tokens, and arguments are called by their name. So this code is allowed:

a!; // would call Natural.factorial (a)

What I don't need is meta-programming as found, for example, in Nemerle, where types can be manipulated during compile time or so on. I don't mind restricting the allowed tokens to non-keywords and a set of symbols, for example.
Another thing I need to do is to recognize $ followed by a single letter as a token, because it has a special meaning. For example, Vector-$N defines a type with a parameter that can be replaced by any integer: Vector-2, Vector-3, and so on, thus changing the semantics.
(It would be better if I didn't need C++, because I haven't programmed much in that language, but if it's the only option, I don't mind.)

Which languages do you know

Which languages do you know well? I've not actually done this in Haskell, but I know how I'd go about it using Parsec and a state containing an appropriate structure full of parsers. I've done something related (there was no need to generate parsers by parsing, but that part is trivial; it's incorporating them into the existing parser for the language as a whole that takes thinking). I imagine other functional languages have parser-combinator libraries capable of a similar solution.

Parsec sounds like the thing

Is it possible to compose Parsec parsers at runtime?

Entering wishful thinking, I propose the following example: the target grammar might, without knowing so at compile time, require the following parser:

testOr  =   string "foo"
        <|> string "bar"

Given a list of strings obtained at runtime from the grammar description, can these somehow be folded into an equivalent parser?

testStrings = ["foo", "bar"]
testOr = foldr1 (<|>) (map string testStrings)


This would yield the perfect solution to the problem I am stuck with, and to the original poster's problem!

Seems good

Yes, Parsec seems to be a very good way to handle this problem. There is a "buildExpressionParser" function, but I can't quite understand how it works. Does anyone know?

Compose Away

Sure, you can build the grammar while parsing with Parsec. Here's a trivial example that matches two identical words separated by spaces: read the first word, then build a parser that looks for that exact string.

wordTwice = do word <- many1 letter
               spaces
               string word

This is one of the major differences between monadic parser combinators (like Parsec) and "arrow-style" combinators (like the Utrecht Parser Combinators).

With monadic parsers, the grammar used to parse later input is a function of earlier results, as you can see in the type of the sequencing
operation: Parser a -> (a -> Parser b) -> Parser b.
With an arrow-style parser the structure of the grammar must be fixed, and only the values computed by later parsers can be a function of earlier results. You can see this in the type Parser (a -> b) -> Parser a -> Parser b of the sequencing function. On the positive side, the fixed grammar means the library can analyze the grammar, to do things like automatic left-factorization.
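The difference can be illustrated with toy monadic combinators in Python (hypothetical helpers mirroring the wordTwice example): `bind` has exactly the sequencing type above, and the second parser is computed from the first one's result.

```python
# A parser maps an input string to (result, rest), or None on failure.
def letters(inp):
    # one or more alphabetic characters (like Parsec's many1 letter)
    i = 0
    while i < len(inp) and inp[i].isalpha():
        i += 1
    return (inp[:i], inp[i:]) if i > 0 else None

def spaces(inp):
    return ("", inp.lstrip(" "))

def string(s):
    def parse(inp):
        return (s, inp[len(s):]) if inp.startswith(s) else None
    return parse

def bind(p, f):
    # Parser a -> (a -> Parser b) -> Parser b: the grammar used for
    # later input is a function of the earlier result.
    def parse(inp):
        r = p(inp)
        if r is None:
            return None
        value, rest = r
        return f(value)(rest)
    return parse

# wordTwice: read a word, then demand that exact word again.
word_twice = bind(letters, lambda w:
             bind(spaces, lambda _: string(w)))
```

Because `string(w)` is constructed inside the continuation, the grammar for the second word literally does not exist until the first word has been read: this is the "higher-order parsing" an arrow-style library rules out.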

What about parsing code before the rule

That approach seems very easy, but my problem is a bit harder: if I find a symbol, I would also need to recognise it in code that has already been parsed, and not raise errors for those items.
The only way I see is to make a first pass collecting every "syntax" or "operator" rule and building a table (that table should also contain references to the method or function that implements each piece), and then parse everything again. And things get even worse if I use overloading based on argument types :S

Whether you plan to allow

Whether or not you plan to allow grammar definitions to follow their use in the code, uniquely identifying grammar definitions will make it easy to process them on the fly or in a separate first pass.

Overloading and similar semantics should not be confused with the parsing stage; they will have to be handled separately, possibly by post-processing the generated abstract syntax tree.

You might also have a look at precedence parsers. Your symbol definitions will require a way to specify associativity.
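A precedence-climbing parser is a compact way to get this: the operator table is ordinary runtime data, so user-defined symbols with their precedence and associativity could be registered on the fly. A toy Python sketch (atoms only, no parentheses; the table entries are invented for illustration):

```python
# symbol -> (precedence, associativity); in the language being
# discussed, [symbol: ...] declarations would populate this table.
OPS = {"+": (1, "left"), "-": (1, "left"),
       "*": (2, "left"), "^": (3, "right")}

def parse_expr(tokens, min_prec=1):
    # tokens: e.g. ["2", "+", "3", "*", "4"]; consumed in place.
    lhs = tokens.pop(0)                       # an atom
    while tokens and tokens[0] in OPS:
        op = tokens[0]
        prec, assoc = OPS[op]
        if prec < min_prec:
            break
        tokens.pop(0)
        # Left-associative operators demand strictly higher precedence
        # on the right; right-associative ones allow equal precedence.
        next_min = prec + 1 if assoc == "left" else prec
        rhs = parse_expr(tokens, next_min)
        lhs = (op, lhs, rhs)
    return lhs
```

With this table, `2 - 3 - 4` groups as `(2 - 3) - 4` while `2 ^ 3 ^ 4` groups as `2 ^ (3 ^ 4)`, which is exactly the associativity information the symbol definitions need to carry.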

You just parse it in two

You just parse it in two phases. First, a shallow parse, to find the syntax rules. This might tokenize everything, but it leaves much of the program unparsed. The output from this phase might contain a lot of "lists of tokens" -- that is, stuff you haven't tried to parse yet.

Then create the parser for the second phase. Piece of cake. ;)
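A sketch of the first phase in Python (the `[syntax: ...]` surface form is borrowed from the examples earlier in the thread; everything else is deferred as raw token lists):

```python
import re

def first_pass(lines):
    # Collect user syntax definitions; defer everything else as
    # unparsed token lists for the second phase.
    rules, deferred = [], []
    for line in lines:
        m = re.match(r'\[syntax:\s*(.*)\]', line.strip())
        if m:
            rules.append(m.group(1).split())
        else:
            deferred.append(line.split())   # "list of tokens", not parsed yet
    return rules, deferred
```

The second phase would then build the full parser from `rules` and run it over the token lists in `deferred`.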

The plot thickens

What if your syntax-redefinition rules can add new syntax for redefining syntax?

Monads never cease to amaze...

On a different note: the Utrecht parser combinators' homepage states that it is possible to dynamically create parsers with the library. I take it that they mean a different kind of dynamic parser generation?

They just don't mean being

They just don't mean being able to build and use the parser mid-parse, unless I've misunderstood. The idea with arrow-style combinators was to prevent "higher-order parsing" - which is why there's no ArrowApply instance. Doing so makes the parser amenable to analyses that wouldn't be possible up against all the usual Turing-completeness issues. But returning parsers isn't a problem, only attempting to parse something with the returned parser. Passing parsers 'inwards' doesn't change a thing; it just gives you a simple macro-like facility when writing the parsers - it all evaluates to the same thing at the end of the day.

If you're happy doing two passes, it should work fine!

Syntax Directed Compilers?

It sounds like you might be talking about what is sometimes called a "Syntax Directed Compiler" or a "Syntax Directed Language". Google it and see if that's the sort of thing you're looking for. I've compiled some links at:

I understand that this was something of a hot topic in the 1960's and that there were even conferences on the subject.

Sounds like camlp4

This sounds similar to Objective Caml's preprocessor, camlp4. It implements the core grammar as a recursive descent parser. User code can extend the grammar by adding rules to the parser (albeit via a more heavyweight syntax than yours). All rules added within the same precedence level have common prefixes factored out so that the parser can operate with only one lookahead token and no backtracking. Because these rules can be stored as a tree rather than as a state table this simplifies runtime extension. A better description can be found here in the camlp4 manual.
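The common-prefix factoring can be sketched with an ordinary trie in Python (hypothetical rule format; camlp4's real machinery is considerably more involved): rules sharing a prefix are merged, so a single lookahead token always selects the next trie edge.

```python
def build_trie(rules):
    # rules: {tuple_of_tokens: action}; rules sharing a prefix share
    # the corresponding trie nodes, so no backtracking is needed.
    trie = {}
    for tokens, action in rules.items():
        node = trie
        for t in tokens:
            node = node.setdefault(t, {})
        node[None] = action          # None marks "a rule ends here"
    return trie

def match(trie, tokens):
    # Follow edges while the next token is known; return the action
    # recorded at the node reached (None if no rule ends there) and
    # the number of tokens consumed.
    node, n = trie, 0
    for t in tokens:
        if t not in node:
            break
        node = node[t]
        n += 1
    return node.get(None), n
```

Because the rules live in a tree rather than a compiled state table, adding a new rule at runtime is just another `setdefault` walk, which is the property that makes runtime extension cheap.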

Did you take a look at Nemerle?

It seems to support what you want out-of-the-box:

Maybe you can get some ideas there.

My original inspiration

Indeed, I've been contributing to the Nemerle project for some months. I like it because the metaprogramming system is very powerful, but what I'm looking for is not rewriting, just a nicer syntax for calling functions.

Generally-applicable syntactic sugar

If you don't need a general purpose AST-rewriting capability, a little syntactic sugar might be enough.

Ruby has lots of these little tricks. For example, functions that take a closure as the last parameter can be invoked as:

func_name(param1, param2, param3) { |param4|
  code_block
}

Scala's automatic closure construction will take care of the conditional part of your "while" example.

Another such trick is Haskell's backtick notation, which turns any two-argument function into an infix operator. If, for example, the function "contains" takes two arguments:

contains my_list my_element       -- normal invocation
my_list `contains` my_element     -- syntactic sugar

These types of features can go a long way towards making things easier to read. Both Ruby and Haskell have many libraries that use features like these to create the "feel" of a domain-specific language.

However, if you really do want to do syntax rewriting there's a research paper and implementation for Java called Maya. It's similar to Nemerle's macros. The paper has a detailed description of how the parser dynamically incorporates new syntax rules.

User defined Syntax in Felix

Felix supports limited user-defined on-the-fly syntax extensions, in a way similar to camlp4. The grammar extension is more lightweight; however, the semantic rules are more convoluted. The technology used to implement this is written in Ocaml and is quite tricky, because the core grammar and semantic rules use a standard Ocamlyacc LALR1 parser, but the extensions are table-driven and provide recursive descent parsing.

As an example, the standard library contains the following definition of the whilst statement:

  whilst expr do statements done; =>#
  macro {
    macro lab1 is new;
    macro lab2 is new;
    lab1:>
      if not _1 goto lab2;
      _2;
      goto lab1;
    lab2:>
  };

The grammar production couldn't be simpler: whilst is made into a keyword, do and done are already keywords, and expr and statements are magic keywords which invoke the core LALR1 parser recursively for statements and an expression, respectively. The recursion is open in the sense that it includes user added grammar entries .. including the one currently being defined.

The action rule is to emit the parsed user code following the =># symbol, replacing symbols _1, _2 etc with the parse trees of the corresponding nonterminals in the grammar.

As you can see in the example, the resultant parse tree usually involves macro terms, which invoke the syntax macro processor in the subsequent processing phase. The macro processor is programmed to reduce the syntax tree to pre-defined terms which the following stages of compilation are programmed to handle.

Felix also supports user defined binary infix operators and unary bracket operators.

The implementation is interesting because it does not compromise the high performance of the underlying LR parser engine, and also because of the extreme 'hackery' required to provide the Ocamlyacc parser with state data, without using any global variables.

Luckily, the Ocamllex lexer does allow state variables. So what happens is that the lexer, which also handles preprocessor directives such as the #statement directive above, keeps track of keywords in a symbol table. The table is augmented whilst lexing by user-defined keywords, and the entire content of the user-defined syntax extension is embedded as an attribute in the keyword table entry. In order to capture the correct environment, it is stored as an executable closure inside the corresponding token when the keyword is lexed.

Each file is lexed entirely before parsing. Then the parser is fed the token stream via a custom lex buffer object. When the parser sees a statement starting with a user defined keyword, it invokes the embedded closure, and returns the resultant parse tree.

Under the hood, the closure parses the text interpretively, using the user's grammar production, in the process of which the Ocamlyacc parser can be invoked recursively on seeing the magic statements or expr symbols.

There is a hack here: Ocamlyacc, like all LR1 parsers, can 'overshoot' by one token before doing a reduction. To fix this problem, the Ocamlyacc grammar defines special terms for terminated expressions, which never overshoot (well, actually they always overshoot expressions by exactly one token). The terminator is pushed back into the token stream so it can be rescanned by the parent parsing process. This is a hack because it relies on experimentally observed but undocumented Ocamlyacc behaviour.

I should note that the biggest problem here is that recursive descent executable parsers have NO error handling capabilities. Any failure propagates back to the last choice point and the parser simply tries the next alternative. Consequently, the only possible error you can get is a Syntax Error, and it always occurs right at the beginning of the token stream.

LR parsers, on the other hand, always pinpoint precisely the first token causing an error.

To solve this problem, at least partially, Felix keeps track of whether there actually is an earlier choice point, and fails immediately if there is not. Although this is still theoretically a very weak diagnostic, in practice it is miles better than simply knowing that your whilst statement has an error in it .. somewhere! In particular, errors reported by recursive calls to the Ocamlyacc engine throw exceptions which carry precise error locations and perhaps other diagnostic information, and the RD parser can propagate this error if there is no prior choice point.

Consequently it is best if user defined statements all start with distinct keywords, otherwise only errors parsing the last tried alternative can carry sensible diagnostics .. which is unfortunate if the programmer intended one of the other grammar productions to apply.

All in all, this feature allows both ordinary programmers and the Felix developers to easily add pleasing syntactic sugar to the language without rebuilding the compiler. It is intended as the precursor of a more ambitious feature: the ability to program arbitrary Domain Specific Sub Languages (DSSLs) directly in library code.

Parsec has error handling capability

All parsec implementations (Haskell, Java, C#, Ruby) have error handling capabilities, though they are all recursive descent.

Common Lisp's Reader Macros

You might like to take a look at Common Lisp's reader macros. Whenever the Lisp reader encounters a certain character or sequence of characters (such as "#c" or "{"), it calls a specified function to read the next expression. That way you could write "#c(1 2)" and have it read as "(complex 1 2)", or set curly braces to act like parentheses.

It's not as general and powerful as what you describe, but it is simple and pretty flexible.
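The mechanism fits in a few lines of Python: a dispatch table maps a macro character to a reader function that takes over the input. This toy reader (loosely modeled on Lisp's set-macro-character, with invented names) returns a value plus the next input position; here curly braces are simply registered to read like parentheses.

```python
def read(text, i=0):
    # Read one expression starting at position i; return (value, next_i).
    while i < len(text) and text[i] == ' ':
        i += 1
    ch = text[i]
    if ch in MACROS:                  # reader-macro dispatch
        return MACROS[ch](text, i + 1)
    j = i
    while j < len(text) and text[j] not in ' (){}':
        j += 1
    return text[i:j], j               # a plain atom

def read_delimited(close):
    # Build a reader that collects expressions until `close`.
    def reader(text, i):
        items = []
        while True:
            while text[i] == ' ':
                i += 1
            if text[i] == close:
                return items, i + 1
            item, i = read(text, i)
            items.append(item)
    return reader

MACROS = {'(': read_delimited(')'),
          '{': read_delimited('}')}   # braces act like parentheses
```

Registering a new surface syntax is just another entry in `MACROS`, which is what makes the reader-macro approach so easy to extend.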

Reader macros are cool!

In fact, in muSE - an embedded Scheme dialect we use at my company - macros expanded at read time are the only kind of syntax transformers. Their tail-first evaluation gives a nice consistency and predictability to the result. They are quite effective when building DSLs.

parsec has Java/C#/Ruby port

You can check out

Parsec is implemented in Java, C# and Ruby so far.

Thanks to Java's syntax, jparsec is the most verbose of them all. Nothing prevents you from writing a dynamic grammar if you don't mind the ugly anonymous classes everywhere.

C#'s version is much better using anonymous delegate.

Ruby Parsec is the most pleasant one. In practice, I find it even easier to use than Haskell Parsec.

What Niklaus Wirth has to say...

From another current LtU discussion: Good Ideas, Through the Looking Glass, here's what Niklaus Wirth has to say on the subject:

5.2. Extensible languages
The phantasies of computer scientists in the 1960s knew no bounds. Spurned by the success of automatic syntax analysis and parser generation, some proposed the idea of the flexible, or at least extensible language. The notion was that a program would be preceded by syntactic rules which would then guide the general parser while parsing the subsequent program. A step further: The syntax rules would not only precede the program, but they could be interspersed anywhere throughout the text. For example, if someone wished to use a particularly fancy private form of for statement, he could do so elegantly, even specifying different variants for the same concept in different sections of the same program. The concept that languages serve to communicate between humans had been completely blended out, as apparently everyone could now define his own language on the fly. The high hopes, however, were soon damped by the difficulties encountered when trying to specify, what these private constructions should mean. As a consequence, the intreaguing idea of extensible languages faded away rather quickly.

May not be entirely relevant

The high hopes, however, were soon damped by the difficulties encountered when trying to specify, what these private constructions should mean.

I think Wirth is talking about languages where the meaning of these new syntax constructs has to be specified separately. In the case of Lisp macros, however, the meaning is implicit: the result of applying the syntax transformer is expected to be a well-formed Lisp expression that is evaluated as usual. Even the humble #define is very useful as a way to make different sections of code visible depending on compile conditions.

Professor Wirth's Reply

Here's what Professor Wirth said when I asked him if he meant that it was difficult to specify to computers or if it was difficult to specify to humans:

Both. At the time, we did not know how to specify semantics, except by other programs, algorithms, perhaps in a more primitive language. But this is a vain effort. See my efforts to formalize and define the language Euler (Comm. ACM Jan. and Feb. 1966)

Here are links to the Euler articles that he mentions: Part I, Part II

hopefully on topic

D allows lazy parameters in function declarations.

The expression in that position is wrapped up in a delegate and called whenever the parameter is evaluated.

void While( lazy bool cond, lazy void code )
{
    for ( ; cond() ; ) code();
}
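The same effect can be imitated in any language with first-class functions by passing explicit thunks; a Python rendering of the idea (names invented for illustration):

```python
def While(cond, body):
    # cond and body are zero-argument functions ("thunks"); calling
    # them re-evaluates the expressions each iteration, which is what
    # D's `lazy` parameters arrange automatically.
    while cond():
        body()

n = [0]                               # a mutable cell for the counter
While(lambda: n[0] < 3,
      lambda: n.__setitem__(0, n[0] + 1))
```

After the call, `n[0]` is 3: the condition and body were both re-evaluated on every iteration, even though `While` is an ordinary function.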


D-Expressions claim to bring Lisp-style macros to languages with more conventional syntax. The basic idea is to extend Lisp's simple syntax

sexpr ::= atom | (atom*)

with a bit more variation, however ensuring that there always is a defined tree structure. I.e., instead of just matching pairs of "(" and ")" you also allow matching pairs of "[", "]" (as Reader macros do in Lisp) but also "define", "end" or "macro-name", "end".

The last rule has quite some implications, namely that the parser must maintain an environment of macro names, which introduces some staging problems (i.e., the code of the macro function must be available, so it must usually live in a different module).

Dylan has this kind of macro as a pattern-matching facility, and I think you can get quite far with it. D-Expressions allow for even more.

Permissive structural grammar

I have had a go at this kind of thing myself, where I wanted macro definitions to be part of the main language grammar, rather than being pre-processor instructions, as in C macros. This would allow macro definitions to be syntax-checked at the point of definition, rather than at the point of expansion.

The basic problem, as I see it, is that no worthwhile grammar will allow arbitrary sequences of tokens. There has to be some structure. Jorend described a two-phase parser above, where the first phase produces little more than lists of tokens. But that is essentially a lexer, not a parser.

Languages of the Lisp family solve the macro-in-grammar problem by having a minimalist and permissive grammar that describes the structure of the AST, without special keyword constructs. Special forms, e.g. conditionals, are merely instances of a general structure, with a particular symbol at the head of the list.

In a language that has keywords and special constructs, one can still define new constructs, provided that the grammar is permissive about what counts as an expression. For an extensible imperative language, one could allow blocks as expressions, i.e. you can pass { statement; ... } as an argument to a function. It is quite possible to define this sort of thing in Yacc or a similar parser generator.

Please, I need some

Please, I need some help: someone to review a paper I'm working on, called The Ameba Paradigm. I'm a retired programmer with no one among my friends and family who can review it.

Among other things, the paper features a method for designing an imperative language with an adaptable syntax. It avoids “the difficulties encountered when trying to specify, what these private constructions should mean,” (Niklaus Wirth). Furthermore, this adaptable syntax solution requires a persistent object-base/runtime data structure.

I've searched the Internet for a similar solution, and found none. Thus, I believe my paper describes a novel method of implementing an imperative programming language with adaptable syntax and a persistent object base. That means, some intellectual property is at risk. Being retired with a modest, fixed income means no money for patents. What options do I have?

A draft of the abstract is ready for review, but I will limit the number of reviewers. After reviewing the abstract, a reviewer may request the paper for review. However, before sending my paper to anyone, I need to find out a little about intellectual property protection and prospective reviewers. I am not looking for private information. I will read blog posts, a personal web page, a resume, publications by the reviewer, and whatever a prospective reviewer offers.

I hope no one is offended by my trying to get to know a prospective reviewer, just a little.

To anyone who offers to review my paper: Thank you! Thank you! Thank you!

2010-01-23 - I would like to take a look at your paper

Hi Ed Earl Ross,
Please, if your paper is already published, I would appreciate it if you could point me to the publication. Otherwise, I would like to see your work. I worked on the subject of adaptivity (automata- and grammar-related) while pursuing my PhD; a joint paper with my PhD advisor is available on Google Books:

Adaptive Automata - A Revisited Proposal
João José Neto and César Bravo
LNCS 2608, Implementation and Application of Automata: 7th International Conference, eds. Jean-Marc Champarnaud and Denis Maurel

César Bravo