Guppy: trying to make parsing simple, fun, and powerful

I've been working on designing a parser definition language called Guppy. I have not implemented it yet, as the syntax and semantics of it are a moving target subject to my whims. My knowledge and insight is limited, so I want to discuss the design of Guppy with a community.

Edit: I read the policies document after posting rather than before (this is my first post). Rather than having a mega-post here, I delegated most of its content to my blog.

In a nutshell, Guppy's mission statement is to make parsing simple, fun, and powerful. My ideas for its feature set are in flux; the list currently includes:

  • Syntax strongly influenced by traditional regular expressions
  • Indentation-based syntax
  • Ability to nest production rules
  • Ability to isolate sets of production rules (e.g. a set for the "parser", and a set for the "lexer").

See here for an in-depth description of my ideas so far along with snippets of Guppy code.

To start discussion, I'll introduce the biggest fundamental problem I'm currently faced with: should ambiguous grammars be allowed in Guppy? Consider the following:

expr:
    number
    identifier
    ( expr )
    expr '*' expr
    expr '/' expr
    expr '+' expr
    expr '-' expr

I think this is the most obvious way to define a grammar for basic arithmetic, but there are a couple problems. First of all, the parser doesn't know the order of operations here, and telling Guppy the order of operations might require non-obvious syntax. Second of all, the grammar really isn't correct. One could produce "1+2*3+4" by following expr '*' expr first; that would be an incorrect derivation.

The problem is that allowing ambiguous grammars can make writing grammars easy, but it can result in a less intuitive syntax and violations in correctness. How might I resolve this problem elegantly?

Re: What about more complex rules?

The question for me is how to make the expression more readable without defining lots of auxiliary non-terminals which find a representation in the parse-tree.

I feel that in Guppy, non-terminal names serve three purposes:

  • A handle for referencing a rule in another part of the grammar (the obvious purpose).
  • Node names for the resulting AST that guppy produces (I guess you inferred this).
  • A form of documentation.

The third bullet point is the key to making the rule you gave more approachable (though possibly more verbose). Also, thanks to guppy's nested rules and lexical scope, the rule names can be short and sweet. How does this look?

varargslist
		: (args ',')? variadic-args
		| args ','?
	args: arg (',' arg)*
		arg: def default?
			def:  NAME | '(' tuple ')'
				tuple:  def (',' def)* ','?
			default:  '=' test
	variadic-args:  list (',' dict)?  |  dict
		list:  '*' NAME
		dict:  '**' NAME

Note that the rules under varargslist can be flattened, but the indentation helps to make it clear what rules are needed where and to indicate structure visually.

As for all those auxiliary non-terminals, in my vision for guppy, the resulting AST will typically be handled with selectors which can easily ignore node names that don't matter to the application (think CSS).

Also, you inadvertently brought up an important issue. Consider ',' '**'. In the current version of guppy (which can be found in my mind), this forms ',**', which is actually 3 character symbols, not 2. However, '**' is intended to be one symbol. One thing I don't like about bison/flex is that you have to name tokens of more than two characters (e.g. PLUSPLUS for '++'). Perhaps a string quoted in `backticks` should be treated as a non-terminal symbol name. However, there also needs to be a mechanism to indicate that operators like `**` and `++` should automatically be expanded to characters at the end of production (rather than being treated as mistakenly not defining nonterminals).