## Guppy: trying to make parsing simple, fun, and powerful

I've been working on designing a parser definition language called Guppy. I have not implemented it yet, as the syntax and semantics of it are a moving target subject to my whims. My knowledge and insight is limited, so I want to discuss the design of Guppy with a community.

Edit: I read the policies document after posting rather than before (this is my first post). Rather than having a mega-post here, I delegated most of its content to my blog.

In a nutshell, Guppy's mission statement is to make parsing simple, fun, and powerful. My ideas for its feature set are in flux; the list currently includes:

• Syntax strongly influenced by traditional regular expressions
• Indentation-based syntax
• Ability to nest production rules
• Ability to isolate sets of production rules (e.g. a set for the "parser", and a set for the "lexer").

See here for an in-depth description of my ideas so far along with snippets of Guppy code.

To start discussion, I'll introduce the biggest fundamental problem I'm currently faced with: should ambiguous grammars be allowed in Guppy? Consider the following:

expr:
number
identifier
( expr )
expr '*' expr
expr '/' expr
expr '+' expr
expr '-' expr


I think this is the most obvious way to define a grammar for basic arithmetic, but there are a couple problems. First of all, the parser doesn't know the order of operations here, and telling Guppy the order of operations might require non-obvious syntax. Second of all, the grammar really isn't correct. One could produce "1+2*3+4" by following expr '*' expr first; that would be an incorrect derivation.

The problem is that allowing ambiguous grammars can make writing grammars easy, but it can result in a less intuitive syntax and violations in correctness. How might I resolve this problem elegantly?

## Comment viewing options

### Thanks for the edit. I

Thanks for the edit. I glanced at the original message, and you seem to have some very nice ideas that I am sure several members will want to comment on. I suggest people post specific criticisms and suggestions on your site, and comment here (or on both sites) on aspects that are more general in nature.

### Nested production rules

Do the nested production rules add new semantics, or is it just a way of organizing the code? Couldn't you flatten the nested rules into global top level rules and the grammar would still work correctly?

### Nested production rules

Nested production rules don't add new semantics; they just let the user organize them in a hierarchy.

However, nested production rules can be used to denote namespaces. If you have:

number:  int | float
int:
dec:  [1-9] [0-9]*
hex:  '0' [Xx] xdigit*
oct:  '0' [0-7]*
...


You could refer to the int: rule by saying something like number::int.

### Semantics

Your "Manadatory Blog" page indicates that "functions" do add more powerful semantics to Guppy, generating languages that are not context free. Can you give an example of how they work?

Also, note that it is relative easy to describe the generation of context sensitive languages but designers typically stay away from a context sensitive mechanism because they are worried about the inefficiency of the corresponding parsing algorithms. You might want to think about the parsing mechanism in parallel with designing the generating language class. This Wikipedia page lists some parser generators and the parsing algorithms they implement.

On the issue of ambiguity, PEG grammars define language classes that are somewhat similar (but different) compared to context free and are never ambiguous. However I have seen criticism of PEG grammars on the grounds that while not ambiguous, the language defined at various stages of development and debugging isn't the one the language designer intends. So perhaps having some powerful disambiguation tools for prioritizing productions that are easy to use is a good way to go.

Your "Manadatory Blog" page indicates that "functions" do add more powerful semantics to Guppy, generating languages that are not context free. Can you give an example of how they work?

Consider this example I gave in my blog post:

node(indent = ''):
indent value '\n'
indent key ':' value? '\n' node(indent [\t ]+)


The indent = '' is a default argument, which simply means if you say node, it'll automatically be converted to node('').

Now, suppose we start with A: node(''). It expands to:

A:
value '\n'
key ':' value? '\n' node([\t ]+)


Now here's where it gets interesting. To produce (rather than parse) a node([\t ]+), first choose a string matching [\t ]+ (let's say we choose "\t"). We bind this single value to the indent argument, and expand with it:

B:
'\t' value '\n'
'\t' key ':' value? '\n' node('\t' [\t ]+)


Thus, a function binds a single expansion of the given argument to a variable before "calling" it.

Consider the non-context-free (I think) language {anbncn: n is a power of 2} (Edit: I originally said {... : n >= 1}, but that was incorrect). We can express it using this multi-argument function:

F(a='a',b='b',c='c'):
F(a a, b b, c c)
a b c


By the way, you pointed out a discrepancy in Guppy. I made a mistake with this definition of the list function:

list(item, delimiter = ','):
item (delimiter item)*


The problem is that item is bound to one string, meaning that for functions to work as described before, list(item) merely expands to productions of a single string (e.g. int a, int a, int a, not varying strings all matching the given expression (e.g. int a, int b, int c). Users (myself included) might expect list to work like a macro, but it doesn't.

If isolated production rules are allowed, there is an interesting loophole. Instead of having to manually define a list like this:

A: argument (',' argument)*
argument: ...


We can do this:

A: list(argument) -> (
argument: ...
)


What this means is: A expands into a string of 'argument' tokens (which are terminal because no argument: rule is defined in the outer scope. Then, this string is passed into another scope where those argument symbols can be expanded one by one.

I'll have to consider the implications of nesting "isolated production rule" scopes.

### Backtracking and termination

Are you thinking that the automata you would generate for the language {anbncn: n >= 1} would work by first trying to match abc, then trying to match aabbcc, then trying aaabbbccc, and so forth until it matched? What would cause it to terminate if the input string didn't match?

### Backtracking and termination

Are you thinking that the automata you would generate for the language {anbncn: n >= 1} would work by first trying to match abc, then trying to match aabbcc, then trying aaabbbccc, and so forth until it matched? What would cause it to terminate if the input string didn't match?

I'm thinking in terms of producing strings, not matching them. Hence, given the rule:

F(a,b,c):
F(a a, b b, c c)
a b c


If we say F('x','y','z'), a generator can choose between emitting 'xyz' or calling F('xx','yy','zz'). Oops. This produces {xnynzn where n is a power of 2}. I have updated my parent post to correct this.

What I was trying to do is:

F(a,b,c): G(a,b,c)
G(as,bs,cs):
G(as a, bs b, cs c)
as bs cs


If we plug in F('x','y','z'), it calls G('x','y','z'), which can either expand to G('xx','yy','zz') or 'xyz'. The G('xx','yy','zz') can either expand to G('xxx','yyy','zzz') (because the free variables a, b, c are bound to the outside function F, they stay constant through the recursion on G) or 'xxyyzz'.

### I'm thinking in terms of

I'm thinking in terms of producing strings, not matching them.

Matching/parsing is a much harder job in every sense.

### If Guppy allows numerals to

If Guppy allows numerals to be section identifiers one preserves the regularity of the syntax and can assign precedences:
expr:
number
identifier
'(' expr ')'
10:                 # bigger number means higher precedence
expr '*' expr
expr '/' expr
1:
expr '+' expr
expr '-' expr

I'd omit actions / functions from the description and keep the grammar formalism small. Each parser generator has its own peculiar ways to deal with context sensitivity and turning grammars into full programming languages makes them inevitably non-portable across pgens. It is better to provide an OSS Guppy-Reader ( like an XML parser ) which deals with signifcant whitespace than forcing all pgen authors to implement complex functionality. This way also LL(1) or other simple parsing schemes can benefit.

One more question. Is it allowed to write a production rule this way:

number:
int
float

int:
dec:  [1-9] [0-9]*
hex:  '0' [Xx] xdigit*
oct:  '0' [0-7]*
...


or must "int | float" be thunked behind "number:" ?

### If Guppy allows numerals to

If Guppy allows numerals to be section identifiers one preserves the regularity of the syntax and can assign precedences...

I really like that idea.

One more question. Is it allowed to write a production rule this way ... or must "int | float" be thunked behind "number:" ?

In this case, you could simply say:

number:
int: ...
float: ...


Here, since nothing is given after number:, it defaults to |-ing the child rules together.

However, consider this definition of float:

float:  dec | hex
e:  [Ee] ('+'|'-')? [0-9]+
p:  [Pp] ('+'|'-')? [0-9]+
suffix:  [FLfl]?

dec:
[0-9]* '.' [0-9]+ e? suffix
[0-9]+ '.' e? suffix
[0-9]+ e suffix
hex:
'0' [Xx] xdigit* '.' xdigit+ p suffix
'0' [Xx] xdigit+ '.' p suffix
'0' [Xx] xdigit+ p suffix


The thing is, there are two kinds of rules under float: rules we want included (dec: and hex:), and rules we don't (e:, p:, and suffix: ). I've had numerous thoughts about this over time. In an old guppy syntax, I used $symbol as an "expand" operator, and$symbol: would mean both define and expand a rule (while symbol: would mean just define it). It looked like this:

$float: { suffix: [F L f l]? ;$dec: {
e: [E e] ['+' '-']? [0-9]+ ;
[0-9]* '.' [0-9]+ $e?$suffix;
[0-9]+ '.' $e?$suffix;
[0-9]+ $e$suffix;
};
$hex: { p: [P p] [\+ \-]? [0-9]+ ; 0 [X x]$xdigit* '.' $xdigit+$p $suffix; 0 [X x]$xdigit+ '.' $p$suffix;
0 [X x] $xdigit+$p $suffix; }; };  The problem here is that it's hard to know when to use$ and when not to use $behind a rule name. When I decided to stop using the$ sigil and to require string literals to be quoted, I couldn't use the expand-and-define trick as used above anymore. I considered a Python-style def keyword for rules that aren't to be expanded, but then I considered letting the user write something after the : in a definition.

My issue was how to indicate whether a rule is to be used as a child production or as a convenience rule. Saying float: dec | hex to override the default of |-ing together the child productions was what I came up with.

### What about more complex rules?

Here is a transcription of a nasty piece of Pythons grammar.

test: ...
varargslist: ((fpdef ('=' test)? ',')*
('*' NAME (',' '**' NAME)? | '**' NAME) |
fpdef ('=' test)? (',' fpdef ('=' test)?)* (',')?)

fpdef: NAME | '(' fplist ')'
fplist:
fpdef (',' fpdef)* (',')?


It's one of the rules which are hard to parse mentally and people skip reading them when they are encountered.

The question for me is how to make the expression more readable without defining lots of auxiliary non-terminals which find a representation in the parse-tree. It is possibly not a bad idea there are some more and I don't doubt the grammar can't be improved but sometimes we might want them to be inlined and expanded.

### Re: What about more complex rules?

The question for me is how to make the expression more readable without defining lots of auxiliary non-terminals which find a representation in the parse-tree.

I feel that in Guppy, non-terminal names serve three purposes:

• A handle for referencing a rule in another part of the grammar (the obvious purpose).
• Node names for the resulting AST that guppy produces (I guess you inferred this).
• A form of documentation.

The third bullet point is the key to making the rule you gave more approachable (though possibly more verbose). Also, thanks to guppy's nested rules and lexical scope, the rule names can be short and sweet. How does this look?

varargslist
| args ','?
args: arg (',' arg)*
arg: def default?
def:  NAME | '(' tuple ')'
tuple:  def (',' def)* ','?
default:  '=' test
variadic-args:  list (',' dict)?  |  dict
list:  '*' NAME
dict:  '**' NAME


Note that the rules under varargslist can be flattened, but the indentation helps to make it clear what rules are needed where and to indicate structure visually.

As for all those auxiliary non-terminals, in my vision for guppy, the resulting AST will typically be handled with selectors which can easily ignore node names that don't matter to the application (think CSS).

Also, you inadvertently brought up an important issue. Consider ',' '**'. In the current version of guppy (which can be found in my mind), this forms ',**', which is actually 3 character symbols, not 2. However, '**' is intended to be one symbol. One thing I don't like about bison/flex is that you have to name tokens of more than two characters (e.g. PLUSPLUS for '++'). Perhaps a string quoted in backticks should be treated as a non-terminal symbol name. However, there also needs to be a mechanism to indicate that operators like ** and ++ should automatically be expanded to characters at the end of production (rather than being treated as mistakenly not defining nonterminals).

### This looks like special

This looks like special indentation for the RHS of the rule and the flying colon preceding the rule definition feels like a syntax error.

I came up with another idea - a dialect if you want - which uses an 'inline' keyword which is basically a substitution macro which eliminates an auxiliary non-terminal. When the user writes

def: NAME | '(' tuple ')'
inline tuple:  def (',' def)* ','?


no node for 'tuple' is created but 'def' becomes

def: NAME | '(' def (',' def)* ','? ')'


Now one can define an immaterial context for the RHS of 'def'

def:
inline self:
NAME
'(' tuple ')'
tuple:  def (',' def)* ','?


We can sugar this by setting

@self = inline self


which yields

def:
@self:
NAME
'(' tuple ')'
tuple:  def (',' def)* ','?


The 'self' block is supposed to contain the rhs of the rule 'def'. Since it is inlined no non-terminal for 'self' is created. If both 'self' is omitted and there is no immediate thunk after 'def' it is assumed that the complete 'def' block is 'self' - the rhs of the rule.

### The problem is theory, and not implementation

It is the deficiency of theory that limits wide adoption of parsing technology, and not a particular tool. The void has been filled by XML -- apparently, even modest utility trumps ugliness.

The theory has been shaped in the 60s when limitations of processing power triggered adoption of fast but less generic algorithms, so that parsing textbooks emphasize LR, LL, as opposed to CYK and Earley. Then, there is over reliance on particular parsing method; compare it to, say, database management ideas, which would sound in parsing context like this: "given a grammar and a string, parser outputs parse/AST tree, and, BTW, the particular algorithm is merely an implementation detail". Unfortunately, in a majority of parsing tools today a user has to write a grammar to make the parsing system algorithm happy. From database perspective, this would be akin to writing a query in a certain way. Yes, sometimes we write query to make optimizer happy, but in parsing world the alternative way to write a grammar means failure, rather than suboptimal but still correct execution.

Most of parsing technology I came across in the textbooks are cast in procedural terms. For example, given a parse or AST tree, they suggest some traversal method, and the user is supposed to plug in a function, which would be executed at each node along the way. Wait a minute, I want to query parse tree, and not to hack some side effect of tree navigation! "Is the node A ancestor of B"? "What is the least common descendant of the nodes "IF" and "THEN"?

Next, there is automata theory and the concept of code generation. The former is plain ridiculous: "Automata is 5-tuple...". A little research reveals that it is actually much simpler concept than that: it is essentially iterative matrix with initial and final state vectors -- 3 tuple, if you like. Anyhow, neither of these definitions is something accessible to average programmer. The code generation is just dinosaur from the 60s: emulation is adequate for most applications (such as DSL) today.

### I don't think Joey Adams can

I don't think Joey Adams can change computing science of the last 50 years ( or even has to ) but when you take a look at grammars, even those in public specifications of popular languages you'll notice that they are a bunch of unchecked, horribly broken, at least obscure, mess, often equipped with special notations. On the web they are published as HTML only as if grammars don't want to be used but read on a screen ( most funnily terminals are highlighted not through syntax / punctuation but through italic style ). ANTLR has a large repository but it's all coupled to a particular pgen and its own in-grammar programming language. I'd call this programming language envy of grammars. Consequently many programmers just use the notations of their favourite language and invent their own EDSLs.

There is more than just one void that has been filled with XML.

Finding a more accessible formalism to describe FSMs is one thing ( I agree with you here ), another one is to ship total crap. Not everything can and must be treated at the same time by a single author.

### *Abstract* parse tree?

You give Guppy a grammar definition and an input string, and Guppy gives you an AST whose nodes are named after production rule symbols.

Constructing an AST is simple, but not quite that simple. That is, you can trivially construct a parse tree by recording the productions used, but the result would not be an abstract parse tree. To take your basic arithmetic grammar example, a trivial parse tree constructed in this way for an expression 45-3 would be something like

expr(expr(number(int(dec("4", "5"))), "-", expr(number(int("3"))))

A decent AST should be more, well, abstract. Whatever ends up using your parser would probably have much easier time with an output like

subtraction(int("45"), int("3"))

To construct a proper abstract parse tree, you need to distinguish between those productions that should map into parse tree nodes and those that are auxiliary. You can deduce that automatically to some extent, but for a full control of the output AST your grammar language will need some manual overrides.

### Re: *Abstract* parse tree?

I apologize; I neglected to look up the difference between "[concrete] parse tree" and "abstract syntax tree". Most places where I said "AST", I meant "parse tree".

Note that the rule for subtraction could be labeled as such, so the parse tree for this case might actually be something like:

expr(subtraction(expr(number(int(dec("45")))), "-", expr(number(int(dec("3"))))))


To echo what I said in an earlier post:

As for all those auxiliary non-terminals, in my vision for guppy, the resulting [parse tree] will typically be handled with selectors which can easily ignore node names that don't matter to the application (think CSS).

By labeling subtraction: in the grammar, the program handling the parse tree can still identify the operation like it would with a cleaned-up AST. The difference is that, in order to filter out the extraneous nonterminals, such a program would typically be interested in selecting descendants of a node, not immediate children.

A problem with this approach, though, is that it sort of dictates a programming style for the person who writes code handling the parse tree. Converting the parse tree to a cleaned-up AST or ASG is likely a tedious and common task (you're the second person who's brought up the issue of nonterminal clutter). Perhaps guppy should go one step further and provide semantic analysis support. Whether or not it should be embedded into the grammar syntax or stand outside, I'm not sure. Substitution of parse tree nodes could be based on simple pattern-matching and should suffice to provide decent AST construction.

Also, if the user doesn't care to learn about tree filtering right in guppy right away, guppy might provide an API for traversing the tree with selectors.

Another consideration is to support flagging non-terminals for omission from the parse tree. The problem with this is that some applications may be interested in those non-terminals. Remember, guppy modules (grammars for various languages bundled for use with guppy) could find use in several applications and shell scripts.

### Operator precedence

I heard that APL had the right-most operator have higher precedence on each line so you could just read the expression from left to right.

### Semantic actions are a death

Semantic actions are a death trap for each grammar formalism that aims to be toolkit agnostic and portable. One ends up with a grammar formalism for Yacc, one for ANTLR, one for the Guppy Parsing Engine ... and of course XML in the outset. XML has the virtue of not specifying anything like this but outsources it into APIs + a mode of element access ( e.g. SAX ) or secondary languages like XSLT. After all XML seems to be the creation of an evil genius. It is not without flaws though because it is still not verbose and ugly enough to discourage people from specifying XML based scripting languages.

### agreed

That's the gist of the comments I edited out (except the xml rant, but I agree with that too).

### Avoiding semantic actions

Avoiding semantic actions was my objective in designing Recursive Adaptive Grammars — the idea being that an impedance mismatch inevitably occurs at the interface between the separate computation models of grammar and semantic ations, so to avoid the impedance mismatch, do everything within the tree-structured processing of the grammar, by enabling the grammar to call itself recursively. (In principle one could construct an LR parser for RAGs that could handle most reasonably structured RAGs, including reasonably structured Turing-powerful RAGs; I've never pursued that because I got drawn away into fexprs.)

### Context sensitivity

From my point of view the problem which is stated in the introductory chapter 3.1. doesn't have to be expressed on the grammar level. The constraints can be easily applied using post-processing on an available parse tree.

Other context sensitive dependencies might be better expressed using additional terminals that are not available from the lexical analysis process which shall be kept simple. For example a C grammar can use a TYPENAME terminal which is not directly produced by the lexer but requires a prior NAME -> TYPENAME transformation, where NAME is produced by the lexer. NAME and TYPENAME have different token-types but share same token string. Another example is the use of INDENT and DEDENT terminals in Pythons grammar.

The heuristics is to split the parsing process into three independent steps expressed by the following composable functions:

lexer: string -> token-stream             ( lexer CFG )
postlexer: token-stream -> token-stream   ( Turing complete PL )
parser: token-stream -> parse-tree        ( parser CFG )


Each of those steps is relatively simple. Lexer and parser are fully generic whereas the postlexer can use any resources the programmer has available in order to build e.g. a symbol table or extract INDENT and DEDENT token from whitespace. The lexer CFG and the parser CFG might not completely fit together at the token / terminal boundary. If they do, the postlexer becomes obsolete, otherwise one has to establish contracts between lexer CFG, parser CFG and postlexer. This is an open design problem for a grammar formalism, but it is not a hard one.