DSLs and operator associativity/precedence

I've been doing a lot of OpenGL coding in C++ for my work, and it occurred to me that C++ is a much better language for making DSLs than I first thought. C macros (yes, I know, *shudder*), templates and inline functions, and operator overloading offer a pretty good degree of flexibility. But not all is well. From the C++ FAQ, http://www.parashift.com/c++-faq-lite/operator-overloading.html:

[13.7] Can I create a operator** for "to-the-power-of" operations?

Nope.

The names of, precedence of, associativity of, and arity of operators is fixed by the language. There is no operator** in C++, so you cannot create one for a class type.

If you're in doubt, consider that x ** y is the same as x * (*y) (in other words, the compiler assumes y is a pointer). Besides, operator overloading is just syntactic sugar for function calls. Although this particular syntactic sugar can be very sweet, it doesn't add anything fundamental. I suggest you overload pow(base,exponent) (a double precision version is in ).

By the way, operator^ can work for to-the-power-of, except it has the wrong precedence and associativity.

So maybe C++ is not that great for DSLs :) My immediate thought was, how does common lisp handle this, seeing as its macro facility is often referenced as being an excellent tool for DSLs, but then I realized that its prefix notation + s-expressions prevents associativity and precedence from ever being an issue. The order of evaluation is always explicit.

How do languages that don't use lisp's syntax handle this? Are there any that let you hand define the associativity and precedence? The latter seems like a particularly hard problem because different areas of the code could define their own unique operators, etc.

Another limitation not mentioned in the FAQ entry is that you can only overload operators when at least 1 of the operands is a user defined type. So you can't change how primitive and standard library types interact at all.

SML lets you define operator

SML lets you define operator precedence. I've always been divided about this though. It is a useful technique, and I don't like the idea of the in-built operators just being "magic", with no way to implement them in code. But neither do I like SML's approach - it seems like a brittle hack. Yet another part of me wants to leave operators out of the language, apart from the commonly recognised mathematical ones.

It's a difficult issue, and might be best dealt with at the level of a preprocessor like camlp4 along with other syntax extensions.

By rebooted at Tue, 2006-07-04 19:46 | login or register to post comments

Take a Look at Prolog!

Of all the languages I've worked with, Prolog has the best support for creating and managing new operators. You have full flexibility to control precedence and associativity.

When declaring a new operator you can choose from

fx
fy
xfx
xfy
yfx
yfy
xf
yf

where x means the argument can contain any operators of lower precedence
y means the argument can contain operators of the same or lower precedence
f signifies the function itself

The actual declaration then takes the form of a clause like this:

op( [precedence level], [prefix, infix, or postfix & associativity specification as listed above], [operator name]).

Operator invocations are then treated as if they were entered as conventional nested clauses without any special syntax.

See section 5.5 Declaring Operators of Clocksin & Mellish (3rd ed.) (ISBN 3-504-17539-3)

For practical uses of operators in a major application, see "Prolog and Natural-Language Analysis" by Pereira and Schieber (ISBN 0-937073-18-0)

I know of no other language that gives you this level of expressive power with so little effort on your part and when used along the lines of the NLP example it dramatically improves code legibility.

By Peter J. Wasilko at Tue, 2006-07-04 19:51 | login or register to post comments

Check out the OBJ family of

Check out the OBJ family of language (OBJ3 for a specific reference). I've mentioned it here on a mildly related vein before as well as other places.

By Derek Elkins at Wed, 2006-07-05 20:49 | login or register to post comments

Haskell

Haskell allows to define operator precedence and fixity, by having 9 numbered precedences.

It has only binary operators (except two builtin exceptions). You can't change which tokens are operators.

Any identifier can be turned into an operator by putting it inside `backquotes`. Such operators can have declared precedence and fixity too.

Having only binary operators solves the problem of finding boundaries between tokens: a sequence of punctuation and symbol characters is a single operator.

I'm not a fan of elaborate customization of the surface syntax. It puts an extra burden on learning a library. It complicates parsing significantly, especially for external tools (like documentation extraction). It makes name conflicts between libraries less pleasant. For me the ability to form operators from identifiers and to specialize standard arithmetic and relational operators for custom types is more important than inventing new sequences of funny characters and customizing operator precedence.

By Qrczak at Tue, 2006-07-04 22:45 | login or register to post comments

Why operators make Haskell the DSL host language of Choice

Qrczak said:
I'm not a fan of elaborate customization of the surface syntax. It puts an extra burden on learning a library. It complicates parsing significantly, especially for external tools (like documentation extraction).

But what about all the combinator-based DSL of Haskell? I agree, that it adds an extra burden, but it usually pays off. Imagine, the parser combinators without operators, such as:

parseHello = parseText "Hello" <|> parseText "Hi" <|> parseTest "Hey"

Yes, the extra burder is there. Once: for the library maintainers, but it ends up being grammatically equivelant to a language designed for the purpose of parsing. It also (in this example) prevents the need for two layers of parentheses.

When doing a lot of math, it really matters whether you have to type:

brutto = btw * (netto - discount)

Instead of

brutton = times btw (minus netto discount)

Same goes for a parser program (which usually contains little math).
The thing is, its just weird to give math such exceptional status, as to having operator support, eventhough most programming doesn't involve math.

The thing is, should it be one size fits all badly? or one size fits all stretchy? I prefer stretchy and I haven't seen any misuse of user-defined operators in Haskell yet.

It's all about absraction: should we be thinking in function calls, execution order and return values, or should we be thinking in product and sum parsers when writing a parser? I prefer to think in parsers, when writing a parser. To think in math, when writing a math calculation and to think in atomic and sequential blocks when doing transactions.

The more natural an abstraction can be modelled in a language, the better we can forget about its implementation and focus on modelling our problem (rather than modelling how the problem has to be solved by the enforced sematinc and syntactical environment)

By mister r at Wed, 2006-07-05 00:35 | login or register to post comments

smalltalk

As far as I can tell, smalltalk allows for arbitrary creation of operators, unfortunately it does it in one of the worst ways I can imagine--there is no operator precedence, so everything is evaluated left to right. This is probably worse than having fixed precedence; I'd probably even prefer lisp-style forced parens over this.

edit:
Oh, and just so you know, flexible operator precedence is actually somewhat hard to do. Think through it and you'll realize it's not worth it.

By Curtis W at Wed, 2006-07-05 00:13 | login or register to post comments

flexible operator precendence

Curtis w said:
Oh, and just so you know, flexible operator precedence is actually somewhat hard to do. Think through it and you'll realize it's not worth it.

Well in the presence of typing it could just 'work' itself out. Given all possible interpretations of a sentence containg several operators, the type could be used to infer the 'intended' structure of the sentence (i.e. the only legal one).

If you have more than one legal interpretation, you must be dealing with the same type of function associated to the operators, in which case some explicit order would be required. But this precendce information will only need to be provided per operator-function-type.

Which makes it very local. There doesn't have to be any precedense rules between operators of type (Boolean -> Boolean -> Boolean) and operators of type (Number -> Number -> Number). There will always be just one legal interpretation, based on the inferred types.

By mister r at Wed, 2006-07-05 00:48 | login or register to post comments

Smalltalk precedence

I confess I like Smalltalk's left to right evalution of binary operators, without precedence. Note Smalltalk methods don't all have the same precedence. Unary methods take precedence over binary operators, which take precedence over keyword methods.

Curtis W: I'd probably even prefer lisp-style forced parens over this.

Of course in Smalltalk you can use parens to force evaluation order too. So in a sense you can say lisp-style forced parens are a subset of Smalltalk syntax, so you do have it in Smalltalk.

If I recall correctly, Smalltalk doesn't use backquotes for anything, and one might add Haskell style binary operators as an extension, if you don't mind backquoting the operators. (But if someone thinks using parens is too verbose, it's hard to see how using backquotes around operators is less verbose.)

(Traditional Smalltalk binary operators are composed of one or two glyphs from a specific set of special characters. But I don't see why you couldn't use any number of characters from this set; no lexical or grammatical ambiguity results from longer binary method names.)

By Rys McCusker at Wed, 2006-07-05 02:42 | login or register to post comments

Interesting

I've been toying with the idea of occluding operator precidence from my feature list in favour of left to right evaluation order but have been avoiding the idea since I figured that no langugage omitting that feature would be considered 'A Real Languge'TM. Good to see that at least one omitted it; now I don't feel so bad.

When I was taught programming way back when my instructors said you should never rely on operator precidence and use parens to force the evaluation order that you desired. This is great for omitting common mistakes for beginners and of course after a while you know which operations will happen first so you can omit the occasional set of parens, would strict left to right evaluation order be all that bad for a language (since it simplifies the grammar rules and also alows for an easier addition of arbitrary infix operators (such as the ** mentioned above) to the grammar without the need to waste a load of thought on that precidence value)?

Just a thought.

By kruhft at Wed, 2006-07-05 02:56 | login or register to post comments

APL and J also doesn't have

APL and J also doesn't have any operator precidence, but they evaluate from right to left. I think that is somehow more sensible for a none object oriented language.

for example:

1:2:3:[]

x = y + z

a -> b -> c

By Felicia Li Svilling at Wed, 2006-07-05 03:08 | login or register to post comments

APL operator precedence

Here's a quote from K.E. Iverson, "Conventions Governing Order of Execution" in
"A Sourcebook in APL", pp. 31-32:

"As further functions are introduced ... the complexity grows and the utility of any relative priority of execution among the functions decreases. ... Such a system is usually very complex (Algol, one of the best known, has nine priority levels) and can therefore be used efficiently only by a programmer who employs it frequently. The occasional (and the prudent) programmer avoids the whole issue by including all the parentheses which would have been required with no convention."

By dbremner at Tue, 2006-07-11 01:04 | login or register to post comments

avoiding precedence is an option

Precedence is hard to explain to folks new to programming, which might be why it was missing from Smalltalk, since the makers hoped non-programmers would get involved (like pre-teen kids).

Leaving out precedence might be a good idea for languages not really targeted at math wonks (who are apt to be aggrieved if * and / don't have higher precedence than + and -). Or if you want your language source code to be obvious to folks less smart than you (including yourself when you're debugging :-), leaving out precedence makes code smaller, simpler, and more direct.

kruhft: When I was taught programming way back when my instructors said you should never rely on operator precidence and use parens to force the evaluation order that you desired.

That roughly accords with common engineering practice, except precedence among certain operators is usually taken for granted. (Most folks assume everyone knows operator->() has high precedence in C.) I tend to use a lot of parens because I want my code to be obvious to folks who look at my code later, when they barely have an extra neuron left over in the middle of debugging.

The last time I implemented operator precedence was for the ESI (edge side include) markup parser I added to my HTML parser, two servers ago. The ESI spec from Akamai and Oracle had conditional compilation using if forms containing test expressions employing operators with precedence -- it was a required feature I couldn't shirk. But coming up with the elegant expression stack in C++ must have taken several hours to spec, write, test, and debug. But it seemed like a waste of time, even if was kinda neat.

Programmers might get attached to various items of technology because they revel in the joy of their own cleverness from apprehending exactly how an elegant piece of business works. It's fun to implement operator precedence; but the code itself isn't very obvious at first glance. Is there any clear benefit to precedence that's not merely the absence of a few more parens?

By Rys McCusker at Wed, 2006-07-05 04:54 | login or register to post comments

maybe not, but ...

Is there any clear benefit to precedence that's not merely the absence of a few more parens?

Maybe not, but it's what people are used to. For example, I think that prefix or postfix notation erect sort of a psychological barrier in many potential users.
At least, every time I do some calculation with the unix tool dc (which uses polish notation so that "2 3 + p" prints 5) while some team member looks over my shoulder, they are very astonished.

History of programming languages seems to suggest that those with clever precedence rules can be quite succesful. Think of perl, for example. Most programmers are also lazy, preferring to write
drink $beer unless abstinent $you;
instead
if (!($abstinent($you))) { drink($beer); }

By Ingo Wechsung at Wed, 2006-07-05 09:17 | login or register to post comments

Leaving out precedence

Leaving out precedence might be a good idea for languages not really targeted at math wonks (who are apt to be aggrieved if * and / don't have higher precedence than + and -). Or if you want your language source code to be obvious to folks less smart than you (including yourself when you're debugging :-), leaving out precedence makes code smaller, simpler, and more direct.

I suspect anyone who's taken high school algebra is pretty much used to and reliant on precedence in the same way that they're reliant on infix notation. It's much easier to conform than it is to break all the rules and cut loose, especially when what you're conforming with has been reinforced since childhood.

By Curtis W at Sun, 2006-07-09 21:21 | login or register to post comments

I can dance to that tune

Many folks who took algebra didn't seem to understand it. :-) You probably mean the ones with an aptitude for algebra. (I was a local wiz at algebra; the teacher predicted I'd be a nuclear physicist. Instead I became a babysitter for cantankerous digital plumbing.)

Curtis W: It's much easier to conform than it is to break all the rules and cut loose, especially when what you're conforming with has been reinforced since childhood.

It's easier to conform when the only binary operators are those familiar to folks from lessons reinforced from childhood. If there are many more binary operators, who's to say what the precedence should be? Since precedence is implicit where usage occurs, it seems a way to confuse readers, especially as the options for operators increase. Smalltalk has a lot of operators, and could easily have thousands more just by allowing longer binary operator names.

Sometimes it's easier (or clearer) to unify by relaxing constraints or old conventions, if there's no good way to impose a system on additions that dwarf an original corpus of well known items.

The phrase "break all the rules" seems a bit strong, as if precedence for a handful of math operators ought to dominate all operators in all computing domains.

By Rys McCusker at Mon, 2006-07-10 05:56 | login or register to post comments

Many folks who took algebra

Many folks who took algebra didn't seem to understand it. :-) You probably mean the ones with an aptitude for algebra.

True, although the one's that aren't tend to become at least decently good after learning how to program ;-)

It's easier to conform when the only binary operators are those familiar to folks from lessons reinforced from childhood. If there are many more binary operators, who's to say what the precedence should be? Since precedence is implicit where usage occurs, it seems a way to confuse readers, especially as the options for operators increase.

Keep in mind that I was also referring to programmers, who are most likely to use a programming language. I suspect they're used to a bit more operators than normal people and can probably figure out operator precedence; see: logical and and or. It becomes fairly obvious what their precedence is after having looked at code for a while.

Sometimes it's easier (or clearer) to unify by relaxing constraints or old conventions, if there's no good way to impose a system on additions that dwarf an original corpus of well known items.

Sometimes, although I don't see the need for it here. Additional operators are usually at least somewhat intuitive.

The phrase "break all the rules" seems a bit strong, as if precedence for a handful of math operators ought to dominate all operators in all computing domains.

Well, if it doesn't, you most certainly shouldn't keep the same interface, lest you confuse your userbase.

By Curtis W at Mon, 2006-07-10 13:52 | login or register to post comments

I agree with Rys.

I'm already using all of my brain cells to figure out how to solve the particular problem I'm currently working on (fixing a bug in legacy code for example). I don't need to run across something like this in the mean time:

z = x + y >> m | n / j;

Sure, I could look up the operator precedence table to figure out what is going on (maybe this line of code contains the bug?), but those are extra (irrelevant) details that I'm going to have to keep in my head (even if for a short period of time). If the precedence of this code were made explicit, then I wouldn't have to think as much about what is going on.

By Benjamin D. Cutler at Mon, 2006-07-10 14:54 | login or register to post comments

That's true, but having no

That's true, but having no operator precedence is a tradeoff. Specifically, it does make certain pieces of code easier to read, but it can also be deceptive. For example, what do you think the results of this expression would be?

2 + 4*8

In most programming languages and in mathematics, that expression would yeild 34. In smalltalk, it'd be 48. So, it might be true that smalltalk's approach can make certain, rather rare, expressions easier to read, it can also lead to a disparity between what looks right and what is right, which can lead to subtle bugs.

By Curtis W at Mon, 2006-07-10 15:03 | login or register to post comments

The key ingredients to

The key ingredients to writing efficient domain specific embedded languages (DSELs not DSLs) in C++ is expression templates, operator overloading, some generic programming techniques (such as compile-time tag dispatching), type traits and in general template metaprogramming. A little macro metaprogramming for eliminating repetitive boilerplate (meta)code of template metaprogramming.

To make this all less painful you would (or should) use boost mpl, boost fusion v2, boost.preprocessor, and std::tr1/boost type traits.

There are some tricks/hacks to get over the limitations of C++'s operator overloading model to handle unary operators, associativity, and precedence levels of operators.

With the case of intermediate values and C++'s current lack of local inference, one can either use automatic type erasor techniques or non-standard language extension for inferring types of complex expressions.

If one looks at libraries such as boost.spirit they are essentially based on the ideas of combinator libraries.

If you're really interested then i recommend reading the book C++ Template Metaprogramming : Concepts, Tools, and Techniques from Boost and Beyond. The last two chapters are devoted to the subject of DESLs in C++ and in general.

Another limitation not mentioned in the FAQ entry is that you can only overload operators when at least 1 of the operands is a user defined type. So you can't change how primitive and standard library types interact at all.

To certain degree you can mitigate this issue by making both operands parameterized and using expression templates.

By snk_kid at Wed, 2006-07-05 11:05 | login or register to post comments

One thing to note that

One thing to note that template metaprogramming is purely functional in nature & untyped so one has to be prepared for that but boost mpl can make the transition easier since it's modelled around the C++ standard library containers & algorithms, boost lambda library (a DSEL in it's self) etc, etc so it should be much more familiar to the average C++ programmer.

By snk_kid at Wed, 2006-07-05 11:20 | login or register to post comments

Another limitation not

Another limitation not mentioned in the FAQ entry is that you can only overload operators when at least 1 of the operands is a user defined type. So you can't change how primitive and standard library types interact at all.

that is not correct. it is perfectly legal to define your own operators for anything but fundamental (primitive) type only arguments. standard library types are what you call user defined in that context.

you can even hide predefined std:: operators, which allows to you 'redefine' standard operator behaviour. (that is, until koenig lookup bites you...)

on topic:

in my experience the most lacking feature for flexible dsls in c++ are (lexically scoped) closures. missing operators may require a few kludges in the syntax, but the lack of closures seriously restricts interaction between the dsl and c++.

cumbersome workarounds like the different available lambda libraries do not suffice in that regard.

the nice thing is that closures will be added with c++0x.

By peter steiner at Wed, 2006-07-05 12:33 | login or register to post comments

False

Can I create a operator** for "to-the-power-of" operations?

Correct answer: yes.

I'm not claiming that what follows is good practice. Just that the original statement is false, or at least needs some qualification.


#include <iostream>

using namespace std;

class Int {
public:
    int x;
    Int(int x0) : x(x0) { }
    void write()
    {
        cout << x << endl;
    }
};

class Star {
public:
    int x;
    Star(int x0) : x(x0) { }
};

Int operator*(Int a,Int b)
{
    return Int(a.x*b.x);
}

Star operator*(Int a)
{
    return Star(a.x);
}

int pow(int a,int b)
{
    return b==0 ? 1 : a*pow(a,b-1);
}

Int operator*(Int a,Star b)
{
    return Int(pow(a.x,b.x));
}

int main()
{
    (Int(2)*Int(3)).write(); // Product
    (Int(2)**Int(3)).write(); // Power
}

By sigfpe at Wed, 2006-07-05 16:15 | login or register to post comments

Exercise

A fun and related exercise is to write code that allows for chaining of the relational operators <, <=, > and >=. For example, if e1, e2 and e2 are Int-valued expressions with Int values v1, v2 and v3 then e1 <= e2 < e3 should have the same semantics as v1 <= v2 && v2 < v3.

By Per Vognsen at Sat, 2006-07-08 13:37 | login or register to post comments

top-down parsing of inline operators with precedence?

off the top of my head, it doesn't seem like this would be easy. am i missing something?

(vaguely related - looks like grune + jacobs 2nd edn will be out in december; anyone know of an online source for top down operator precedence?)

By andrew cooke at Fri, 2006-07-07 17:51 | login or register to post comments

clarify 'top-down' here?

andrew cooke: top-down parsing of inline operators with precedence? off the top of my head, it doesn't seem like this would be easy. am i missing something?

I'm unsure what you mean by "top-down" -- one pass in recursive descent generating an AST? By inline, you mean the operators are static instead of dynamic (so you can emit inline code)? Code emitted for an esi compiler is basically inline for the test operators.

If you mean that, I can likely page up my memory from a few years ago; the way I did it is almost on the tip of my tongue; maybe I'll have it by tomorrow. It was hard until I had an insight involving a kind of push down stack of nodes for a binary expression tree. Until the trick, the description was nasty. It'll come back to me.

By Rys McCusker at Fri, 2006-07-07 19:26 | login or register to post comments

well, i did mean recursive

well, i did mean recursive descent, but your question made me realise i hadn't thought before opening my mouth. this "just works" provided you take a bit of care with your grammar (it's in the standard arithmetic example and i've written such parsers myself - duh). sorry.

By andrew cooke at Fri, 2006-07-07 19:57 | login or register to post comments

has less obvious extension (no apology needed)

I didn't mean to make it sound like you were missing something obvious. The way something's described can easily point far away from familiar territory. I didn't notice myself we were nearly discussing standard arithmetic grammars with factors and terms in expressions. II tend to be overly literal; that's why I was drawing you out.

andrew cooke: (it's in the standard arithmetic example and i've written such parsers myself - duh)

I was actually trying to find out whether you wanted technique for no grammar and arbitrary levels of precedence, though you can still simulate a grammar and recursive descent. I avoided recursion with an explicit expression stack (basically simulating what you'd get with recursion), which is less code (fewer functions) but probably less clear than recursion.

By Rys McCusker at Fri, 2006-07-07 22:30 | login or register to post comments

that too

I was actually trying to find out whether you wanted technique for no grammar and arbitrary levels of precedence

yes, that's what i had at the back of my mind. in other words, you write the parser before knowing what the operators will be (and also whether they will be infix or prefix).

looking at the code for the parser (thanks to whoever provided that link (i can't see who it was from the "add new comment page")) it seems that does this, but i've not yet worked out how (my intuition is that it's something like the difference between pre and post-order traversal of a tree).

i'm at home this week, so am thinking of going to the university library and digging out the various acm papers i've wanted to read but been unable to access - unfortunately i never thought to make a list over the years....

By andrew cooke at Sat, 2006-07-08 14:27 | login or register to post comments

pseudo code in C++ for arbitrary precedence

(Today I finished a gruesome move from Palo Alto to San Jose, except for the unpacking. Then I saw a good tear jerker of a movie (Lake House) and wrote some sample C++ code below to illustrate one pass recursive descent parsing of expressions with binary operators using arbitrary precedence. Maybe tomorrow I'll resume the gc technote I started drafting recently.)

Please note all C++ code below is pseudo code that hasn't been compiled or checked better than by eye. But it was my intention you could provide class definitions to make it work. Quite a few lines of code are minimal, cursory error handling, since this seemed better than none at all. The style used here attempts to minimize both horizontal and vertical space used, with two space indents and occasional one line stacking of normally multi line expressions.

Before I explain the code, first a basic explanation of the simple technique in plain english, which might not seem really crisp and exact until you look at the sample code. Suppose you start with a null terminated C string containing (say ascii) text something like "3 + 4 * 5**2 - 9 / -3" and assume ** has higher precedence than * and / (which have higher precedence than + and -). The code should generate a binary tree of nodes where all nodes have left and right children even if nil and not used. (Unary expressions are expressed using right child only.)

A basic strategy is to use a root "identity" node that returns the unchanged value of the right child (whatever it evaluates to). This identity operator has lowest possible precedence (say zero) so all other operators will have higher precedence. This guarantees that whenever an identity node is the root of a subtree, it will remain there because no other higher precedence subtree will attempt to swap places with it. The next paragraph explains this in terms of expected patterns in the tree.

In a binary tree of expressions of mixed precedence, when no parentheses are present, all higher precedence nodes will appear lower in the tree, so when emitting post order code the operands will evaluate before the application of operators, and higher precedence operators will go before lower. As you go downward in an expression tree, precedence monotonically increases. When two successive operators have the same precedence, the one appearing first (to the left in the input) becomes the left child of the parent successor operator, so code for first/left operator is emitted first.

The recursive approach reads a left hand side first, then an operator, then the right hand side operand. If another operator appears after that, what happens depends on the precedence. If the new operator has higher precedence, it will become the new right child and steal the former right operand as it's own left operand. Conside what happens after parsing "3 + 4" followed by "* 5". The tree after"3 + 4" looks like this:

      +
    /   \
  3       4

But the * operator steals 4 as its own left child, and * becomes the new right child of +:

So here's how the recursion works: after reading a rhs operand, if it is followed by an operator of higher precedence, a new node is added that rotates the old rhs to the left, the the new rhs of the new operator is read recursively. When an operator is read that is not higher precedence, it is pushed back on the scanner's input token stream, and the stack pops to let the caller try the next token, in case it is higher precedence than the next node up in the tree.

Okay, the following method is pseudo code for an entry point. Assuming you already have a Parser object (which allocates Node subclass instances as a factory), it takes an input null-terminated C string containing the expression to parse, and returns the root node of a binary expression tree. This code makes an identity node with a nil left child, then calls exprRight() to parse the right (and only) child of the identity pseudo operator. Afterwards, the string is expected to be consumed, so the next token ought to be the moral equivalent of eof.

Node* Parser::stringToExpr(const char* str)
{ // e_identity has lower precedence than ALL binary operators
  Node* expr = new(*this) BinaryExpr(ExprScanner::e_identity);
  if (!expr) { this->outofmem(); return expr; }
  if (str && *str && !this->allspace(str)) { // not empty or all whitespace?
    ExprScanner scanner(str, this); // plausible scanner to tokenize strings
    if ( this->exprRight(expr, scanner, /*depth*/ 1) ) { // get expr->m_right
      int tok = scanner.token();
      if ( tok != ExprScanner::e_eof )
        this->exprExpected(ExprScanner::e_eof, tok);
    }
  }
  return expr;
}

The following higherPrecedence() method just returns whether the next binary operator token has higher precedence than that of the parent expression node. (Using this method shortens the test in the while loop in exprRight() below.)

inline bool Parser::higherPrecedence(int token, Node* expr)
{ // true if token has higher precedence than expr precedence
  return this->tokenToPrecedence(token) > expr->exprPrecedence();
}

The first thing exprRight() does is read an "atom" which assumes a grammar something like the following:

expr ::= atom | atom binop expr
atom ::= datum | unop atom | ( expr )
datum ::= literal | variable

So atom here means the subpart of an expression with no binary operators (unless inside parens). The code below gets an atom and then loops on the next token as long as the token is a binary operator with higher precedence. When it's not, the loop ends and the token is pushed back for the caller to handle. When the next token is a higher precedence binop, a new operator steals the current right child as the left child of a new node; then exprRight() is called recursively to read the right child of the new node.

Note the first time exprRight() is called from stringToExpr(), the expr argument will be an identity node with lowest precedence, so any operator seen will have higher precedence. So in the top level call to exprRight(), every time any binary operator is seen, it will push the current expression down to the left and recursively read the right child for a new node.

bool Parser::exprRight(Node* expr, ExprScanner& s, int depth)
{ // get RHS value into expr->m_right
  if ( this->exprAtom(&expr->m_right, s, depth + 1) ) {
    int tok = 0;
    while (s.isBinary(tok = s.token()) && higherPrecedence(tok, expr) ) {
      Node* binary = new(*this) BinaryExpr(tok);
      if (!binary) { this->outofmem(); return false; }                
      binary->m_left = expr->m_right; // old right becomes left of binary
      expr->m_right = binary; // high precedence binary replaces old right
      if ( !this->exprRight(binary, s, depth + 1) ) // recursion
        return false;
    }
    s.pushBack(tok); // let the caller rescan this token
    return true; // succeeded in populating expr->m_right
  }
  return false; // failed to get right child
}

The following exprAtom() method handles calls itself recursively if the prefix is a unary operator. Otherwise it expects a literal, a variable, or a parenthesized expression.

#define Expr_kMaxDepth 50 /*arbitrary; pick any value you like*/
bool Parser::exprAtom(Node** outExpr, ExprScanner& s, int depth)
{ // requires a leading unary operator, or a leading lhs token
  if (depth > Expr_kMaxDepth) { this->exprOverflow(); return false; }
  int tok = s.token();
  if (s.isUnary(tok)) {
    if ( tok == ExprScanner::e_minus ) // binary minus?
      tok = ExprScanner::e_neg; // change to unary
    Node* unary = new(*this) UnaryExpr(tok);
    if (!unary) { this->outofmem(); return false; }                
    *outExpr = unary;
    return this->exprAtom(&unary->m_right, s, depth + 1);
  }  
  switch ( tok ) { // handle several left hand side tokens
    case '(': { // identity nodes have lowest precedence of all:
        Node* paren = new(*this) ParenExpr(ExprScanner::e_Identity);
        if (!paren) { this->outofmem(); return false; }                
        *outExpr = paren;
        if (!this->readExpr(&paren->m_right, s, depth + 1) ) // next method
          return false;
        tok = s.token();  
        if ( tok != ')' ) { this->noCloseParen(); return false; }
        return true;
    } // paren
    case ExprScanner::e_literal: 
        *outExpr = s.literal();
        return true;
    case ExprScanner::e_var: {
        *outExpr = s.variable();
        return true;
    default:
      this->unexpectedLhs( tok);
      return false; // false means an error was already reported 
  }  
  return true; // no error reported
}

Then final readExpr() method is only called from inside parentheses in exprAtom(), and it's similar in nature to the original code in the stringToExpr() entry point.

bool Parser::readExpr(Node** outExpr, ExprScanner& s, int depth)
{ // called only from case '(' inside exprAtom()
  if (!this->exprAtom(outExpr, s, depth + 1) ) return false;  
  int tok = s.token();
  if (s.isBinary(tok)) {
    Node* expr = new(*this) BinaryExpr(tok);
    if (!expr) { this->outofmem(); return false; }    
    expr->m_left = *outExpr;
    *outExpr = expr;
    return this->exprRight(expr, s, depth + 1);
  }
  else
    s.pushBack(tok); // not binary; let caller scan again
  return true;
}

Unfortunately it's now midnight and my mind is slightly fried; so I'll just post this as is. If I think of anything else later I'll include an addendum.

By Rys McCusker at Mon, 2006-07-10 07:09 | login or register to post comments

More code

Here's an approach I learned from Dave Gillespie (author of GNU Calc). Take the expression grammar with productions for each precedence level; write a recursive descent parser parameterized by the precedence level, indexing into the table of operator precedences; and notice that you can short-cut the tower of calls from precedence n to n+1 to n+2... by skipping to the precedence of the current token. With no semantic actions this goes like:

void parse_expr(int precedence) {
  for (;;) {
    parse_primary();
    if (!is_infix_op(current_token) || left_precedence(current_token) < precedence)
      return;
    { 
      int r = right_precedence(current_token);
      get_next_token();
      parse_expr(r);
    }
  }
}

IIRC this is how lcc handles expressions (within a recursive-descent parser for the rest of C), and the Pratt parser is similar but with an explicit stack. (I think that goes for your code above, too, but I didn't try to follow it all.)

By Darius Bacon at Tue, 2006-07-11 05:20 | login or register to post comments

Top-down operator precedence

Searching on pratt parser turns up some implementations.

By Darius Bacon at Sat, 2006-07-08 05:17 | login or register to post comments

Lambda the Ultimate

User login

Navigation