Help with Mixfix in Bison?

As BitC moves towards a more human-compatible (sorry lispers!) surface syntax, we're considering mixfix parsing. Since it really applies only in the expression sub-grammar, it seems like a shame not to be able to use Bison (or something similar) for the rest of the grammar.

As near as I can tell, the only way to implement this is to have Bison simply accumulate a token sequence for expressions without trying to deal with precedence at all, and then apply a rewriter on the resulting AST to apply operator precedence rules dynamically.
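
For concreteness, here is roughly what I mean by that second step, sketched in Haskell (names invented; obviously not BitC or Bison code): the grammar hands back one operand plus a flat list of (operator, operand) pairs, and a precedence-climbing walk rebuilds the tree from a table that can grow as user operators are declared:

    import Data.Map (Map)
    import qualified Data.Map as Map

    data Assoc = LeftA | RightA
    data Expr  = Atom String | Bin String Expr Expr deriving Show

    -- operator -> (precedence, associativity); an ordinary runtime
    -- value, so user-declared operators can be added as they appear
    type OpTable = Map String (Int, Assoc)

    -- Rebuild a tree from "lhs op1 e1 op2 e2 ...", absorbing every
    -- operator whose precedence is at least minPrec (precedence
    -- climbing).
    resolve :: OpTable -> Int -> Expr -> [(String, Expr)]
            -> (Expr, [(String, Expr)])
    resolve _   _       lhs [] = (lhs, [])
    resolve tbl minPrec lhs rest@((op, rhs0) : rest') =
      case Map.lookup op tbl of
        Just (p, a) | p >= minPrec ->
          let sub = case a of LeftA -> p + 1; RightA -> p
              (rhs, rest'') = resolve tbl sub rhs0 rest'
          in resolve tbl minPrec (Bin op lhs rhs) rest''
        _ -> (lhs, rest)

    -- e.g. with ops = Map.fromList [("+",(1,LeftA)), ("*",(2,LeftA))]:
    --   resolve ops 0 (Atom "a") [("+", Atom "b"), ("*", Atom "c")]
    --     ==> (Bin "+" (Atom "a") (Bin "*" (Atom "b") (Atom "c")), [])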

Hmm. A kludge may be possible with mid-rule actions and GLR parsing.

Is there a known solution to this, or am I just barking up trees?

Maude

Maude does this, and it sounds hard:

The mixfix front end consists of a bison/flex based parser for Maude's surface syntax, a grammar generator (which generates the context-free grammar (CFG) for the mixfix parts of Maude over the user's signature), a context-free parser, a mixfix pretty printer, a fully reentrant debugger, the built-in functions for quoted identifiers, and the META-LEVEL module, together with a considerable amount of "glue" code holding everything together. Many of the C++ classes are derived from those in the rewrite engine. The Maude parser (MSCP) is implemented using SCP as the formal kernel (Quesada 1998). The techniques used include beta-extended CFGs (that is, CFGs extended with "bubbles" (strings of identifiers) and precedence/gathering patterns). MSCP provides a basis for flexible syntax definition, and an efficient treatment of what might be called syntactic reflection, which is very useful for parsing inputs in extensions of Core Maude such as Full Maude, and in other languages with user-definable syntax that can likewise be implemented in Maude. The point is that we often need to parse the top level syntax, for example of a module, and then extract from it the grammar in which to parse the user-definable expressions in that module.

(from sec. 1.2 of the Maude manual, reference edited).

J. F. Quesada (1998). The SCP parsing algorithm: computational framework and formal properties. Procesamiento del Lenguaje Natural 23:149-156.

Of course, Maude solves a hard problem: to parse an arbitrary CF mixfix grammar, whilst I guess you are only interested in a fixed set of mixfix operators.

Quite the contrary

We're looking at the question of general, user-introduced mixfix operators. If we back down to a fixed set having fixed precedence relationships, it can all be encoded very conventionally.

However, it is possible (perhaps likely) that we will decide to walk away from this, since mixfix is so readily abused.

Curiosity

If we back down to a fixed set having fixed precedence relationships, it can all be encoded very conventionally.

Indeed. I was in fact guessing that you wanted a fixed set of operators, including both mixfix and overloaded ones, together with some limited facility for user-provided operators, which would not be mixfix but might create awkward critical pairs with the mixfix ops.

I'm very interested in general in your plans for migrating from sexprs to algol style. It seems like an excellent strategy to get the semantic content fixed first, in a homogeneous syntactic setting, before deciding how best to present its expressive power to the user. I have the idea that I have heard of other language design projects adopting a similar strategy, but I cannot recall any offhand. Do you plan to maintain the current sexpr syntax as an alternative source representation?

Probably not

Initially, I had hoped to maintain both for a while, but it is becoming progressively clearer that we need to pick one. There are several reasons for this:

  • The space of legal identifiers in the algol variant is quite different from the space of legal identifiers in the LISP syntax; it isn't merely a matter of rotating tokens around. We could restrict the LISP identifiers, but the result would not feel much like LISP. This is an issue that I am still considering. The particular problem is dealing with operators and precedence while preserving more general identifiers, which is what prompted me to start asking about mixfix.
  • In some cases, there are syntactic constructs in the algol syntax that don't have directly corresponding syntax in the S-expr form, so the two variants each turn out to have some AST cases that do not appear in the other. This isn't horrible, except that the symbol resolver is context-dependent, and the nature of the dependencies varies somewhat in the two cases. Given the complexity of the AST, I'm reluctant to try to maintain two forms for very long.
  • We're still cleaning up corners, and the likelihood that we can successfully maintain two inputs seems low.
  • There are really only two of us working on this actively.

So I think that in practice my answer is: we'll try to maintain both forms for as long as we practically can, but at some point we're going to pick one.

As a personal matter, I like LISP syntax in all respects other than indentation (which tends to get deep), but I really don't believe that I can sell that to systems programmers.

While the current surface syntax looks like LISP, it isn't. The number of keywords and the complexity of the AST specification are moderately mind-boggling.

Finally, as I've said elsewhere, the "S-expr as AST" pun breaks down pretty quickly when static typing is introduced. Once that happens, any residual merit of parenthesis as a surface syntax decays pretty quickly. The more pressing issue at that point becomes whether to design a block-structured language or stick with a block-syntaxed expression language (as e.g. ML did). I personally prefer expression languages, but it's not yet clear where we will end up on that.

It is very easy to drown in the details of language design. Few of them are exciting from a computer science point of view, but they are important if you actually want your result to be used. That desire turned out to be my most serious failing as an academic. :-)

Asymmetric evolution

These sound to me, with the exception of the concern about identifiers, like compelling reasons for not evolving the two representations in lock-step.

It still might make sense to maintain sexprs tied to a subset of the algol-like representation, or to abandon sexprs while the corners of the algolesque syntax are worked out, reintroducing them once the design stabilises.

Can you articulate what

Can you articulate what objective is served, in your view, by maintaining an s-expression surface syntax?

Ideological

Maybe none, if the sexprs become too complex, but otherwise there is a certain ideology of program transformation served by sexprs that are manipulated in a manner expressible by quasiquotation (modulo hygiene or not). I guess you must know what I mean, but I shall rehearse it anyway: the ideology is more or less summarised in Bawden's paper on quasiquotation, and it asserts the existence of a maximum on a sort of continuum between sh-like string munging at one extreme and using a properly abstracted interface to some AST manipulation suite at the other.

The value of conforming to this ideology is meant to be that it keeps you honest about syntax, in a way that the ad-hocness of string munging on one hand, and the indirectness of AST manipulation on the other, can't muster.

Postscript: tightened wording of ideology.

I understand, and coming

I understand, and coming from early experience I agree, but see my response to Sam just below.

s-exps and types

Finally, as I've said elsewhere, the "S-expr as AST" pun breaks down pretty quickly when static typing is introduced. Once that happens, any residual merit of parenthesis as a surface syntax decays pretty quickly.

Can you provide a pointer to what you've said about this? It certainly hasn't been my experience with Typed Scheme - the parens seem to have exactly the same advantages (and drawbacks) as in regular Scheme.

An attempt at a response

Hard to give a pointer, since it's a sentence or two here and there.

So first, there are the subjective elements. On the one hand, you are probably right about how things have worked out in Typed Scheme, but there are two aspects of that assessment that are hard to evaluate. The first is that Typed Scheme doesn't have nearly as rich a type system (in some respects) as BitC, and the programmer is ultimately entitled to bail on types altogether. Both factors impact user expectations. I have the impression that BitC seeks stronger up-front precision than Typed Scheme does, and is prepared to give up some interactive convenience to get it. The second issue is that Typed Scheme, because it is a PLT derivative, is targeted first and foremost at Scheme users, whereas BitC is hoping (eventually) to seduce programmers who currently write in C and C++. These people like curly braces (poor souls)!

On to more serious issues. As a LISP user, it has seemed to me (again subjectively) that the S-expr pun works because of three attributes that are largely peculiar to the LISP family:

  • The overwhelming majority of syntactic forms are "regular", in the sense that there are no optional components, or (as in IF, AND, OR) they are all of a kind.
  • There are very few non-expression forms, and consequently very few forms that well-written macros need to work around. Aside: few macros are well-written w.r.t. things like DECLARE.
  • Being dynamically typed, LISP is not bothered by heterogeneous lists. BitC cannot provide comparable convenience.

As the number of non-expression forms rises, and even more as the number of context-sensitive parse elements rises, the S-expr form rapidly loses its convenience, and the complexity of fabricating correct ASTs from within a macro or a reflective programming system seems (to me) to benefit from typing and field names rather quickly.

Eventually the macro author is rebuilding the parser in ad hoc form, and at that point it just seems better all around if the canonical parser is made available from the beginning. A large number of subtly wrong ad hoc parsers does not seem like a feature, and since we need to define the AST for compiler use in any case, there would seem to be little cost in making it official.

Finally, there are multiple uses for ASTs, and in a type-inference based compiler you want to be able to distinguish between types that have been injected by the inference engine and types that were part of the original code. The latter can be thrown away and regenerated at will, and that's actually the right thing to do much of the time — we found several bugs simply by virtue of doing a brute force re-resolve and type after each transform pass. My point, though, is that you want various bits and bobs of ancillary information in the data structure in order for it to be generally useful, and it is surprising how quickly being able to name the fields comes to matter.
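
To make that concrete, here is a hypothetical sketch (in Haskell; not the actual BitC AST) of how the provenance might be recorded, so that a pass can throw away everything the inferencer injected and re-derive it:

    data TypeSource = UserWritten | Inferred deriving (Eq, Show)

    data Ty  = TyName String | TyArrow Ty Ty deriving Show
    data Ast = Ast { astForm :: String               -- which syntactic form
                   , astType :: Maybe (Ty, TypeSource)
                   , astKids :: [Ast] }

    -- Drop inferred types; a later re-resolve/re-infer pass restores
    -- them, which is exactly the brute-force check described above.
    stripInferred :: Ast -> Ast
    stripInferred n =
      n { astType = case astType n of
            Just (_, Inferred) -> Nothing
            other              -> other
        , astKids = map stripInferred (astKids n) }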

Originally I was strongly in favor of the S-expr syntax. Pretty soon, however, we realized that every single convenience syntax we had introduced (e:T, e^, e[ndx]) created an interactive parse lookahead problem. At some point, it's time to stop kidding yourself and admit that the language really does need a first-class parser. OTOH, maybe I got it wrong. Wouldn't be the first time.

I think the summary is that there are a lot of subjective decisions that need to be made in a surface syntax design, and they can really only be validated (and that loosely) in the field.

Originally I was strongly in

Originally I was strongly in favor of the S-expr syntax. Pretty soon, however, we realized that every single convenience syntax we had introduced (e:T, e^, e[ndx]) created an interactive parse lookahead problem. At some point, it's time to stop kidding yourself and admit that the language really does need a first-class parser. OTOH, maybe I got it wrong. Wouldn't be the first time.

You were introducing convenience syntax in the S-expr layer? That's not "S-expr as AST" then, is it? (edit: If you emphasize this wrong it sounds rhetorical in a condescending way - read it the other way)

I'm planning to implement something similar to this, and none of the problems you mentioned would seem to affect my plan (or perhaps I just haven't yet grasped the problems). Are you envisioning any kind of meta-protocol to manage the heterogeneity of your syntax? I think that may be an answer to some of your issues - heterogeneous syntax should be reflected by constructs in the AST having different types.

Every convenience syntax

Every convenience syntax item had a canonical S-expr rewrite:

    e:t => (the t e)
    e^ => (deref e)
    e[ndx] => (vector-nth e ndx)

The problem is that when you see an expression followed by a newline interactively, you need a lookahead token that you don't have. There is a straightforward hacky solution, and it's only a problem in the interactive interface.
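
To illustrate with a toy sketch (in Haskell; not our actual reader): after parsing a complete expression, the reader needs one more token to rule out the postfix forms, and at an interactive newline that token hasn't been typed yet:

    -- Tokens that extend an expression which is already complete.
    continuesExpr :: String -> Bool
    continuesExpr tok = tok `elem` [":", "^", "["]

    data ReadState = Finished | NeedMore deriving Show

    -- Having parsed "e", we must peek; Nothing means the user hit
    -- newline, and a naive reader blocks even though the user may
    -- well be done.
    afterExpr :: Maybe String -> ReadState
    afterExpr Nothing = NeedMore
    afterExpr (Just tok)
      | continuesExpr tok = NeedMore
      | otherwise         = Finished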

On reflection, I think the real issue is that an appalling number of productions are required to support type definitions (as opposed to statements of expression types), and that safe static typing introduces syntax, for things like union-case, that isn't necessary in LISP. A surprising number of these forms are irreducible in a let-polymorphic language. LET, for example, cannot be rewritten as LAMBDA application (see the sketch below). Lots of other examples look close but aren't.
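
For the LET case, here is the standard example, transposed into Haskell syntax for concreteness:

    -- A let-bound identity can be used at two different types...
    ok :: (Int, Bool)
    ok = let myId x = x in (myId 1, myId True)

    -- ...but the "equivalent" LAMBDA application is rejected, because
    -- lambda-bound variables are monomorphic under let-polymorphism:
    --
    --   bad = (\myId -> (myId 1, myId True)) (\x -> x)   -- type error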

Types and Parens

I guess I'm still not sure what you're getting at here. For example, I don't think that Typed Scheme's type system is 'less rich' in any dimension that's relevant to the syntax (it certainly doesn't attempt to talk about representation, but that doesn't seem relevant). I'm also confused by your comments about expression forms. PLT Scheme has many things that aren't expression forms (define, require, provide, etc.) but macros abstracting over such forms work quite well. Type declarations in Typed Scheme are not expression forms:

    (: f (Number -> Number))
    (define (f x) (* x 100))

Finally, it seems like you're confusing having types with the desire to appeal to people who currently use typed languages. Some people do think that () is a bad thing, but that issue seems to me to be orthogonal to types.

Representation matters

So first, I expressed myself badly when I wrote about the richness of the respective type systems, and I will have to try to reconstruct what I was thinking in order to say it better. I tried to be very clear above that many of these issues are subjective, so I don't think that I'm confusing types and user demands.

Representation does matter, because it significantly increases the total number of forms and it increases the requirement for annotation. Mutation also intervenes:

    x:(mutable int32)
    (let ((y:(mutable int32) x))
      y)

are not the same thing at all: the let binds a fresh mutable location that is merely initialized from x, so the result is a copy rather than the original location. It would be very easy for a let transform to get this wrong without relatively complete access to types at transform time -- particularly in the presence of inference.

But on the whole, it is true that the main motivation for a block syntax has much more to do with the community of target users than with the merits of s-expressions.

Perhaps I am mixing things up, and people consider s-expressions and prefix operators to be two different things, but the syntax you gave for function types was a definite surprise. I would have expected the -> symbol to be at the left.

Hmm

I'm pretty confused by your example. Is the point that creating a new variable is very different from just annotating an existing one? If so, I completely agree. The Typed Scheme syntax for annotation is either (using your type)

    (ann x :: (mutable int32))

or

    #{x :: (mutable int32)}

neither of which expands into a `let' (the second is a reader macro for the first).

As for the syntax of `->', types are not expressions in Typed Scheme, and do not appear in expression positions. Instead, they appear in particular positions that the typechecker knows about, and can therefore be "parsed" instead of expanded. This is what allows me to have the infix arrow. Fundamentally, it's not any different from having the type syntax be (type Number -> Number), where the `type' macro would be doing all the work.

Not quite...

The point was that naive let-introduction done by a macro in a type-oblivious way can have very surprising consequences. This is one of many ways to go wrong, and they collectively lead me to believe that in BitC it may be better for macros to operate over ASTs.

But in the end, since we are leaning towards the Dylan macro approach, this may all be moot.

Take a pass?

Using a third pass seems to be pretty much the way to do it anyway, as it keeps things reasonably modular. It's definitely the way to go if you have user-defined operators, and I say that having spent a fair while toying with ways of making libraries like Parsec handle the issue in one pass!
It turns out that there's a fair range of options beyond that, though.

I found Nils Anders Danielsson's recent talk at IFL interesting, the corresponding draft paper can be found here: http://www.cs.nott.ac.uk/~nad/publications/danielsson-norell-mixfix.pdf

Thanks, Philippa, I'll look

Thanks, Philippa, I'll look at that pointer. I confess that I'm less interested in theory at this point than I am in a working exemplar.

Syntactic overloading?

Can mixfix parsing support syntactic overloading of operators? For example, can it disambiguate between application and precedence-overriding via _(_) and (_), or negation versus subtraction in the case of -_ and _-_?

My impression is that this

My impression is that this can be handled using precedence in much the way that unary and binary "-" are handled conventionally. Whether it should be done is a different discussion. I'm provisionally inclined to believe that the answer is probably "no".
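
For what it's worth, the conventional handling, sketched in Haskell with made-up names: an operator seen in operand position is parsed as a prefix operator, and prefix "-" gets a tighter binding power than binary "-":

    data E = EAtom String | Neg E | Sub E E deriving Show

    prefixPrec, binaryPrec :: Int
    prefixPrec = 9   -- unary minus binds tightly
    binaryPrec = 1   -- binary minus binds loosely

    -- A "-" in operand position is negation.
    parseOperand :: [String] -> (E, [String])
    parseOperand ("-" : ts) =
      let (e, ts') = parseExpr prefixPrec ts in (Neg e, ts')
    parseOperand (t : ts) = (EAtom t, ts)
    parseOperand []       = error "unexpected end of input"

    -- A "-" in operator position is subtraction (left-associative).
    parseExpr :: Int -> [String] -> (E, [String])
    parseExpr minPrec ts0 = go (parseOperand ts0)
      where
        go (lhs, "-" : ts) | binaryPrec >= minPrec =
          let (rhs, ts') = parseExpr (binaryPrec + 1) ts
          in go (Sub lhs rhs, ts')
        go (lhs, ts) = (lhs, ts)

    -- parseExpr 0 (words "- a - b")
    --   ==> (Sub (Neg (EAtom "a")) (EAtom "b"), [])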

Maude again

Maude allows mixfix together with overloading, using OBJ-style order-sorted signatures to avoid certain kinds of ad-hocness.

Maude is fairly pleasant to use, but the syntax does have some nuisances, e.g. its handling of the dots that terminate clauses, which is more tiresome than Prolog's in that one must often insulate the dot with whitespace from the clause it terminates. A C-like syntax with such rigmarole around semicolons would exasperate many, I guess.