Extending Syntax from Within a Language

Forgive me for pointing to my own blog, but I wrote a post that I think might be of interest here. My little language is using Pratt parsing, a dollop of metaprogramming, and incremental parsing to let you extend the grammar from within the language itself, even within the same file. Pratt parsing doesn't seem widely known, and few languages I know of let you do this, so I thought it might be noteworthy.
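To make the idea concrete, here is a minimal sketch in Python (not the language from the post) of a Pratt parser whose infix table can be extended while a program is being read. The `infix <symbol> <power> <left|right> <builtin>` directive, the `OPS` table and the `run` helper are all assumptions invented for this illustration. It evaluates integer expressions directly instead of building a syntax tree.

```python
import operator
import re

# Hypothetical set of builtins that an `infix` directive may refer to.
OPS = {"add": operator.add, "mul": operator.mul, "pow": operator.pow}

class PrattParser:
    def __init__(self):
        # operator text -> (binding power, right associative?, function)
        self.infix = {}

    def add_infix(self, symbol, power, fn, right_assoc=False):
        # Extending the grammar is only a table update, so it can happen
        # between any two lines of the same source text.
        self.infix[symbol] = (power, right_assoc, fn)

    def parse(self, text):
        self.tokens = re.findall(r"\d+|[()]|[^\s\d()]+", text)
        self.pos = 0
        value = self.expression(0)
        if self.pos != len(self.tokens):
            raise SyntaxError(f"unexpected token {self.tokens[self.pos]!r}")
        return value

    def advance(self):
        if self.pos >= len(self.tokens):
            raise SyntaxError("unexpected end of input")
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def expression(self, min_power):
        token = self.advance()
        if token == "(":
            left = self.expression(0)
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
        elif token.isdigit():
            left = int(token)
        else:
            raise SyntaxError(f"unexpected token {token!r}")
        # Absorb infix operators that bind at least as tightly as min_power.
        while self.pos < len(self.tokens):
            entry = self.infix.get(self.tokens[self.pos])
            if entry is None or entry[0] < min_power:
                break
            power, right_assoc, fn = entry
            self.pos += 1
            right = self.expression(power if right_assoc else power + 1)
            left = fn(left, right)
        return left

def run(source):
    parser = PrattParser()
    results = []
    for line in source.strip().splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "infix":
            _, symbol, power, assoc, name = words
            parser.add_infix(symbol, int(power), OPS[name], assoc == "right")
        else:
            results.append(parser.parse(line))
    return results

program = """
infix + 10 left add
infix * 20 left mul
1 + 2 * 3
infix ** 30 right pow
2 ** 3 ** 2
2 * 3 ** 2
"""
print(run(program))  # [7, 512, 18]
```

Lines that use `**` before its `infix` directive has been read are rejected with a `SyntaxError`, which is the sense in which the grammar grows as the file is read.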

If you have any feedback on it, that would be better than awesome. I'd hate to find out that I'm painting myself into a design corner, and if anyone can keep me from doing that, it's the community here.

Nemerle?

Have you heard about Nemerle? It has a powerful syntax macro facility. Moreover, the core of the language is pretty small and most of it is written using macros, even such basic constructs as 'if/else', boolean operators ('&&', '||'), etc.

Yup

I have. Nemerle and Nimrod (as well as Lisp, of course) were big inspirations for this.

Well, you know, what

Well, you know, what actually happens with Nemerle - at least what I can see - is that people who are involved in the language development are really obsessed with macros. And there is nothing surprising here - the life of a language developer becomes so much easier thanks to them.

Let's take some mainstream language as an example. Say, C#. New anonymous functions in 2.0? Syntax hardcoded into the language. Linq in 3.0? Syntax (and a lot of new syntax) hardcoded into the language. Async upcoming in 5.0? Again new syntax hardcoded into the language.

In Nemerle all such things are implemented just like ordinary libraries. No changes to the compiler, no regression tests. And a lot of people can work on different features at the same time with absolutely no problems.

But that's one side of the coin. When we look at potential language users, it appears that they are not very impressed with this cool language facility that allows you to create new syntax. Why? And why do you need it really? How often do you actually need to create new syntax for your tasks? I can hardly think of a few cases and even in these few cases I would be totally satisfied with just custom operators. But Nemerle has so many more possibilities here.

Do you need custom syntax for DSLs? But again how often do you need *internal* DSLs? And why do you need to create custom syntax for them?

So it looks like this syntax extension feature is a great thing for those who create languages or huge frameworks (like web frameworks) and so on. For an ordinary user it gives almost nothing.

So basically whether extendable syntax is a well placed or misplaced feature heavily depends on what the language is designed for. Nemerle is seen by its current maintainers as a "mainstream capable" language and I really doubt that it is needed for mainstream. For other areas - who knows...

Having anonymous methods, or

Having anonymous methods, or list comprehensions, or async etc. implemented via libraries rather than the compiler has substantial downsides, though: there can be (and probably will be) competing implementations with subtle differences, incompatible with one another; third-party libraries won't risk using these features, or will ship with their own version of the feature, to avoid taking a dependency they can't be sure will exist or can coexist in the target audience's install base. There's a kind of inverse Metcalfe's Law: the more pieces you need to pull together to get your bit of code to work, the less valuable it is.

Tooling also suffers. Getting good debugging support for relatively dramatic transformations, such as those needed to implement closures or coroutines, adds another thorny layer on top of merely getting them working. And that's not to get into editor code completion, documentation utilities understanding the new kinds of types, etc.

These are problems aspirationally "mainstream capable" languages need to handle well. I'm not optimistic that such syntactically configurable languages can ever be a huge success because of them.

Hmmm, I see your point, but

Hmmm, I see your point, but Lisp is still with us after many decades, with primary syntax such as defun often implemented as macros. Macro-rich Racket/PLT Scheme seems to be progressing nicely. I recall reading one piece by the Scala folk touting its syntax definition capabilities via operator definition, lazy parameters, curried function definitions, composable treatment of its pattern matching construct, etc. I'm sure other folks could think of additional or better counter-examples.

I think the issue is a bit more nuanced.

- Scott

Well, Lisp, Racket etc. are

Well, Lisp, Racket etc. are only mainstream for small values of mainstream.

You know, I think Lisps are

You know, I think Lisps are different. They don't really have syntax extension facility. Because in order to have it one should first have syntax :))

I am not trying to say that extendable syntax is not needed. No. I just doubt that a "general purpose" extendable syntax like the one we have in Nemerle is really useful for a broad set of everyday tasks. If you focus on a more specific purpose, such as building DSLs, you might come up with a more "friendly" product.

I think Nemerle is a very good example to learn from. This is a language with a) syntax b) mainstream C style syntax, but its macros don't really appeal to a lot of programmers.

I don't know

You know, I think Lisps are different. They don't really have syntax extension facility. Because in order to have it one should first have syntax :))

I don't know; some Racket languages look to me like they have a lot of syntax (Scribble, for example)

I am not sure it is a real

I am not sure it is a real downside. For example everybody - theoretically - can implement a GUI framework for .NET. You don't need macros for that. But I don't see a lot of third party implementations though.

Also I think that some of the core libraries may be simply included in the language standard.

But in reality we can only guess. It is hard to say what will happen with a widespread language with an extendable syntax because there is no such language. And yes, there might be a reason for that.

BTW, Nemerle has relatively good integration with Visual Studio. Have you seen it? I think it works pretty well, including IntelliSense. Moreover, when you create a new syntax macro, IntelliSense is automatically available for it.

But I don't see a lot of

But I don't see a lot of third party implementations though.

I am here at work, struggling with that.

Do you need custom syntax

Do you need custom syntax for DSLs? But again how often do you need *internal* DSLs? And why do you need to create custom syntax for them?

And why do you need awful IDE performance on top of that?

You and Barry both bring up great points, but both seem to be missing an exciting point: the Holy Grail is to do something close to what Nemerle and Racket are doing, but to also co-evolve the language with a very robust IDE.

Personally, the fact that Nemerle's macros are limited to ASCII reduces my excitement about the language. The visual languages I have worked with in the past have allowed me to use relational calculus to delete, update, and insert ASCII-like nodes. For example, for a Ruby on Rails-like system, all the things like ActiveRecord and DataMapper are dependencies I should be injecting into my system, and they therefore behave like macro libraries, except that they should be much richer, so that I can automatically tab through the UI presentation as if it were a spreadsheet or structured editor.

The upshot of using a relational calculus is that it becomes very easy to write unit tests such as "every attribute ending in _pk or _id should be a primary key in the persistent data store", and so on, visually.

Your macros aren't hygienic, though

You should probably take a look at Scheme (as distinct from Lisp) macros. I don't know if you have gensym or not, but there's no evidence that your macros are hygienic.

already commented elsewhere

I already raised that point in the reddit comments, where there has been some discussion of the issue.

(I would rather have posted my comment on the blog directly, but apparently it doesn't support comments and suggests that users comment on reddit instead.)

Correct

That's right. Over on the Java side, there's support for generating a unique symbol, but the Magpie end doesn't have that yet, nor does it have any other support for hygiene. That's definitely on my list of things to fix with the current system.

Gensyms aren't enough

Gensyms, if carefully used a la Common Lisp, prevent names bound in the macro definition from affecting the arguments of the macro call, but they do not and cannot prevent names bound in the macro call's environment from affecting names that are free in the macro definition.
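The first half of that statement can be sketched in Python, used here only as an analogy: a "macro" is a function that returns source text, and the text is then run in the caller's environment, much like a defmacro expansion. The helper names (`gensym`, `swap_naive`, `swap_gensym`, `expand_and_run`) are hypothetical, and the generated names are simply assumed not to collide with anything the caller writes.

```python
import itertools

_counter = itertools.count()

def gensym(prefix="g"):
    # Fresh name for each call; assumed never to be written by a caller.
    return f"__{prefix}{next(_counter)}"

def swap_naive(a, b):
    # The temporary is always called "temp", so it can capture an argument.
    return f"temp = {a}\n{a} = {b}\n{b} = temp"

def swap_gensym(a, b):
    # Same template, but the temporary gets a fresh name per expansion.
    t = gensym("temp")
    return f"{t} = {a}\n{a} = {b}\n{b} = {t}"

def expand_and_run(expansion, env):
    # Run the expanded text with `env` playing the caller's local scope.
    exec(expansion, {}, env)
    return env

ok = expand_and_run(swap_naive("x", "y"), {"x": 1, "y": 2})
print(ok["x"], ok["y"])            # 2 1

broken = expand_and_run(swap_naive("temp", "y"), {"temp": 1, "y": 2})
print(broken["temp"], broken["y"])  # 2 2  (the caller's temp was captured)

fixed = expand_and_run(swap_gensym("temp", "y"), {"temp": 1, "y": 2})
print(fixed["temp"], fixed["y"])    # 2 1
```

The fresh name only protects identifiers that the macro itself binds. It does nothing for identifiers the expansion merely refers to, which is the second half of the statement.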

I can see that, but in

I can see that, but in practice is that a problem or a solution? Do people write useful macros that rely on expanding to use a variable in their calling environment, or do they inadvertently do this and curse themselves when it causes issues?

The latter

As far as I know, nobody deliberately writes hygiene-breaking macros of this type (whereas the other type, which export names to their bodies, are not uncommon). However, I am not a serious user of deliberate hygiene-breaking.

Unintentional violations in pre-hygienic Scheme are another matter. For example, it's not uncommon to bind the name "list" to some list, which makes the standard function "list" unavailable within the lexical scope. A non-hygienic macro called in such a context will screw up if it tries to call the "list" function, whereas a hygienic one uses the binding of "list" available in the definition environment, normally the top-level environment.
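The "list" scenario can be imitated in Python as a rough analogy. The unhygienic version returns source text whose free name `list` is resolved wherever the text is evaluated. The hygienic-style version renames that free identifier to an alias bound to the value it had where the macro was defined. The names `make_list_unhygienic`, `make_list_hygienic`, `evaluate` and the alias `__def_list` are hypothetical, and real hygienic expanders work differently in detail.

```python
def make_list_unhygienic(a, b):
    # The expansion mentions "list" by name; whoever evaluates the text
    # decides what that name means.
    return f"list(({a}, {b}))"

def make_list_hygienic(a, b):
    # Rename the macro's own free identifier and carry along the binding
    # it had in the definition environment.
    alias = "__def_list"
    return f"{alias}(({a}, {b}))", {alias: list}

def evaluate(source, caller_env, definition_env=None):
    # Caller bindings are the locals; definition-time bindings sit behind
    # them as globals, so caller locals cannot shadow the alias by accident.
    return eval(source, dict(definition_env or {}), caller_env)

# The caller has bound the name "list" to some list, as in the example above.
caller = {"list": [1, 2, 3], "p": 1, "q": 2}

try:
    evaluate(make_list_unhygienic("p", "q"), caller)
except TypeError as err:
    print("unhygienic expansion failed:", err)

source, definition_env = make_list_hygienic("p", "q")
print(evaluate(source, caller, definition_env))  # [1, 2]
```

In a caller that has not rebound `list`, both versions give the same result, which is why this kind of capture tends to surface late.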

CL hacks around the problem by not allowing redefinitions of standard identifiers, and by having separate namespaces for variables and functions. This quietly prevents many hygiene violations, but means that they bite even more deeply when they do arise. Lack of hygiene is the macro equivalent of dynamic binding: it means that what an identifier is bound to depends on history rather than being lexically apparent.

I'm told that macros of this type are used to communicate between top-level and nested macros using local macrolets.

CL packages

CL hacks around the problem by not allowing redefinitions of standard identifiers, and by having separate namespaces for variables and functions. — And by using packages to carve up identifier space. Will Clinger, who proposed the mechanism, has said that he thought CL needed this mechanism in the absence of hygiene.

Cf. Peter Seibel's chapter on Programming in the Large: Packages and Symbols, from PCL. There's probably a wonderful resource that I don't know of that explains all the design issues in CLtL and how the choices were made. It probably involves spending time with Guy Steele.

"Nobody"

As far as I know, nobody deliberately writes hygiene-breaking macros of this type...

I'm not sure. It seems to me that Let Over Lambda is basically one long paean to precisely that type of hygiene-breaking. I'm not sure how widespread it really is among professional Lisp programmers, or whether Hoyte is inside or outside the macrological mainstream. There seem to be conflicting schools of thought on this issue, even among Lisp programmers (I mean, those who did not jump to Scheme already for this reason).

I also second Charles's comments on packages. I never really realized the importance of packages for hygiene until recently.

Caveat: I am not, nor have I ever been, a real Lisp programmer.

Hygiene is a no-brainer, IMHO

I think any new language with macros should support hygiene.

As an example, Andre van Tonder's SRFI 72 macro system lets you write the simple swap macro more succinctly than Common Lisp does (because there's no need for gensym), and it's completely safe:

(define-syntax (swap! a b)
  #`(let ((temp ,a)) 
      (set! ,a ,b) 
      (set! ,b temp)))

Given that we have this and similar technologies (e.g. syntax-case), I think there's simply no point in going the unhygienic way. A language without hygiene makes programmers work more, and furthermore, some macros cannot be written safely without language support.

One problem is that the literature about hygiene is very much written by insiders for insiders, and Scheme-centric. Back-to-basics descriptions of hygiene are rare.

furthermore, some macros

furthermore, some macros cannot be written safely without language support.

I'm surprised by this claim. Could you provide examples of such macros?

... After thinking a bit more about it, it appears to me that your `swap` macro is precisely such an example: you need to save `a`'s value into a temporary, so the set! must be under a binder, yet due to the definition of set! the semantics will change if either ,a or ,b is the same name as the temporary.

I think this is due to a misbehavior of `set!`, which actually depends on the *name* passed as argument, not on the value. Therefore `(set! x v)` and `((lambda (var) (set! var v)) x)` don't have the same meaning. The first parameter of set! is not handled in a first-class way (it cannot be abstracted).

I think this is a design flaw: the `swap` example demonstrates that encouraging the use of non-first-class lvalues in a language makes it less robust wrt. metaprogramming, rather than demonstrating the necessity of hygiene or of language support for fresh name generation.

I think it is important for a language to allow any desired macro to be written safely, because what you do with macros is also what the programmer does when considering, rewriting and reasoning about code locally. If you need language support for generating fresh names (be it explicit or implicit) in a given macro, this means that a programmer doing similar code rewriting by hand will need to think globally, for generating a fresh name is a global, rather than local, operation.

Could you provide examples

Could you provide examples of such macros?

I'll have to think about more interesting ones, but every macro that references a binding is unsafe wrt rebinding by the macro caller. [Edit: Modulo further mechanisms, such as Common Lisp's package system.]

Say we have a FOO macro that expands to the global variable X:

(defvar x 12)
(defmacro foo ()
  `x)

This macro cannot be used safely, because X may always be rebound by the macro caller:

(let ((x 13))
  (foo))
==> 13

This obviously cannot be prevented using gensym, and requires a language with support for hygiene, where the macro caller's local X does not interfere with the binding intended by the macro writer.

I think this is a design flaw: the `swap` example demonstrates that encouraging the use of non-first-class lvalues in a language makes it less robust wrt. metaprogramming

That's an interesting point. Which languages do support such first-class names though?

That's an interesting point.

That's an interesting point. Which languages do support such first-class names though?

Instead of supporting first-class names, you could deprecate the use of names as lvalues in your language. In ML languages for example, the basic device for mutable locations is the reference, which is a (first-class) value. When you write `x := 2` in SML or OCaml, you don't refer to the name `x`, but to the value (which is a reference) denoted by `x`. Therefore `(fun y -> y := 2) x` (in OCaml syntax) has the exact same semantics, and you can write a safe swap macro (where `!x` reads the value in reference `x`):

swap! a b =
  #`(let va = ,a and vb = ,b in
     let temp = !va in
     va := !vb;
     vb := temp)
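The same point can be sketched in Python, with a hypothetical `Ref` class standing in for an ML-style reference. Because the mutable location is an ordinary value, `swap` can be a plain function, and its local `temp` cannot clash with whatever the caller names its own variables.

```python
class Ref:
    """A mutable cell, standing in for an ML-style `ref`."""

    def __init__(self, value):
        self.value = value

def swap(a, b):
    # a and b are Ref values that the caller has already evaluated, so this
    # function never sees the caller's variable names.
    temp = a.value
    a.value = b.value
    b.value = temp

# The caller is free to call one of its own variables "temp".
temp = Ref(1)
y = Ref(2)
swap(temp, y)
print(temp.value, y.value)  # 2 1
```

Swapping a reference with itself leaves it unchanged, since both arguments denote the same cell.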

All is not well however, as OCaml has a concept of non-first-class mutable location, in the form of mutable record fields. They are different from references and cannot be abstracted: you couldn't factor

  let temp = a.foo in
  a.foo <- b.bar;
  b.bar <- temp

as `swap-fields a.foo b.bar`. SML does not have that issue as its record selectors are first class (they are functions), and it has no concept of mutable field -- added in OCaml for optimization of the runtime representation -- and uses fields containing references instead, so the mutable location is again a first-class value.

Say we have a FOO macro that expands to the global variable X

I usually think of macros as code generation tools. What you're asking here is beyond the scope of code generation, as you actually ask for something which is impossible in the source programming language: there is no way in general to designate a precise name that would work whatever the current context is. Usually, when the programmer wishes to be sure that no name shadowing occurred, she uses an "absolute path" name using the standard module/namespace of the denoted value, but even those may -- in most languages -- have been shadowed, though this is not a common practice; that's basically Charles Stewart's comment about Common Lisp packages.

I can very well understand that some people want that kind of power in their macro system, particularly if they see a macro system as a tool that should be extended to fit any desire of syntactic nature, rather than a mere code generation tool. I'm wondering however whether the macro system is the right place to put such features. Maybe what's needed here is, at the language level, a non-ambiguous, non-shadowable way of denoting a specific binding? In an IDE we could well imagine a simple arrow pointing to your desired global variable.

Precise names do exist

you actually ask for something which is impossible in the source programming language: there is no way in general to designate a precise name that would work whatever the current context is

I don't think so. Given a unit of, say, R5RS Scheme (which doesn't have a module system, see below) it is always statically apparent to the programmer what binding an identifier refers to. (That is, unless there are unhygienic and/or hygiene-breaking macros in there. Which aren't in R5RS, but anyway.) If you want to refer to a global binding, you'll just need to check that none of your local bindings shadow it. Since you have control of that unit, this is not a problem, and you do get "precise names".

Usually, when the programmer wishes to be sure that no name shadowing occurred, she uses an "absolute path" name using the standard module/namespace of the denoted value

This is only needed, afaics, in languages with module systems with unqualified import statements (e.g. "import all bindings from module foo"). If all your imports from other modules are qualified (e.g. "import foo.bar as bar and foo.baz as baz"), then we have the same situation as above again: it's statically apparent to the programmer, from the local bindings and the import statements, what binding an identifier refers to.

Update:

I'm wondering however whether the macro system is the right place to put such features.

I think it's beautiful how hygienic macros extend this local, static reasoning to nonlocal code generation tasks.

re Gensyms aren't enough

Gensyms, if carefully used a la Common Lisp, prevent names bound in the macro definition from affecting the arguments of the macro call, but they do not and cannot prevent names bound in the macro call's environment from affecting names that are free in the macro definition.

A simple way to see that that claim is false is to do the thought experiment: "Given a non-hygienic defmacro plus gensym and so forth, can I implement a hygienic macro system?" (The answer is yes and it has been done.) Does that mean that implementing a full hygienic macro system is the only way to use defmacro + gensym to avoid the kind of capture you are talking about? (No.) It is quite reasonable to regard the difference between defmacro+gensym and hygienic macros as a difference in power (the former is more expressive unless you add hygiene breaking features back into your hygienic macros), a difference of emphasis (hygienic macros simplify and beautify many of the (most?) commonly desirable patterns), and, depending on the details, differences in implementation cost (albeit with neither absolutely cheaper than the other).

You also later speculate that nobody deliberately writes non-hygienic macros in which free variables in the macro body are intended to be captured by the macro application's lexical context. It just ain't so. Sometimes I want my caller's CAR or FOOBAR or SELF without having to require my caller to explicitly pass it. I would agree, at least as far as my experience goes, that it is not too common in lisp-family languages to use that trick but I've done it myself and there are times when it is just the right thing. Of course, outside the lisp family (e.g., in C), it is a common and often fruitful practice.

Hygiene for the Unhygienic

"Given a non-hygienic defmacro plus gensym and so forth, can I implement a hygienic macro system?" (The answer is yes and it has been done.)

One example of this is Costanza and D'Hondt's Hygiene for the Unhygienic. Their system requires the following additional ingredients:

- List and symbol macros.

- Local macros, which are affected by surrounding local macros.

- Macro expansion functions which operate on local macro environments

gensym != hygienic

These are two separate issues.