A subtle extension to Lisp-style macros

I've been in the process of teaching myself language and compiler/interpreter design. During this process, I have been studying existing languages as much as possible, to see what concepts I can borrow from them. One recurring theme I've seen in the discussions around here is that all things come back to Lisp. Specifically Lisp macros.

Now I can see how Lisp macros are more powerful than, say, C macros, because they execute a chunk of Lisp code at compile time in order to generate the substitution code (I think I've got that right). Hence the claim that one can create other mini-languages within Lisp. However, a Lisp macro call still takes the form of a function call syntactically. So even if other languages implemented Lisp-style macros, this still wouldn't eliminate the need for domain-specific preprocessors such as Oracle's Pro*C (for embedding SQL in a C program) or Qt's moc preprocessor for C++ code.

So here's how I've decided to structure macros in my experimental language (called 2e). Macro/preprocessor code is enclosed in a pair of symbols, "/% ... %/" (designed to somewhat resemble C comments). Whatever appears between those symbols is executed as 2e code by the preprocessor. That code has access to the whole language and libraries; it also sees the entire input program as a character array (or string object). The job of each code fragment is then to use normal string-manipulation techniques to modify the code however it wants. This includes ordinary preprocessor macro replacements, processing "include" directives, etc. There will be a set of functions defined to perform common operations, but one can also supply DSL-specific routines that in effect extend the language syntax in any arbitrary way.

For example, 2e doesn't currently have a CASE statement, but one can write a preprocessor script that adds that functionality: search for the pattern that defines your case statement and replace it with a call to a "case" function, converting the body of the case statement into anonymous functions that get passed as arguments. In fact, once this is implemented I can start removing some of the current syntax from the base kernel language and moving it into the preprocessor system.
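To make the mechanism concrete, here is a rough sketch of the driver loop, written in Python only because that makes it runnable (in 2e the block body would of course be executed by the 2e interpreter; preprocess and input_code_string are placeholder names, not settled design):

import re

BLOCK = re.compile(r"/%(.*?)%/", re.DOTALL)

def preprocess(source):
    # Pull out the first /% ... %/ block, execute its body, and let it
    # rewrite the rest of the program text via input_code_string.
    while (m := BLOCK.search(source)) is not None:
        env = {"input_code_string": source[:m.start()] + source[m.end():]}
        exec(m.group(1), env)   # in this sketch the body runs as Python
        source = env["input_code_string"]
    return source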

Now my question is: does this seem like the ultimate macro system? Or are there problems with it that I'm not seeing? (They'll probably show up once I start the implementation.)

Example:
/* Since the 2e language doesn't have a native "if" statement... */
/%
regex_replace(input_code_string, "if {\(.*\)} then {\(.*\)} else {\(.*\)}", "ifelse(@(\1), @(\2), @(\3))")
%/

The above would search through the input code, replacing any occurrence of "if {...} then {...} else {...}" with a call to the ifelse() function, in order to provide a more C-like "if" construct. (Of course, the regex would need further tuning to work properly; this is just a quick example.)
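For anyone who wants to try the idea out, the same replacement can be written in plain Python, assuming regex_replace behaves like re.sub:

import re

code = "if {x > 0} then {print(x)} else {print(-x)}"

# Non-greedy groups so one match doesn't swallow the rest of the line;
# @(...) is placeholder syntax for "wrap this text as an anonymous function".
print(re.sub(r"if \{(.*?)\} then \{(.*?)\} else \{(.*?)\}",
             r"ifelse(@(\1), @(\2), @(\3))",
             code))
# prints: ifelse(@(x > 0), @(print(x)), @(print(-x)))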

Other items, such as embedded SQL (as processed by Oracle Pro*C or the PostgreSQL equivalent), can be handled in a similar way.

Does any other mainstream language have a similar capability (that is, a preprocessor or parser that can execute code fragments that directly manipulate the input program's code stream)?


the other macro systems (or equivalents)

Other than Lisp and Scheme (whose syntax-case macros are entirely different from Lisp-style macros), there are a number of languages that provide the same kind of facilities. You may want to look at these at least.

Tcl
- it uses upvar and uplevel to get the same effect as macros (upvar and uplevel execute in a named stack frame, usually the caller's)

Forth
- CREATE/DOES>

Perl 6
- macros, which let you operate on unparsed input or on the AST

This thread has quite a bit of information on macro systems.

Thanks

I was looking for that thread (I thought I'd run across it before) but couldn't find it. I'll have to study it in detail.

macros in Forth

CREATE/DOES> isn't Forth's macro system; it's Forth's datatype definition system. IMMEDIATE is Forth's macro system, and it defines macros that are like a cross between Lisp reader macros and standard Lisp macros; the reason they're a hybrid is that Forth, unlike Lisp, doesn't need to be parsed.

I think if you were to build a "macro" system like the one you're describing, you'd likely want to build some restrictions into it. The #1 restriction I can think of is to only allow macros to process source that appears AFTER the macro definition.

One recurring theme I've

One recurring theme I've seen in the discussions around here is that all things come back to Lisp. Specifically Lisp macros.

Really? The vast majority of conversations don't even mention Lisp macros. Most that do are ones along the lines of this one. In a previous comment I referenced other earlier comments about the actual capabilities of macros. For you the capsule review is this: the expressive power of Lisp does not come from macros.

Your particular scheme is worse than most.
1) There is almost no benefit to your scheme versus using an external tool. The particular case you give could easily be handled with an invocation of sed.
2) You don't want to operate on strings. You want to get to a structured representation as quickly as possible and you want most processing to be over that structured representation. Of course, you can't entirely avoid processing strings if you want to support a different lexical syntax. Common Lisp reader macros, the proposed QuasiQuoting extensions to Template Haskell, Forth parsing words, and/or Camlp4 are things to look at in this regard.
3) In line with 2, your approach to macros composes very poorly. You also seem to completely lack an ability to modularize or localize them.
4) Also extending 2, you can't "simply search for the pattern." Recognizing the pattern will require parsing in the typical case, and it should further only be applied in certain contexts (and recognizing those contexts requires parsing; see the sketch after this list). In line with 3, since the "macros" have no way of knowing about each other's existence or what the others do, they will conflict with each other.
5) To really get something out of language-integrated macros, you want the implementation to provide hard-to-get or implementation-specific information: things like the active lexical environment, the type environment, representation information, and the results of any compile-time analyses. You'll want to be able to query details such as what methods are in a class given a class name, what class a method invocation is part of, etc. Ideally, you want to expose all the information the compiler has.
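To make point 4 concrete, here is the kind of failure naive string matching invites (Python standing in for your proposed regex facility):

import re

pattern = r"if \{(.*?)\} then \{(.*?)\} else \{(.*?)\}"

# Your "if" pattern, applied to code that merely *mentions* the syntax
# inside a string literal:
code = 'print("usage: if {c} then {a} else {b}")'
print(re.sub(pattern, r"ifelse(@(\1), @(\2), @(\3))", code))
# prints: print("usage: ifelse(@(c), @(a), @(b))")
# The documentation text has been rewritten into a bogus call, because
# nothing here actually parses the surrounding context.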

Basically, as it currently appears, using your "macro" scheme to do almost anything at all would be non-trivial and/or error-prone, and would create fragile, hard-to-understand code. It would have no benefit over writing an external tool (in 2e, if you like) that processes the code.

The benefits I see...

1) The main benefit I see to this method is that you don't have to use an external tool (this is one of the complaints I hear from people who use moc with the Qt toolkit, although I haven't dug into it myself). For example, a big chunk of the 2e interpreter source code is composed of a bunch of operator-handler functions that are mostly identical (the handlers for all the math operators, for example). What I ended up doing is writing a code generator in Unix shell script to generate these functions. I was thinking that this type of preprocessor scripting system would help in that kind of situation. Of course, if I had written 2e in C++, I probably could have used templates.

2 & 4) I agree with the issue of operating on strings; this is why I want to explore more robust solutions (I'll check out the languages you mentioned). My example was a bit poorly stated, though. If I took this approach, I would have to provide either canned functions that know how to parse 2e code, or my own regular-expression engine with additional symbols that match specific elements of 2e code -- for example, instead of a generic wildcard match, I'd add a symbol to match "any valid (sub)expression", with a flag to make it greedy or non-greedy (see the sketch after this list). Would having a smarter regex function help alleviate some of the issues of operating directly on strings? The methods that operate directly on ASTs wouldn't work with my current interpreter, since I parse the source directly into a postfix form (using the shunting-yard algorithm) that then gets run by a postfix stack machine.
3) In addition to passing a string to the preprocessor code pieces, a couple of parameters will also be passed indicating where in the code the particular macro starts and ends. Combined with additional regex matching symbols (such as matching the beginning or end of a function), this gives me some additional locality.
5) If you are referring to introspection (I think that's the right term), then I already have that available at runtime -- that is, you can call a built-in function that returns the object matching a given string or pattern. There are other calls that provide similar information.
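Regarding 2 & 4: a "match one balanced {...} block" primitive can't be expressed as an ordinary regular expression at all, since regexes can't count nesting depth. Here's a Python sketch of what such a 2e built-in would have to do (match_braced is a hypothetical name):

def match_braced(text, start):
    # Return (body, end_index) for the balanced {...} block starting at
    # text[start], or None if the braces don't balance.
    if start >= len(text) or text[start] != "{":
        return None
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i], i + 1
    return None

print(match_braced("{a {b} c} tail", 0))   # prints: ('a {b} c', 9)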

The primary goal, however, was to keep the kernel language as small as possible (it currently supports simple infix expressions, function calls, function definitions, and array dereferencing) and move any syntactic sugar and higher-level functionality into a different layer (i.e., I plan to include a default preprocessor code block that implements some of the documented functionality). For example, if I were to include syntax similar to C's preprocessor, I would have a preprocessor block that looks for any "^#include filename" pattern and replaces it with "/% include(filename) %/", which in turn gets replaced with the contents of that file.
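As a sketch of that two-stage expansion (Python again; include() is a stand-in for the eventual 2e library call, and the exact directive syntax is still up in the air):

import re
from pathlib import Path

def expand_includes(source):
    # Stage 1: rewrite the C-style directive into a preprocessor block.
    source = re.sub(r'^#include (\S+)$', r'/% include("\1") %/',
                    source, flags=re.MULTILINE)
    # Stage 2: the block is then replaced by the named file's contents.
    return re.sub(r'/% include\("([^"]+)"\) %/',
                  lambda m: Path(m.group(1)).read_text(),
                  source)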

Do you think I'd be better off not describing this as "macro" functionality, and instead documenting it as an "extensible preprocessor"?

As always, thanks for the insightful input.

The point of the comment(s)

The point of the comment(s) I indirectly linked to is that you can't "keep the kernel language as [simple] as possible" and add "higher level functionality" with macros, at least not in a user-extensible way. What you will actually have done is provide an intermediate language and a compiler to it from some higher-level language. If the point of this is just to structure your compiler that way, then this is fine. There are a couple of scenarios for the user.

1) If the user wants to add syntactic sugar to the high-level language, then this is readily and easily doable (or should be), but the user won't benefit from or even see the "simple kernel language." Such changes can be combined.

2) If the user wants to add "high level functionality" on top of your high-level language, this too should be easy enough, but again they don't benefit from there being a "simple kernel language," and now such changes can't be combined. The user has written a compiler from one language to your high-level language.

3) This is similar to 2 but targeting the low-level language; however, in this case the inability to combine such changes applies to your compiler too. That is, the user will have created an independent language, not an extension to your language.

In cases 2 and 3 the user probably will not be able to reuse most of the libraries written in your language, at least not as they'd like to, unless some care (and compromise) is taken by them (and by you, for that matter). In effect, your built-up high-level language is foreign code to them. One analogy for this approach is Java and the JVM.

The best approach is to make as much as possible plain "syntactic sugar." That is, make the language as expressive as possible.

Lisp macros with a syntax

Therefore it is claimed that one can create other mini-languages within Lisp. However, a Lisp macro call still takes the form of a function call syntactically. So even if other languages implemented Lisp-style macros, this still wouldn't eliminate the need for domain-specific preprocessors such as Oracle's Pro*C (for embedding SQL in a C program) or Qt's moc preprocessor for C++ code.

Check out this approach: http://www.meta-alternative.net/calc.pdf

It demonstrates how to embed a language with a different syntax into an S-expression-based language (i.e., Lisp), using macros and regular parsing features.

MBase itself contains a number of embedded languages implemented in a similar way, including a language for defining ASTs (N.B., symbols there can contain complex syntax), a pattern-matching language (again, symbols are decomposed since they contain syntax), and some more interesting things.

And, of course, you can replace your parser completely. To some extent you can do this in Common Lisp with reader macros, or you can implement your own "S-expressions+Something" parser and read your program through it. Lisp macros are absolutely flexible; there's nothing you can't do within this approach.

also, your LtU node on this

Do you mean...

...roughly like Perl5 source filters? (See Filter::Simple and Filter::Util::Call.)

Don't forget to expose a way of dumping the final generated code.

where do you want to see yourself improve the most...?

...as a language author?

Now my question is: does this seem like the ultimate macro system? Or are there problems with it that I'm not seeing?

Example:
/* Since the 2e language doesn't have a native "if" statement... */
/%
regex_replace(input_code_string, "if {\(.*\)} then {\(.*\)} else {\(.*\)}", "ifelse(@(\1), @(\2), @(\3))")
%/

"Ultimate" depends on your goals. Personally, I can't read what you wrote there. I don't need something that complex. Humble yourself by reading the original source code for the Bourne shell. :) My biggest concern is that I can't infer the complexity of the macro operator. Operator complexity is not just an implementation detail, but a client-facing design decision. How easy is it to do generic programming in your macro system?

As a logician, you probably don't need a practical example and your crazy one will suffice. As a designer, you'll want to reconsider your pursuit. Design is not simply subtracting away the weaknesses you see in one system in order to make a better system. This applies to all kinds of design, not just languages. See, for example, Design Paradigms by Henry Petroski, which covers this challenge in civil-engineering projects. A language-design example of this is C# operator overloading vs. C++ operator overloading. In C++, an overloaded operator can be implemented as a static method, an instance method, or a virtual method. In C#, it can only be a static method, which covers most practical scenarios. In fact, you'll find it's features these languages both lack (multi-methods) that force kludge solutions to otherwise straightforward problems (Meyers spends 26 pages of Effective C++ explaining how to properly simulate double dispatch in C++, and doesn't even get to Line Item 51: How not to hang yourself writing C++ to do multi-methods).

To add to what Elkins said:

I view staging as a generalization of macro operators. The fundamentally hard challenge in macro-system design, as I understand it from reading Walid Taha's papers on multi-stage programming, is a static typing system where the DSL knows nothing about its parent environment. In other words, you have to deliberately "grow the language" and cannot mix ad-hoc DSL code with general-purpose code. As I see it, such a design challenge is similar to the partial-classes versus aspect-weaving argument. Partial-class advocates argue that the library author should determine what hooks can be defined. Aspect-weaving advocates think existing code should be re-editable, including third-party libraries.

also, taken from http://lang2e.sourceforge.net/:

The design of the 2e language had several goals. One of those goals was to see how useful it would be to have an iterative form of the C inline conditional.

This sounds like Icon, a language largely created to demonstrate the power of generators. Its predecessor, SNOBOL4, was actually used for the code snippets in Vol. 1 of The Theory of Parsing, Translation, and Compiling by Aho and Ullman (a precursor to the Dragon Book). If so, you may want to prototype your language in Common Lisp using cl-cont.