The Value Of Syntax?

There seems to be a recurring idea that languages of minimal syntax are difficult for humans to read. Stallman's quote about Lisp looking like oatmeal with fingernail clippings mixed in is an example.

(Edit: Hey! You can edit these things! Who knew?!) As several people have pointed out below, it was not Stallman who said that; rather, it was Larry Wall. I find this amusing, because Perl looks like line noise to me; and yes, I am old enough to remember what line noise looks like. (end edit)

And, actually, I can understand it. In Lisp dialects with no reserved words (such as earlier versions of Scheme), you cannot make any assumptions at all about any subexpression starting with a symbol whose binding you do not know. Because of the parens, you can tell where its influence ends. But if you don't know the binding, you can't even guess from the fully parenthesized prefix syntax whether it's first-class (a procedure) or first-order (a macro). So not even the AST structure defined by the parens is necessarily as it appears within such a subexpression. Even more pernicious are symbols whose bindings you *think* you know -- but which may be shadowed by bindings imported from a module which you probably trust not to do anything "insane."
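To make the point concrete, here is a minimal Scheme sketch; the names maybe, ready?, and launch! are invented for illustration. At the call site you cannot tell whether the second operand is evaluated at all without knowing how the head symbol is bound.

;; Hypothetical binding `maybe` -- illustration only.
(define (ready?) #f)
(define (launch!) (display "launched!") (newline))

;; Alternative 1: an ordinary procedure. Both operand expressions are
;; evaluated before the call, so (launch!) fires even though (ready?) is false.
(define (maybe test action) (if test action #f))

;; Alternative 2: a macro. The expansion controls evaluation, so (launch!)
;; would run only when (ready?) returned true.
;; (define-syntax maybe
;;   (syntax-rules ()
;;     ((_ test action) (if test action #f))))

;; Same text, different behavior, depending on the binding of maybe:
(maybe (ready?) (launch!))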

I've heard the same argument against FORTH (which has even less syntax than Lisp), and about operators in C++ and the like that don't carry the restrictions they "ought to," because you have to know whether they're overloaded and, if so, how.

I think that it comes down to syntax -- known tokens and relationships between them with truly invariant meanings -- giving people "traction" to reason about code. A little syntax seems to have a significant effect on people's cognitive experience of a language and provides context enabling them to read unfamiliar code, or at least to rule out possible readings inconsistent with the invariant syntax.

But how much does it help? What kinds of syntax provide the most clarity to the most people? Has anybody done comparative studies of syntactically different languages with near-identical semantics and measured programmer effectiveness? Is there a shred of evidence, in other words, or is this effect merely anecdotal and speculative?

Ray Dillinger


Cognitive Dimensions of Notation

You might look into Thomas Green's 'Cognitive Dimensions of Notation' and related documents.

There is certainly an aspect of local reasoning involved, i.e. what can we say about the context after observing a chunk of syntax?

Lisp syntax

For S-expression Lisp in its pure form, syntax has no behavioral meaning at all; it's exclusively a representation of data structures, and behavioral meaning only arises because sometimes one applies an evaluation algorithm to those data structures. This approach seemed to me to help students understand how Scheme worked; and the point of syntax is, presumably, to help the programmer understand what the program is telling the computer to do. (I actually touched on this recently when blogging on syntax.)
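A minimal Scheme illustration of that separation, assuming an R5RS-style eval that accepts (interaction-environment): the S-expression is just a data structure until an evaluation algorithm is applied to it.

;; The same S-expression, first as pure data, then handed to the evaluator.
(define expr '(+ 1 2))                  ; a three-element list: the symbol + and two numbers
(car expr)                              ; => +   (just inspecting the data structure)
(length expr)                           ; => 3
(eval expr (interaction-environment))   ; => 3   (behavioral meaning arises only here)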

There's also another way of thinking about the value of the behavior-independent syntax of Lisp. As I think I've remarked here before, I favor extensible syntax of a sort, but believe it should have no effect on the structure of parse trees: a human reader should readily discern the entire parse-tree structure of syntax before knowing anything at all about whatever syntax extensions might be in effect. (That was more-or-less the conclusion of my techreport on extensible parsing.) Core S-expression Lisp syntax, whatever its shortcomings, is one way to do that.

Parse before syntax?

I suppose in a strict, effectful language it makes sense to say that macros are a form of 'extensible syntax', but I tend to favor languages where that role (of behavior-pattern abstraction) is fulfilled by pure functions.

My working definition of 'extensible syntax' involves a developer's ability to influence the parser and parse-tree structure. (I assume the parser subsumes any need for a lexer.) I wouldn't usually count Lisp macros (except for reader macros, which are almost never used). Expressive DSLs will often have their own parse structure.

Rather than extensible syntax, I now favor a syntax-per-module design. Modularity has a lot to do with this decision.

Unfair of me to allude to

Unfair of me to allude to "extensible syntax of a sort" without explaining what sort of beast I'm picturing. If one has always thought of syntax extension in terms of extending the set of possible parse trees, it may not be clear how one can do something that qualifies as syntax extension while preserving what my techreport called cumulativity.

The basic technique is simple: start with an unambiguous CFG that defines the set of possible parse trees, and then treat individual grammar rules within that structure just the way most programming languages treat identifiers — as structures that may or may not be bound. This supports the syntax-represents-data approach of Lisp, preventing syntax from being entangled with behavioral issues.

The arrangement of this sort I particularly have in mind to try out, to see how well it works in practice, distinguishes two scales of syntax. Larger-scale syntax (spanning more than one logical line, but within a single source file) would use linebreaks, indentation, and demarcational pseudo-statements (generalizing begin/end). Small-scale syntax, i.e., within a logical line, would be almost-fully-parenthesized; each expression would be generated by a grammar rule whose right-hand side alternates between keywords (terminals) and subexpressions (nonterminals). The pattern of keywords and subexpressions would be the bindable syntactic structure. Some reasonably simple scanning rules can provide, I think, a readable and quite versatile notation, in which the first two tokens of an expression always suffice to determine whether the keywords are the odd elements of the expression, or the even elements. (Anyway, that's the gist of it, hopefully giving a sense of the sort of thing I mean.)

Invariants in Grammar Definition

I see: you're constraining how grammars are extended such that developers at least have some familiar meta-structure.

Your proposed model (alternating keywords and actions) is powerful enough for most of what people need (e.g. developer preference of 'Action if Condition else Action' vs. 'if Condition then Action else Action', support for both postfix and prefix operators, etc.).

I suspect it would run into hiccups for other tasks, though, such as capturing regular expressions, compact APL-like collection-processing functions, or other compact DSLs. It would be 'interesting' to support analog literals. And I would sort of like to support interactive fiction, e.g. via program text close to Inform 7, or more 'markup' styles of programming where display data is the primary element for a region or module.

Hiccups?

I'm not too sure there would really be a problem, once one had all the details worked out. I would expect, for example, that expressions (foo 5) and (** foo) would be unambiguously prefix notation, while (5 foo) and (foo **) would be unambiguously postfix notation. If there were a practical problem with regular expressions, I suspect it would be caused by the depth of nested parentheses.
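As a small Scheme sketch, here is one guess at how such a first-two-tokens rule might classify expressions. The helper names are mine, as is the assumption that punctuation symbols such as ** always count as keywords while literals and parenthesized forms always count as subexpressions; none of this is part of the actual proposal.

;; Classify which positions of a parenthesized expression hold the keywords.
(define (operator-symbol? tok)          ; e.g. **  -- assumed to always be a keyword
  (and (symbol? tok)
       (not (char-alphabetic? (string-ref (symbol->string tok) 0)))))

(define (definitely-operand? tok)       ; literals and nested forms -- never keywords
  (or (number? tok) (string? tok) (pair? tok)))

(define (keyword-parity expr)
  (let ((t1 (car expr)) (t2 (cadr expr)))
    (cond ((or (operator-symbol? t1) (definitely-operand? t2)) 'odd)    ; prefix reading
          ((or (definitely-operand? t1) (operator-symbol? t2)) 'even)   ; postfix reading
          (else 'ambiguous))))

(keyword-parity '(foo 5))    ; => odd   (prefix)
(keyword-parity '(** foo))   ; => odd   (prefix)
(keyword-parity '(5 foo))    ; => even  (postfix)
(keyword-parity '(foo **))   ; => even  (postfix)
(keyword-parity '(foo bar))  ; => ambiguous (two bare identifiers)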

We're both guessing, of course. Which is why I want to set up a working implementation: I don't know, and there's no substitute for working examples.

Not prefix/postfix hiccups.

Where I refer to APL and regular expressions, I'm aiming to connote compact, concise formulas for disproportionately rich (but domain-specific) behaviors. By 'hiccups' I mean I doubt you can conveniently build these compact, concise forms.

For example, let's have a regular expression to match dates of the form yyyy-mm-dd for the 20th and 21st centuries:

(19|20)[0-9][0-9]-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])

This is something a developer could create after a brief intro to regular expressions, but adding it to your language may require a rather significant adjustment to the parser for a region of code. My intuition is that the extra keywords, spaces, and parentheses would turn this one-liner into a grand sprawl.

That said, you might also encounter some prefix/postfix hiccups. (foo bar) isn't very clear.

Hic... hic...

Whether one can define that regular-expression notation depends on subtleties of the syntax mechanism, starting with how the input stream is tokenized. I don't pretend to have every detail worked out; I like that example for the questions it raises about just how those details ought to work.

The case of (foo bar), I had noticed, yes. The simplest solution is to disallow it: require that something occur in the first two tokens of the expression that disambiguates which of the first two elements is terminal. The way to do that for (foo bar) is to parenthesize one or the other element, forcing it to become a subexpression:  (foo (bar))  or  ((foo) bar).

Languages with 'Conventional' Syntax

It's true that Lisp lends itself to macro extension, because the source is data to be manipulated. I prefer languages with more traditional Algol-like syntax, for example Lua. Fabien Fleutot's Metalua is an attempt to provide Scheme-like macros within that syntax, and it's no surprise that you have to work harder to define them.

How well do these macros read? Some examples show some possible syntax extensions.

To do it well takes taste and a sense of what works (not everyone, apparently, is cut out to be a language designer!).

terribly clueless question

If a language system had s-exprs as the basic core, but then layered Algol-ish or m-exprs or whatever on top for humans, which would presumably get translated into s-exprs, could one still let the macros be written in s-expr form so they are easy to write, and still have them work with the m-exprs, since those m's get turned into s's by the time the macros operate?

Look at Boo

Boo is a statically typed, type-inferred language for the CLR which happens to look like Python. It has a macro system which manipulates the AST directly.

Generally, languages with 'conventional syntax' are an interesting problem for syntax extensions.

Metalua has special escapes for AST literals, which make it less code-heavy:


X = +{ 2 + 2 }  -- put AST of '2 + 2' in X
Y = +{ four = -{ X } } -- make the AST for four = 2 + 2

But yes, it is perfectly possible for such a language to encode the macros as S-exprs; it's just that this notation is more natural for users of the language.
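For comparison, here is a rough S-expression analogue of the Metalua snippet above, written as ordinary Scheme quasiquotation; it is a sketch of the equivalence being described, not Metalua code.

(define X '(+ 2 2))               ; the "AST" of 2 + 2 is just a list
(define Y `(define four ,X))      ; splice it into a larger form
Y                                 ; => (define four (+ 2 2))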

It is clear that syntax

It is clear that syntax extension mechanisms, used tastelessly or clumsily, can create code whose wretched awfulness transcends what is otherwise possible.

Thanks for the links to blogs and papers; they're very interesting. A lot of people have written excellent papers about the math of syntax definition and extensible parsing. However, I'm noticing a distinct lack of anything suggesting that anyone has actually studied the effect of syntax or syntax extensibility, specifically, on programmer productivity: with, e.g., live human beings whose effectiveness is somehow being measured, a control group, etc.

I'm working with a hobby Lisp implementation. So yes, the surface syntax is all data-structure syntax: parens and atoms. But the syntax as far as the users' brains are concerned (for purposes of having some invariant context for reading unfamiliar code, etc.) is that, plus the set of atoms whose values (for constants) or bindings (for symbols) are known.

Because the language has first-order functions (another long story), allowing definitions to be declared immutable (i.e., promising the compiler that they will not change at runtime) looks necessary for a reasonably efficient implementation. If calls to most first-order procedures are inlined and partially evaluated (which normally you can't do unless the compiler can prove the binding will never change), it seems they could be implemented about as efficiently as traditional Lisp macros.
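A small Scheme sketch of why that promise matters for inlining; the procedure names are invented, and this is not the poster's implementation, just the usual argument in miniature.

;; If the compiler inlines double inside quadruple, the expansion
;; (* 2 (* 2 x)) stays correct only while the binding of double never changes.
(define (double x) (* 2 x))
(define (quadruple x) (double (double x)))

;; An assignment like the one below would invalidate that inlined code,
;; which is why inlining is normally legal only for provably immutable bindings.
;; (set! double (lambda (x) (+ x x)))

(quadruple 3)   ; => 12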

So, at this point I'm considering the question of whether the basic core of the language, from lambda to let to while to list-length to cdr, should be bound immutably -- making these things in effect reserved words (or, for the programmer's cognitive purposes, syntax).

There is one valid use for read macros as far as I'm concerned; they are for establishing an atom syntax allowing one's code to include literals (inline constants) of an otherwise unsupported datatype.

Ray

However, I'm noticing a

However, I'm noticing a distinct lack of anything suggesting that anyone has actually studied the effect of syntax or syntax extensibility, specifically, on programmer productivity.

Cognitive dimension research on C#

I doubt you'll find much research in general on syntax traits or other abstractions. It's hard enough to draw conclusions on wholly formed concrete implementations.

I've a vague memory there

There may have been a study with at least some of the properties you're asking for, with actual programmers actually programming on an actual working implementation of an extensible development environment. If there has been, methinks it would have been with the ECL programming environment, back in the 1970s. For all I know, ECL may still be the most complete concrete extensible programming environment ever implemented. Because ECL was so long ago, digging up such a study could take nontrivial library research (at a really good academic library; if I were going after it, I'd probably arrange a day trip to MIT); it's not overwhelmingly likely to be freely (or maybe at all) available on the web.

It is clear that syntax extension mechanisms, used tastelessly or clumsily, can create code whose wretched awfulness transcends what is otherwise possible.

This is the mark of any programming language feature that can do great things, isn't it, that its linguistic goodness can also be used for evil?

That quote was from Larry

That quote was from Larry Wall, not RMS.

RMS versus Larry on Lisp

Indeed. Since RMS is a well known lover of Lisp, it would be nice if Ray would edit his post to rectify the issue; we don't want to start any rumors on these here internets. ;-)

aha!

The attribution to RMS surprised me, but RMS, iconoclast that he is, frequently says surprising and provocative things, so I just thought it lacked context.

Surprising, yes,

Surprising, yes, provocative, yes, but pretty consistent.

Defining Extensions is a higher-level skill

Apropos the 'wretched awfulness' quote, there is probably a good reason why extensibility tends to be frowned on. There is a wide range of skill levels and talents in any community of programmers; e.g., in C++ there are a lot more people who can use class libraries effectively than there are people who can write good class libraries.

Syntax extensions are on a further level, where language designers operate. Even if the mechanics are straightforward, the result has to harmonize with the language. Add multiple extensions from people with different visions and you have the ultimate mad language forking tool.

Meanwhile, in the trenches, programmers like their familiar syntax and are alarmed by novelty, a reaction not always appreciated by comfortably multilingual programmers.

Programming as language development

All nontrivial programming is incremental development of new programming languages. I've found this a basic concept that naturally arises when one starts really focusing on abstraction as a general principle; at any rate, it can be found in the early abstraction writings of the 1970s. I quickly arrived at it when I started looking at abstraction in the late 1980s, and I subsequently found that Henning Christiansen had independently gotten there around the same time as I did; his dissertation on adaptive grammars was titled "Programming as Language Development".

Some extension mechanisms only work well in the hands of an expert. This has been recognized for decades; Thomas Standish observed it about macro-based extensible languages in 1975 (Sigplan Notices, July). It doesn't follow that all extension mechanisms are that hard to use.

This is what I hope for in my 'structured syntax extension' device: programmers don't find it too difficult to deal with defining new symbols, so I'm thinking that if one arranges more general notations in a straightforward enough way, programmers shouldn't find it too difficult to deal with defining those either.

programming as language creation

I've found the view that programming is language creation to be very useful myself. Programming languages come with a basic language predefined, but the really powerful part is their role as a meta-language for extending the built-in concepts with new ones more appropriate to the problem domain.

The problem that would generally arise is people creating and using clever syntax too early, making the semantics of expressions very opaque to anyone unfamiliar with the abstractions. Balancing conciseness with clarity is tough to do well, but I suspect there are real productivity gains to be had if it's done right.

Bad thing?

Add multiple extensions from people with different visions and you have the ultimate mad language forking tool.

Is this necessarily a bad thing?

I mean, let's assume that we have something like John Shutt describes earlier, where there's some elemental grammar that defines tokenization and generally provides a core for developers to work from. One of the great strengths of a language with extensible syntax is then the creation of DSLs. I wouldn't really expect DSLs to harmonize exactly with one another, but if the core language allows you to mix and match DSLs for your particular domains, that seems like it would be a big win.

Even things that would more properly be considered separate languages probably didn't get there overnight. If programmers' needs or preferences drive a particular variant that hard, isn't that useful in and of itself? Having an environment that allows languages to evolve and fork over time (like natural languages do) while maintaining some level of interoperability would be an interesting experiment if nothing else.

DSLs are definitely the best use case

But then the DSL design means that you have defined a predictable environment.

The issue is that extensions cannot generally be composed. You get fighting grammars ;) So I like what John Shutt says about extensions being a per-module thing.

And totally true, making extension easy pushes language design forward.

Reserved words and Scheme

You've got it backwards: up through R4RS, there were reserved words which should not be used as identifiers. From R5RS onward, there are no reserved words.

You're right, of course.

You're right, of course.

But I don't think I quite understand why this was considered to be a desirable design decision.

Miller vs Dybvig

It's clear from R3RS that the main problem with rebinding keywords had to do with the semantics of special forms. Once a principled semantics of macros was fixed, and special forms could be defined in terms of them, that objection evaporated; hence the change with R5RS. This makes the "little languages" approach more flexible.

It's a bit hard to see the IEEE approving a Scheme that allows such rebinding, though.

While I'm talking about R3RS and keywords, the R3RS standard's Texinfo source has the following intriguing commented-out text:

@ignore todo
The Miller vs. Dybvig debate over reserving new keywords. What to
do?
@end ignore

I assume this Miller is the Jim Miller of Multischeme fame. I'm curious as to what the debate was about.

You can, certainly. Does it follow that you should?

A principled semantics of macros made it possible to define special forms in terms of macros. I do not believe that it therefore follows that those definitions ought to be mutable or capable of being shadowed by local definitions.

These are separate questions. I would have argued in favor of introducing immutable and "luminous" (unshadowable) items to the language, even if only for syntactic bindings, in order to allow the definition of special forms.

R5RS made lexical scoping universal

It has always been true in Scheme that there are no reserved value identifiers (variables); one can locally rebind cons or list or < without problem, and in certain contexts this is good style — for example, defining a list-sorting function with two formal parameters named list and < actually contributes to readability.
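A small Scheme sketch of that style (my example, not the commenter's): a list-sorting function whose formal parameters shadow the standard bindings list and <.

;; insert x into the already-sorted list lst, using the supplied ordering <
(define (insert x lst <)
  (if (or (null? lst) (< x (car lst)))
      (cons x lst)
      (cons (car lst) (insert x (cdr lst) <))))

;; the parameters list and < locally shadow the standard bindings,
;; and arguably make the definition read better, not worse
(define (insertion-sort list <)
  (if (null? list)
      '()
      (insert (car list) (insertion-sort (cdr list) <) <)))

(insertion-sort '(3 1 2) <)   ; => (1 2 3)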

As long as the list of Scheme syntax keywords was fixed, it was plausible to reserve them, but with extensible syntax, treating syntax keywords and variables uniformly makes sense to me. It's less likely that you'd locally redefine if than < (an instance of Quine's "maxim of minimum mutilation"), but neither seems inconceivable.

conciseness is important

I remember seeing a study a long time ago that found people program at roughly the same speed independently of the language they use (justifying the use of high-level languages ...). I'm sure there were a lot of caveats to that result, but I think that conciseness generally helps productivity and understanding. It seems logical to expect good syntax, and in particular syntax appropriate to the problem domain, to be important.

Back when I was a student, all my engineering classes had sophisticated, highly evolved notations for almost every kind of problem. These often borrowed heavily from those existing in the sciences and mathematics but had frequently been adapted for one reason or another. I'm sure that had I looked at the older literature there would have been a lot of poor notations that had fallen into disuse and been discarded.

The big problem with extending syntax is doing it in a sufficiently disciplined way that the benefits exceed the costs. At the same time, a great syntax likely takes a lot of iterations and a level of maturity that many software systems never attain, so the cost/benefit curve will take a long time to look good. Syntax extension is almost certainly not low-hanging fruit, but I would not be the least bit surprised if there are important wins to be had.

Problem domain

We've talked about the details of this before, like implicit vs. explicit variable declarations.

In general:
The shorter the target program is, the more free and expressive you want the syntax to be, to make programs short and thereby reduce complexity by reducing length.

Once the program approaches about 20 lines it exceeds what can be read "at a glance", and more rigid syntax is needed to allow one to differentiate lower from higher structures.

Once programs approach 2000 lines they can't be read and understood completely at all, only in parts. And then even more rigidity is needed.

Long LISP programs end up with lots of syntax.

Your examples about redefining keywords and operators more or less come down to the age-old question not so much of syntax but of flexibility. Should a language allow you to change its behavior, and if so, how much? I really think that comes down to

1) Compiled vs. interpreted as the core development paradigm. If the language is developed inside an interpreter, it should be extremely flexible, because your environment of execution is just an extension of the language. Conversely, if the language has no interactive aspects, you want the compiler to pick up as many issues as possible, which means you want syntactic regularity and simplicity.

2) How extensible should the language be with respect to its problem domain? The more the language should "do everything", the more you need it to create DSLs, de facto if not de jure.