Resolved Debates in Syntax Design?

In the spirit of the What Are The Resolved Debates in General Purpose Language Design? topic, I would be interested in hearing your opinions on the specific problem of syntax design. In designing a new syntax for a programming language, which decisions are objectively good (or bad)?

Most syntactic questions are rather subjective (for example, 'CamelCase' or 'with_underscores' identifiers?), but I think some can be answered definitively with convincing argumentation.

Here is one example: recursive scoping should always be optional and explicit. Recursive scoping is when a defined identifier is in scope in its own definition as well as at later usage sites. In Haskell, term definitions have recursive scoping by default, while OCaml's do not (there are separate let ... and let rec ... syntaxes). The non-recursive default allows useful programming idioms such as let x = sanitize x in ... or let (token, i) = parse i. Haskell programmers would sometimes benefit from such a possibility, as can be seen here and here. Type definitions are implicitly recursive in OCaml, and this is also a pain.
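The shadowing idiom above can be sketched in Python, where assignment is likewise non-recursive: the name on the left rebinds to the value computed from the old binding on the right (sanitize and handle are hypothetical names for illustration):

```python
def sanitize(s):
    # Hypothetical cleanup step, standing in for any validation.
    return s.strip().lower()

def handle(x):
    # Rebind x to its sanitized value, like OCaml's non-recursive `let`:
    # the x on the right-hand side refers to the old binding.
    x = sanitize(x)
    return x
```

After this rebinding, the raw, unsanitized value can no longer be reached by accident, which is exactly the point of the idiom.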

Examples of debates that are probably not resolved (yet?):

  • indentation-sensitive syntax
  • open (if .. else ..) or closed (if .. elif .. else .. end) conditional statements

Do you know of resources discussing such syntactic issues in a general way, applicable to numerous (or all) textual programming languages?


Palindromic constructs

Closing keywords that are the opening keywords spelled backwards have clearly gone out of fashion. The last adherents were probably designing (or implementing) Unix shells.

do something od

See also the Promela modeling language, used with SPIN and other model-checkers.

Source encoding; associative array literals

All modern language specifications seem to be converging (with good reason) on treating source code files as UTF-8 by default, sometimes with an option to override the default. But they still differ on which Unicode characters can be used in identifiers and operators.

Most language designers now seem to agree that associative arrays (maps/hashes) deserve their own literal syntax, just like lists and strings. You even see this in new Lisps like Clojure.
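As one plausible instance of that convergence, the three collection kinds can sit side by side as literals in Python (the variable names here are made up):

```python
# List, string, and associative-array literals side by side.
langs = ["OCaml", "Haskell", "Clojure"]  # list literal
greeting = "hello"                        # string literal
arity = {"car": 1, "cons": 2}             # map/dict literal
```

The map literal is just as lightweight as the list literal, which is the argument for giving associative arrays their own syntax rather than a constructor call.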

Syntactically significant whitespace

I doubt this debate is going to be resolved any time soon. Proponents like it because it forces visual consistency across code and removes extraneous syntactic cruft. Opponents feel that code formatting is something for an IDE to control, and compare the practice to making tokens behave differently depending on what font they're displayed in. There is not much room for compromise, nor are there practical implications that could help win the debate.

But no, I don't know of any resources on syntax options themselves.

Font-dependent syntax

making tokens behave differently depending on what font they're displayed in

Now that is an interesting idea... Maybe we could start with italics, bold etc?

... Just joking.

And what about the line-terminating symbol (the semicolon in ALGOL-family languages)? Is it useful/required?

Does anyone think that function calls that include the function or target inside the brackets are better syntax than having it outside?

Lisp: (doit one two three)
ObjC: [thing add:blue]

Semantic-exposing syntax

I'm not sure what you mean by "line-terminating symbol". If you are referring to the statement-terminating or expression-separating use of the semicolon in most programming languages, those are important because they protect against ambiguity. If you consider them as "line-terminating symbols" (that is, only one statement/expression per line), you have a whitespace-dependent language: removing them forces you to consider '\n' as the new terminating symbol.

Does anyone think that function calls that include the function or target inside the brackets are better syntax than having it outside?

Having the brackets outside is good when it exposes some semantic choice of the language. It's good in Lisp because it reflects the Lisp AST structure (which you manipulate when writing macros).
Code-data similarity could be recovered by using a Lisp separator other than whitespace, but that would noticeably reduce the ease of writing lists, which are the most-used structure of the language.

I also suspect the ObjC syntax is motivated by a compromise between a light message-passing syntax (inspired by Smalltalk) and the need to integrate well with C syntax.

I do not think either choice (caller inside or outside the brackets) hurts readability or ease of programming in general: or is there a specific way of writing something that is frustratingly unavailable with one of these syntaxes?

I have, however, observed that programming beginners with a math background have a much harder time with an inside-caller syntax (or no brackets at all, as in curried languages).
Some other syntax choices can be discussed in the light of "novice programming" (see Okasaki's In praise of mandatory indentation for novice programmers). However, one must be careful to distinguish objectively good choices (which mandatory indentation might be) from easier-because-familiar choices (which the bracket placement probably is).

I'm not sure what you mean

I'm not sure what you mean by "line-terminating symbol". If you are referring to the statement-terminating or expression-separating use of the semicolon in most programming languages, those are important because they protect against ambiguity. If you consider them as "line-terminating symbols" (that is, only one statement/expression per line), you have a whitespace-dependent language: removing them forces you to consider '\n' as the new terminating symbol.

Interesting distinction. It would be nice if there were no ambiguity with them removed (I think Lisp is like that), but then you could write a program with no visible separator, which would be obfuscating.

Don't you think that almost all potential programmers have at least a minimal mathematical background? Personally, I don't like to start a code line with an opening bracket, regardless of the language.

making tokens behave

making tokens behave differently depending on what font they're displayed in

Fortress does this. The language distinguishes between italic, normal, boldface, and blackboard-bold faces. This leads to code that reads like math.

Also Mathematica does this.

I was sure the second

I was sure the second sentence was going to end differently. Something like: "Which leads to code that reads like ransom letters"...




It sounds like we've got the basis for a new programming language called "Ransom".


See also "stropping" in Algol and some other 1960s languages. This turned into the tradition of SHOUTY keywords in Pascal and its descendants.


>I doubt this is going to be a resolved debate any time soon.

Depends: I remember a teacher reporting here (I couldn't find the URL, sorry) that he tried teaching two versions of his own language, where the only difference between the two versions was the Pythonic vs. C-style "indentation is significant or not", and he found that students learned better with the Pythonic style.

Ease of learning is one interesting point; ease of maintenance would be another interesting point to measure.

IMHO, it'd be very interesting to take a 'new' language such as Scala, make a whitespace-significant version, and *measure* which one students learn more easily or maintain code in better.
Then use these experiments to drive the syntax evolution of the language: a slow process, but at least one where the designer's 'gut feeling' is replaced with measurements.

A good compromise could be a

A good compromise could be a language with both a whitespace-dependent and a classic syntax, such as Haskell (which has a non-whitespace-dependent mode where you replace indentation changes with brackets and line breaks with semicolons).

For such a language to be practical, it must have a direct mapping between both syntaxes (which doesn't necessarily mean that you have only one kind of non-whitespace delimiter), so that users can make the transition easily. It means that the layout rules (the whitespace-dependent syntax rules) have to be simple and clean, which can be both an advantage (simple rules are simple to explain and respect) and a disadvantage: it is possible that our visual appeal naturally asks for complicated layout rules with numerous special cases, coming from our diverse and non-unified experiences with the material world.

The idea can be generalized to a vision of a language with different syntaxes for different people/environments (which we already have, given the diversity of programming editors providing very different experiences of source-code manipulation); it is possible to imagine different sets of syntax rules, but for collaboration to be possible between users of different syntaxes, the rules must be very strict and users should not do things "their own way".
For example, users of the C language with different indentation offsets (2 chars, 4 chars, 8 chars...) can share code and edit it in their preferred style by using the "indent" tool to convert between indentation conventions, but they must follow their indent configuration very strictly for the file to look good after conversion.
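The strictness requirement can be seen in a minimal sketch of such a conversion (a toy in Python, assuming the file uses exact multiples of the old indent width; real tools like indent understand C syntax rather than just counting spaces):

```python
def reindent(src, old=2, new=4):
    """Convert leading indentation from old-width to new-width steps.
    Only correct when the source strictly follows old-width indentation."""
    out = []
    for line in src.splitlines():
        body = line.lstrip(" ")
        depth = (len(line) - len(body)) // old  # integer division drops stray spaces
        out.append(" " * (new * depth) + body)
    return "\n".join(out)
```

A line indented by, say, 3 spaces in a 2-space file would silently lose a space here, which is exactly why the conventions must be followed strictly for round-tripping to look right.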

better comparison

How about adding a whitespace-dependent pragma to Ruby? It already supports both braces and do...end delimiters. That way, we'd have three different ways to handle it in a single language.

Of the three, I definitely prefer the do...end keyword approach, but I'll fall back on braces as a distant second if need be.

Oh, how could I forget -- add Lisp-style parentheses as well, for four different block formatting syntaxes.

clarity and conciseness

Syntax is for the human reader. Obviously it needs to be unambiguous for the compiler, but that is a much lower bar. If there is a chance the reader or writer might be confused about what they have written, then it is generally a syntax problem.

Significant whitespace and open/closed blocks are related, as allowing significant whitespace is generally how you delimit open blocks. I didn't originally like the idea of significant whitespace, but after using it I find it preferable: when done well, it reduces clutter.

Also, more modern styles of programming, with shorter blocks thanks to better support for abstraction, work well with significant whitespace. The older styles, with much longer procedures and long/nested blocks, don't work so well with it. So there are probably some scaling issues.

camelCase versus ugly_underscores

Personally, I think both options are uglier than sin. I much prefer hyphenated-identifiers as commonly used in Lisp dialects. Really, Phosphorus had me in shreds.

Why not just allow spaces?

Would it really be so hard to just allow spaces in identifiers?

the result = scale factor * my function(my argument, my other argument) + some offset

It would probably only work in algebraic languages, not ones that rely heavily on the space as a delimiter such as Lisp or concatenative languages.

In my opinion, that would be

In my opinion, that would be a bad idea:

  • it would make the language whitespace-dependent in a weird way (are two spaces and one space the same?)
  • you could no longer use keywords next to identifiers (as that would lead to terrible ambiguities for the human reader: "for i to j do the action end"); in practice you're restricted to symbolic separators/delimiters (reducing the number of possible special forms for interesting constructs in your language)
  • you couldn't use the space for something else more useful in your language (Lisp: lists; ML/Haskell: function application; Smalltalk: message passing; concatenative languages...)

Note that the second point could be mitigated by a proper programming environment (having bold faces for keywords and italic faces for identifiers would reduce confusion).

I suppose one way to work around those issues is to specify delimiters for identifiers (akin to the now-standard " for strings): `the result´ = `scale factor´ * `my function´(...) (or ${the shell way}). That's of dubious interest.

Normalized Whitespace

Some Schemes allow you to write arbitrary variable names using quoting, as you say.

As for whitespace in identifiers, I would think the best way would be to normalize multiple spaces to one space. I don't know about newlines, though; maybe those shouldn't be allowed in the middle of identifiers in such a syntax.

So does |COMMON LISP|

Common Lisp does that, too, but its quote symbol is |:

* (intern "common lisp")

|common lisp|

Normalized Whitespace

Inform 7 allows spaces in identifiers, but it's kind of a special case.

Historic examples

Fortran allowed spaces in identifiers -- white space was completely ignored, even in the middle of tokens. So

      DO 1 I = 1

and

      DO 1 I = 1, 10

were tokenized completely differently. (For the kids in the audience: the first assigns 1 to a variable named DO1I. The second is a DO loop (a for statement), terminating at the line labeled 1, with loop variable I ranging from 1 to 10.)

Identifiers in the Project SUE System Language were composed of letters, underscores, and digits (upper case only; the character set was the IBM Model 29 keypunch subset of EBCDIC), like this:

      INTERRUPT_VECTOR.ADDRESS = SYSTEM_TIMER_INTERRUPT_ROUTINE;

They were rendered in listings (on an IBM 2314 printer with an upper/lower-case print train) in mixed case, with the underscores replaced by spaces, like this:

      Interrupt Vector.Address = System Timer Interrupt Routine;

I'm not sure what my point is, except maybe that there is nothing new under the sun, and all possible mistakes have already been made.


Scala allows quoting identifiers. The primary reason is to support using Java libraries that use identifiers that happen to be Scala keywords, but it does work for arbitrary uses. (I suspect most Scala programmers would slap anybody who put this in production code.)

scala> val `scale factor` = 28
scale factor: Int = 28

scala> def `my function`(`first argument` : Int, `second argument` : Int) = `first argument` + `second argument`
my$u0020function: (Int,Int)Int

scala> val `some offset` = 12
some offset: Int = 12

scala> val `my argument` = 2                                   
my argument: Int = 2

scala> val `my other argument` = 3
my other argument: Int = 3

scala> val `the result` = `scale factor` * `my function`(`my argument`, `my other argument`) + `some offset`
the result: Int = 152

Thanks for reporting, that's

Thanks for reporting; that's a very good rationale for the feature.

A problem with this design choice is that it doesn't cooperate well with the idea of separate syntactic classes of identifiers for separate language constructs. For example, OCaml allows only a lowercase first letter for value, type, and class identifiers, and only an uppercase first letter for module and constructor identifiers.

To have both of those features you would have to either:
- have different kinds of quoting for different identifier classes
- make case distinctions on the quoted identifiers

I think both of those solutions are unsatisfying, the first probably being the better one. Another possibility would be to tag quoted identifiers, in the spirit of Camlp4 antiquotations: `lid:the result`, `uid:-My-Type`, with a reasonable default for untagged quotes.

I have to mostly agree with BlueStorm

Maybe this is viable, but I'm doubtful. It would require a radically new syntax in almost any language. Would this really be that much worse?

the-result = scale-factor * my-function(my-argument, my-other-argument) + some-offset

If by "algebraic" you mean not fully prefix or postfix, this suggestion would not mesh well with, say, ML or Haskell, in which a space often means function application.

Of course, I think the variable names in your example are a bit overly verbose. But that's another issue entirely. :-)


You just lost one of:

1) insignificance of whitespace
2) a commonly used notation for infix subtraction akin to the one you used for infix addition and multiplication
3) the (dubious, but extremely widespread) conflation of the hyphen and the minus sign into the Unicode HYPHEN-MINUS (U+002d)

Now we get to debate which of these is least surprising to the uninitiated. ;)

Mudlled (sic) syntax

Long ago, a friend was writing an extension language for a MUD - basically, a Scheme with algebraic syntax. His original design allowed hyphens in identifiers, but I advised against it, thinking it would surprise the intended users. He was foolish enough to listen to me, and we got stuck with underscores instead.

Allowing hyphens isn't really a problem — it worked nicely in Dylan — and most people seem to agree that identifiers composed of multiple words are easier to read that way. And it forces users to put spaces around operators!
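The trade-off can be sketched with a toy lexer rule in Python (the token patterns are assumptions for illustration, not any real language's): once hyphens may appear inside identifiers, "a-b" lexes as a single token, so subtraction must be spaced.

```python
import re

# Identifiers may contain interior hyphens; operators are single symbols.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*|[-+*/]|\s+")

def tokens(src):
    """Tokenize, dropping whitespace runs."""
    return [t for t in TOKEN.findall(src) if not t.isspace()]
```

Here tokens("a-b") yields one identifier, while tokens("a - b") yields identifier, minus, identifier: the spacing rule the comment above describes.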

already done

…just allow spaces in identifiers…

This has already been done – in Algol 68.

And also in C64 Basic

And also in C64 Basic


F O R N U M = 1 T O 1 0

How esoteric :)

Not that hard.

I have this concept in the hobby language I'm working on, and it's not all that hard. The method I use is to allow method parameters to be explicit identifiers. Combined with application via juxtaposition, with 'my other argument': 'my' becomes a function that takes the literal 'other', which in turn takes 'argument', which returns a reference or property or whatever you need.

This eliminates the 'one vs. many spaces' issue described in other posts; the lexer deals with that. So the identifier isn't literally "my other argument", but it's functionally equivalent. There are a few other benefits to this method (and one or two downsides), but it works great so far in prototyping: descriptive variable names without syntactic clutter.
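The juxtaposition trick described above can be loosely sketched in Python (all names here are hypothetical): each word is a value that consumes the next literal word, so separate tokens, not a quoted string, carry the multi-word name.

```python
class Word:
    """Each attribute access appends one more literal word,
    emulating 'application via juxtaposition' from the post."""
    def __init__(self, parts):
        self._parts = parts

    def __getattr__(self, word):
        # 'my'.other -> a new Word for ("my", "other"), and so on.
        return Word(self._parts + (word,))

    @property
    def name(self):
        return " ".join(self._parts)

my = Word(("my",))
```

With this, my.other.argument.name evaluates to "my other argument"; the 'one vs. many spaces' question never arises because the lexer only ever sees separate word tokens.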


What happens when you have code like this?

    my a = 1
    foo my -- what does this even mean?
        my b = 2

Perhaps I'm

Perhaps I'm misunderstanding, but you'd get an unresolved-identifier error (if foo is bogus) or a standard bad-parameter error (if foo takes something that isn't 'my' or 'a'->int).

[edit: ah. I forgot to mention static, explicit typing. That does seem important in retrospect/re-reading]

I wasn't very clear

I really had two questions in the last post, one of which you've answered, I think. One was how a call to some 'foo' would resolve whether 'my' was intended to refer to a literal 'my' or to the value bound to the variable 'my'. It sounds like the answer is overloading.

The second question was about how this scheme works in the presence of scopes. This is a better example:

    my a = 1
    foo my
        my b = 1
        my b = 2

Assuming foo's first parameter is of the correct type to receive 'my' as a function of literals, what function does it receive? Does it include 'b' in its domain? If so, what value is bound to 'b'?

Bjarne Stroustrup wrote a whole paper on whitespace overloading.

Here's the link:

It's always been one of my favorite papers. It might even beat the Phosphorus paper.

Talked with Bjarne about this...

I've talked with Bjarne about this paper w.r.t. Fortress; he was basically reveling in the potential schadenfreude... although he has pointed out, tongue-in-cheek, that technically there's nothing really wrong with it.

Character position

I guess we can consider not using character position as resolved.

Some ancient languages, Cobol for example, used character position for syntactic purposes: line numbers in the first columns, code starting at a fixed column, comments flagged in a dedicated column, and so on.

Resolutions require awareness...

...or blighting the opposition.

I think syntax should be more hypermedia-based. My comments should simply be sticky notes positioned somewhere in the source text, capable of containing pictures.

If you say otherwise you will be blighted.

Textual representation

The compromise in a textual language is to keep a textual representation available for "less advanced environments". You could probably use some markup syntax (e.g. Markdown, Wikipedia...) in comments and have your IDE display it as a sticky note, in any of today's languages, while still retaining good readability for users of textual editors.

That solution makes the location of the comment less precise (a line or group of lines instead of a precise part of a line), but you could imagine delimiter schemes such as /* note{ */ ... /* bla bla blah } */. A bit less readable, but acceptable for a textual reader (who may even appreciate the improved locality of comments, if the chosen location syntax is not too heavy).
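A rough sketch of how a tool might extract such notes, using a simplified variant of the delimiter scheme above where the note text sits between the two marker comments (the exact pattern is an assumption):

```python
import re

# Matches /* note{ */ ... /* } */ and captures the text between the markers.
NOTE = re.compile(r"/\*\s*note\{\s*\*/(.*?)/\*\s*\}\s*\*/", re.S)

def sticky_notes(src):
    """Return the note texts found in a source string."""
    return [n.strip() for n in NOTE.findall(src)]
```

An IDE plugin could feed each extracted note to a markup renderer and display it as a floating sticky note, while plain-text editors still see an ordinary comment.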

That's reasonable today, and only a matter of cooperation between the language users and the programming-environment writers (which can be 'you' in a modular environment with plugins). There is also the possibility of having that plurality of representations directly in the language description, as Algol did.
Today's languages actually converge in that direction regarding comments, with a "blessed" syntax for documentation in comments (javadoc, ocamldoc...); your idea is not very different.


While it's acceptable for programs to be written in extended character sets (cf. Java, Fortress [1]), I think we can say there is a consensus that the parts of the character set beyond ASCII can only be used in comments and in character and string constants.

I'd hesitate to call this "resolved", though, except in the sense that it's a consensus I could see being reopened.

[1]: Fortress actually allows unicode characters in identifiers, but if I understand the spec correctly, these identifiers are equal to ones using an ASCII encoding.

I dissent

I don't think this is a consensus at all. In fact, I think the general direction that these things are going is to allow all of unicode in program source. The Unicode Consortium even has a document about how to define an identifier using the Unicode character classes. I think most languages defined in the last 10 years are going in this direction (R6RS is a good example here).

As for Fortress, there are ASCII encodings of some Unicode characters, but the representation is Unicode, not the other way around.

Tempora mutantur

There was a consensus for some time about ASCII being the only useful and practical character set to plunder for notation and allow in identifiers, but much less so now.

Actually, the strongest reason for ASCII's dominance today is the tyranny of the IBM PC/AT Enhanced Keyboard. In an alternative universe, we would all be Space Cadets.

Unicode identifier lexing guidelines

I agree with your dissent. For reference, I believe this is the Unicode TR you were referring to.

Unicode IDs

I certainly plan to use them, along with extensible syntax. I cannot see any reason to stick with ASCII, excepting that it's easier to type on my keyboard (but that'd be rather biased of me, wouldn't it?).

I am planning to use XID_Start XID_Continue* for identifiers in general, and split them: if the XID_Start character is upper-case or title-case (according to the Unicode standard), then the identifier is a variable name; otherwise the identifier is an atom. This is the case distinction Oz uses - case sensitivity to distinguish variables from labels/data called 'atoms'.
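That case rule can be sketched in Python with the standard unicodedata module (assuming the identifier has already been validated as an XID sequence; the function name is made up):

```python
import unicodedata

def classify(ident):
    """Oz-style case distinction: an upper-case (Lu) or title-case (Lt)
    first character marks a variable; anything else is an atom."""
    first = unicodedata.category(ident[0])
    return "variable" if first in ("Lu", "Lt") else "atom"
```

Because the test uses Unicode general categories rather than an ASCII range, it works for identifiers in any script with a case distinction.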

Neither XID_Start nor XID_Continue allows the hyphen-minus '-'. I do not know whether they allow a specific hyphen character. Either way, it'd be transparent to the language.

Fortress input encodings

I'm happy to be wrong about where the RW lies, but when you write:

As for Fortress, there are ASCII encodings of some Unicode characters, but the representation is Unicode, not the other way around.

my understanding is that Fortress allows both ASCII and Unicode representations; Unicode is the default, but you can specify other encodings, including ASCII.


I can't wait for the fun to be had with homoglyphs: 'var a, a, a; a=a.a(a); ...'

Much security literature on that topic

in the context of script-mixing.

Perl has allowed Unicode in identifiers

Perl has allowed Unicode in identifiers since around 2001.

And yet Perl libraries don't use it (maybe there is some very exceptional case, but it's not common).

I wonder why that is. Is it a bad idea? Do people have problems typing the unicode characters? Is it simply unnecessary?

Problems typing them seems a

Problems typing them seems a reasonable guess.

I'd consider support for Unicode identifiers or macros to be more of a blessing for multi-national programming languages, especially when working with 'extensible syntax' designs (like PEGs).

Disappointed, but not surprised

Whitespace sensitivity, identifier naming, unicode support, and most other matters of syntax suffer greatly from the bicycle shed problem whenever they come up for discussion. There really isn't a safe place to ask, even here. I'm disappointed, but not surprised.

I'd be interested in an actual usability study comparing different syntaxes for some language, but I cannot find any, and I believe it would be difficult to get a meaningful result from such a study, since it would be very hard to make a controlled experiment.

Can anyone give a link to real, scientific syntax study, rather than just another personal opinion?

What I was originally

What I was originally interested in is a set of metrics that would approximate the "readability" of a syntax for the human reader. It would be great, for example, if we had a study claiming that "the readability of a program is directly related to the number of shift/reduce conflicts in the grammar", or any other objective measure that would allow one to compare and improve programming languages' concrete syntaxes.

I haven't found anything of that order, but here are some pieces that might interest you: the PURe project developed related topics, and (although most links seem to be dead) I found quite interesting things:

Metrication of SDF Grammars (pdf)
This paper gives different metrics for a grammar (described using the SDF framework); different ways of specifying the grammar of a given programming language lead to different measures (result table, page 19), but they are similar enough that we can compare different programming languages. Java syntaxes, for example, are consistently "simpler" than C++ syntaxes.
Toward an Engineering Discipline for Grammarware
This paper (which seems to be behind a paywall, though I originally found and read it) concerns the broader topic of "Grammar Engineering". I found it much less related to my original question, but interesting nonetheless. They also present use cases for "grammarware" that I hadn't thought of, such as recovering a partial grammar from a large codebase with no accessible grammar description (think of old COBOL programs with complex grammars hidden in compilers).
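To make the idea of grammar metrics concrete, here is a toy sketch in Python (the grammar and the metric names are invented for illustration, and are far simpler than the SDF metrics in the paper):

```python
# A grammar as nonterminal -> list of right-hand sides (each a symbol list).
grammar = {
    "expr":   [["expr", "+", "term"], ["term"]],
    "term":   [["term", "*", "factor"], ["factor"]],
    "factor": [["(", "expr", ")"], ["NUM"]],
}

def size_metrics(g):
    """Compute simple size measures: nonterminal count, production
    count, and average right-hand-side length."""
    prods = sum(len(alts) for alts in g.values())
    symbols = sum(len(rhs) for alts in g.values() for rhs in alts)
    return {
        "nonterminals": len(g),
        "productions": prods,
        "avg_rhs_length": symbols / prods,
    }
```

Comparable numbers computed over two real grammars are what lets the paper rank, say, Java syntaxes as "simpler" than C++ syntaxes.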

Color Coded syntax

ColorForth is a Forth dialect which uses the color of words to indicate their role.
Here's a link: