Yet another programming language with customizable syntax

I'm designing a new general-purpose programming language with entirely customizable syntax. Its "killer feature" is the naturalness of function/macro syntax definition: other languages with customizable syntax (Nemerle, LISP, XL) require either knowledge of compiler construction theory or reading a large amount of manuals, or both (and the resulting syntax definition constructs look rather cryptic), while in my language all of this is done rather intuitively. Just look at the examples.

Example 1:

`{min} < {var} < {max}`
  returns Boolean
  :
  min < var and var < max

`{HH}:{MM}:{SS}`
  where HH is Natural and 0 < HH < 23;  -- Here parser fails when using `{x} < {y}`
        MM is Natural and 0 < MM < 59;  -- so it uses `{min} < {var} < {max}`.
        SS is Natural and 0 < SS < 59;
  returns Natural
  :
  HH * 3600 + MM * 60 + SS

`time between {time1} and {time2}`:  -- No "returns" clause, automatic type inference is used.
  time2 - time1

`how much {x}`:
  x

`main`:
  print how much time between 23:48:12 and 08:17:00  -- Absolutely natural-language-like expression.

Example 2:

class Point.  -- Predefined function with syntax `class {regexp(\w+)}` is used.

`{p}_x`
  where p is Point;
  returns Number
  :
  field  -- That's how fields and variables are declared.

`{p}_y`
  where p is Point;
  returns Number
  :
  field

`({x}, {y})`:
  p := new Point;  -- Function `new Point` is declared automatically for every class.
      -- Function `{regexp(\w+)} := {value}` is predefined.
      -- Type inference is used heavily everywhere.
      -- Oh, and function `{f1}; {f2}` (sequential execution) is predefined too.
  p_x := x;  -- Function `{p}_{x} := {value}` is declared automatically for every "field" function.
  p_y := y;
  return p

`||{p}||`: sqrt(p_x^2 + p_y^2)

`{p1} - {p2}`
  left-associative;
  priority is the same as for `{x} + {y}`  -- Yes, priorities are relative.
  :
  (p1_x - p2_x, p1_y - p2_y)

`D({p1}, {p2})`: ||p1 - p2||

`main`:
  p1 := (0, 0); p2 := (3, 0); p3 := (0, 4);
  print D(p1, p2) + D(p2, p3) + D(p3, p1);  -- We'll get 12!

It is possible to create a similar language with XML tags instead of the "`"-s and ":"-s. For example:

<func>
  <syntax>time between <arg>time1</arg> and <arg>time2</arg></syntax>
  <body>time2 - time1</body>  -- Look at the body, its syntax is still entirely customizable!
</func>

so in fact I'm designing a meta-language.

Should I proceed, or is this an entirely stupid idea?

Related work?

How does it compare to related work?

(Remark: I was going to ask about specific related work, e.g. mixfix operators, but I think I'd be more interested in your personal and more knowledgeable choice of related works to compare.)

This is my own research

Sorry, but I read "LtU is a blog, not a forum!" only after I had made the post. This work is my own research (though based on several years of learning other programming languages and studying compiler design theory) and is intended to be discussed rather than to inform. Maybe this topic should not be on LtU, but my friends advised this resource, and there are already plenty of comments.

But must have been tried before...

In the COBOL days of computing science, one of the holy grails people were looking for was exactly what you propose: Programming languages which mimic natural languages by employing an extensible syntax.

I don't know a lot of these approaches, but there must be a ton of research from the sixties/seventies on it. Other than that, I can tell you you'll need an extensible parser and probably need to limit the production rules to some form. Plus, as stated by others, you'll need some disambiguation rules.

Lastly, of course it's a stupid idea; I therefore recommend you follow it through. Stupid ideas are always nice. ;-)

The bad part of it, as you can gather from history, is that it has been tried before and didn't work out - wasn't accepted as viable - then. I see no reason why your compiler would generate more than a little interest at best. But then, "Who cares?"

Dilettantes built the Ark. Professionals built the Titanic.

You are absolutely right, this is yet another attempt to find the Holy Grail of programming language design. But the failure of previous attempts does not mean that my attempt will fail too, does it?

I have been studying the history of programming languages for some time and have come to the conclusion that all attempts to create a similar language suffered from the following flaws:

1) Overcomplexity. Designers of these languages introduce very flexible means of syntax customization, but the interface to these means is either hard to learn or outright unusable. They design the syntax customization facilities bottom-up - first extension points and low-level hooks into the parser, then language constructs on top of these extensions - but they need to be built top-down: first the language constructs (simple and convenient), then the parser extensions implementing them. The language as such should be thought through first; rich theory and smart parser-generation tools will help to implement the idea.

Also, nobody used the most obvious idea: to borrow notation from other areas of activity (for example, in maths we write "a+b", using "a" and "b" as placeholders for actual values), or even from natural language itself. We are used to writing in documents: "`Class <class name> inherits <other class name>` means..." - so why not write the same way in programs?

2) Poor positioning in the market. Usually, designers of languages with customizable syntax position their products as a proof of concept only ("wow, these things can also be done with a computer!") and never as new industrial tools. At least that is the impression I got. And of course they get what they aim for.

Also, the designers don't promote their languages in the market at all; they build them as if for a new, unoccupied niche of applications. A programming language is not only syntax and semantics but also its infrastructure - libraries, community, corporate support, etc. And a language is hardly going to be preferred by programmers if it has poor infrastructure, even if it is excellently designed.

I'm positioning my language as - no more, no less - a new industry-standard programming language, as a replacement for C++, Java, C#, etc. Yes, it is a very ambitious goal (maybe too ambitious), but I have a trick in reserve: I'm not going to develop my language's infrastructure from scratch and compete with industry-adopted standards. I'll just borrow them. I'm going to make C++ the first target for my compiler and use Qt as part of the standard library. Or at least MSIL and .NET.

And, who knows, maybe my attempt will be accepted as viable. ^_^

Holy Grails, Batman

Extensible syntax is nice, but it's still just syntax. Presumably the other details are just as or more important.

Here are a few issues to think about:

- How do you allow partial application? e.g. (+ 1) in Haskell

- How do you treat a syntactic form as a function? This is really just a special case of partial application with zero parameters, but for example, how do you write 'map points (_x)'? A problem with completely unstructured syntax is that it lacks structure. You'll probably want to introduce a few special forms to handle this kind of manipulation.

- Have you thought about handling deeper forms of customizable syntax, such as constructs with binders? How would you introduce a 'lambda' construct, for example?

Many days are spent on research

I have been thinking about this language for a while. In fact, I published it here once the design started to seem consistent to me. There are many issues I have thought through, but I just don't know how to state them well.

As for the issues you mentioned:

- Functions in my language are first-class objects. And a function declaration without a body is a reference to the function. Yes, the resulting code may be very long, so I introduced function aliases:

`time between {t0} and {t1}` (1) (dt)
...

or, in XML-like instance of the language:

<func>
  <syntax>time between <arg>t0</arg> and <arg>t1</arg></syntax>
  <alias>(1)</alias>
  <alias>(dt)</alias>
  ...

They are just syntactic sugar for:

(1) := `time between {t0} and {t1}`
(dt) := `time between {t0} and {t1}`

- I think partial application decreases readability. Also, it can be implemented via the mechanisms mentioned above. So I didn't introduce special syntax for currying. Maybe it would be good to make all functions automatically curried but, again, I don't think it would help the maintainability of programs.

- All functions you declare are actually closures - inside a function's body, everything outside of it is accessible.

...And you have caught me on lambdas. It seemed to me that anonymous functions (for example, functions with special-form declarations like `anonymous({x}, {y}, {z})`) together with macros and regexps would be enough, but with them constructs like (x, y) => x + y are unimplementable.

The fundamental problem is that I relied on the fact that the body of a function is always sharply delimited (by indentation in the "simple" form I usually write in, by <body> ... </body> in the XML form), but for lambdas this is not always true. An operator `({args}) => {func}` with carefully adjusted priority could help in this case, but the problem is that inside func the parser does not yet know about args.

Thanks, I'm off to spend more time on research.

The problem is much more

The problem is much more fundamental. When I designed the language, I cheated: constructs are allowed which may contain expressions not yet known to the parser. For example, there is a built-in function (in pseudocode):

`{f1}\n\n{f2}\n\n{f3}\n\n...`
  where f1, f2, f3 are functions
  :
  ...

It executes f1, f2, f3, ... in an unspecified order. But in the following code:

print(x + y)  -- (1)

x := 10  -- (2)

y := 20  -- (3)

the order must be (2), (3), then (1), because at (1) the parser does not yet know about x and y.

I thought that this cheat - the parser boldly skips parts of the source code until it encounters "\n\n" and defers parsing of separate expressions until all their subexpressions are known - would be enough. But lambdas have caught me out, because there is no such cheat that is also general enough.
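
For illustration, the "\n\n" cheat can be modelled in a few lines of ordinary code. This is a toy Python sketch under simplifying assumptions (identifiers are plain words, and a chunk "defines" a name when it contains `name :=`); it is not the actual parser:

import re

def order_chunks(source):
    # Defer chunks that mention names not yet defined (toy model of the "\n\n" cheat).
    chunks = [c.strip() for c in source.split("\n\n") if c.strip()]
    defined, ordered, pending = set(), [], list(chunks)
    while pending:
        progress = False
        for chunk in list(pending):
            defines = set(re.findall(r"(\w+)\s*:=", chunk))
            uses = set(re.findall(r"[a-zA-Z_]\w*", chunk)) - defines - {"print"}
            if uses <= defined:          # every name the chunk uses is already known
                ordered.append(chunk)
                defined |= defines
                pending.remove(chunk)
                progress = True
        if not progress:                 # unknown or cyclic reference: give up
            ordered.extend(pending)
            break
    return ordered

print(order_chunks("print(x + y)\n\nx := 10\n\ny := 20"))
# ['x := 10', 'y := 20', 'print(x + y)']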

Parser extensibility, limited grammar and ambiguity resolution

Agreed with that, and I have already taken those into account.

In the COBOL days of

In the COBOL days of computing science ...

I like your snarky comments.

SQL, with its 300+ keywords, is among the biggest languages in use and is another example of the design principles mentioned. I guess that's mostly the effect of not being able to write functions in SQL directly, so every possibility had to be foreseen, specified and standardized in advance.

Isn't this precisely how the world, the plants, the animals and finally we humans were planned by the great designer, according to creation scientists relying on the undeniable truths of the Bible? There we have our scientific reference.

Maude is related

The Maude language has an extensible mixfix parser which handles a lot of the meta-syntactic notational ideas described here. It also includes annotations for operators to indicate idempotency, associativity, reflexivity, and the like, both for the parsing purpose and to guide the evaluation strategy for the operator.

Might not be obvious how to apply this here

The issue is that the obvious semantics has lousy algebraic properties, properties that I think Maude's parser would need in order to find a decent evaluation strategy.

No disagreement

Clearly there's some distance between the contexts. Maybe awareness of what has been achieved there will be helpful to him.

Thanks

Thanks for the reference. At least I have found some interesting ideas. For example, I was afraid that absolute priorities are not suitable for large and complex programs, but if used properly they can be as usable as relative priorities. I should think about this.

Resolving operator precedence?

I'm curious to hear more about how operator precedence is determined when it looks like you're just anchoring one operator off of another. For example, what is the mechanism that parses min < x < max as that implementation instead of, say... ((min < x) < max)?

This is one of the core non-trivialities in my current work, and I am curious how my approach differs.

GLR with knowledge of types

I thought about ambiguities like "`min < x < max` vs. `((min < x) < max)`" for some time and came to the conclusion that there is no way to resolve such ambiguities without knowledge of types. So parsing techniques that take types into account should be used.

But I'm rather unfamiliar with such techniques, so I do it with an enumerative technique: parse input expressions with a GLR-like parser and reject those parses for which type checking fails.
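
To make the idea concrete, here is a toy Python sketch of that enumerate-and-filter step; the tree encoding and type environment are made up for illustration and are not the language's actual machinery:

# Candidate parses of "min < x < max": the chained form and the nested form.
candidates = [
    ("chain<", "min", "x", "max"),        # {min} < {var} < {max}
    ("<", ("<", "min", "x"), "max"),      # ({min} < {x}) < {max}
]

env = {"min": "Number", "x": "Number", "max": "Number"}

def type_of(tree):
    if isinstance(tree, str):
        return env[tree]
    op, *args = tree
    arg_types = [type_of(a) for a in args]
    if op == "chain<" and arg_types == ["Number"] * 3:
        return "Boolean"
    if op == "<" and arg_types == ["Number", "Number"]:
        return "Boolean"
    raise TypeError(f"{op} is not defined for {arg_types}")

well_typed = []
for tree in candidates:
    try:
        type_of(tree)
        well_typed.append(tree)
    except TypeError:
        pass

print(well_typed)   # only the chained parse survives: '<' on (Boolean, Number) fails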

What error messages does the user see?

The approach you describe seems to have a serious flaw: when the parse that the user "meant" to write has a type error, you will reject it. This could either result in a different parse being accepted (and accidentally changing the meaning of the program), or more likely cause parsing to fail.

How can you possibly give the user good messages for type errors if you reject any parse that had type errors?

The weakest point of my design

Yes, error reporting is the weakest point of my design. But as far as I know, this problem (adequate error reporting) isn't completely solved anywhere yet, is it?

So the only way to discover an optimal error-reporting strategy is to try writing programs in my language. My intuition tells me that a very good strategy will be to report on the parse in which the first error is as far along the program as possible.
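
Stated as code, the heuristic is tiny; a sketch in Python with a hypothetical record of rejected parses:

from collections import namedtuple

# Hypothetical record of a rejected parse and where its first error occurred.
FailedParse = namedtuple("FailedParse", "tree first_error_offset")

def pick_parse_to_report(failed_parses):
    # Report the parse whose first error is furthest along the program text.
    return max(failed_parses, key=lambda p: p.first_error_offset)

attempts = [FailedParse("parse A", 12), FailedParse("parse B", 57)]
print(pick_parse_to_report(attempts).tree)   # parse B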

As regards accidentally changing the meaning of the program: yes, it is a problem, and there is no fully adequate way to solve it. But, again, let's turn to practice! Just imagine a program where one typo may change the meaning of the whole program - can the programmer himself understand such a program better than a machine can? And how often do programmers actually write such ambiguous programs? How often do people write texts like "shi shi shi shi shi" (Chinese) or "kosil kosoy kosoy kosoy" (Russian, roughly "the cross-eyed one mowed with a crooked scythe")? Quite the contrary: everyone tends to speak unambiguously, so that everyone else can understand.

But, OK, the problem may still arise. There is, however, a strong type system in my language as a second line of defence, and it is very unlikely that a program with accidentally changed meaning would pass the type check. After all, programmers tend to use type systems to defend themselves from bugs, including mistakes like the one under discussion.

As for operator precedence...

As for operator precedence, you are right: I'm anchoring the priority of an operator off of another one (in a function declaration the user can write: "priority is higher than that of `xxx`", "priority is lower than that of `yyy`", "priority is the highest", "priority is the lowest"). In fact, this is a rather poor choice which leads to many ambiguities and design problems... for the compiler developer. For programmers writing large and complex programs in my language, this scheme will at least be more flexible than, for example, Haskell's "absolute priorities" scheme.

In other words, in theory the "anchoring of priorities" scheme leads to very ambiguous systems of operator precedence, but in practice the programmer uses a small subset of operators in any given expression, and for this small subset the operators' priorities can be defined precisely enough.

Also, priority is not required to be defined for absolutely all functions. It must be defined only for the functions that require it (functions like `{x} + {y}`, `{x} ? {y} : {z}`, etc.).

And I guess then you'd

And I guess then you'd generate an error if you have an expression where the relative priority of operators contained within it is... indeterminable (for lack of a better term)?

Yes. In the final program...

Yes. In the final program everything must be clearly defined.
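
For what it's worth, one way to implement that check is to keep the declared "priority is higher/lower than" facts as edges of a directed graph and compare two operators by reachability; if neither is reachable from the other, their relative priority is indeterminable and the expression is rejected. A small Python sketch with placeholder operators:

# An edge a -> b means "a has lower priority (binds less tightly) than b".
lower_than = {
    "+": {"*"},
    "-": {"*"},
    "*": set(),
    "<": {"+", "-"},
}

def binds_tighter(a, b, seen=frozenset()):
    # True if b is reachable from a, i.e. a has lower priority than b.
    if b in lower_than.get(a, ()):
        return True
    return any(binds_tighter(m, b, seen | {a})
               for m in lower_than.get(a, ()) if m not in seen)

def compare(a, b):
    if binds_tighter(a, b):
        return f"{b} binds tighter than {a}"
    if binds_tighter(b, a):
        return f"{a} binds tighter than {b}"
    raise ValueError(f"relative priority of {a} and {b} is undefined")

print(compare("<", "*"))        # "* binds tighter than <" (via '+' or '-')
# compare("+", "-") would raise: no declared relation between '+' and '-'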

Ensure < is associative

You can get < to work by hacking it as a polyadic operator that is syntactic sugar for the conjunction of a sequence of comparison expressions. I'm sure I've seen this done somewhere. But I think you can do something more principled if you make the comparison operators have a result type with a richer semantics than Booleans.
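
(Python's chained comparisons are one existing instance of exactly that sugar: a < b < c evaluates as a < b and b < c, with b evaluated once, and mixed chains such as 2 > 1 < 3 are accepted rather than rejected.)

x = 5
print(0 < x < 10)       # True: desugars to (0 < x) and (x < 10)
print((0 < x) < 10)     # also True, but only because True < 10 holds - not what was meant
print(2 > 1 < 3)        # True: Python tolerates mixed chains instead of rejecting them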

As an aside, we probably don't want to tolerate incoherent expressions such as "2 > 1 < 3". While we might reasonably claim that it is true, since 2 > 1 and 1 < 3, these expressions are nicer to work with if the truth of the whole expression entails nontrivial relations between any two elements of the expression by transitivity - but of course no particular relation between x and y follows from "x > z < y". So I propose that such incoherent expressions should raise exceptions.

Here's a suggestion for a semantics for such expressions, which I'll call the expressions of the mixquality type. The denotation of such expressions is a 4-tuple, <v: Bool, c: Bool, l: Ord, r: Ord>, where v is the truth value of the expression, c is the coherence of the expression, and l and r are numbers or order values denoting the left and right extent of the expression.

Then any number n can be cast into a singleton mixquality with denotation <true, true, [[n]], [[n]]>, and if A and B are mixqualities with denotations <va, ca, la, ra> and <vb, cb, lb, rb>, then the mixquality A<B has denotation <va ∧ vb ∧ [[ra<lb]], ca ∧ cb ∧ (va → [[la <= ra]]) ∧ (vb → [[lb <= rb]]), la, rb>. The same pattern yields a semantics for the other comparison operators.

If we can show that this semantics of mixqualities is associative, then we are done. Some further points: (i) coherence should be inferable statically; (ii) this allows us to distinguish between safe uses of mixqualities, which raise exceptions when incoherent, and unsafe uses, which do not; and (iii) you might like to have expressions of the form "0 < x, y, z < 1", which would need an extension of the semantics.

Minor bug

In your semantics, the expression "3 < x > y" is coherent if x and y happen to both have value 3. Rather than trying to deduce whether a "mixquality" is increasing or decreasing based on the end values (which doesn't work when they're equal), I think you should have a direction flag in your semantics.

Good catch

I was sure I must have made a mistake or two in the above. So my above denotation of a mixquality isn't enough to determine the relation, which is morally right.

I think we can generalise to having mixqualities being one of {<, <=, =, >, >=, incoherent}.

I think you can lump < and

I think you can lump < and ≤ together, and likewise > and ≥. You're right, though, that you need to keep the = case separate. You could label the cases {>, =, <, incoherent} as {-∞, 0, +∞, NaN}, and then the rule for combining them is addition (where -∞ + ∞ = NaN).
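
IEEE floating-point arithmetic already behaves that way, so the combining rule can be checked directly; a quick Python illustration of labelling {>, =, <, incoherent} as {-inf, 0, +inf, NaN}:

from math import inf, isnan

GT, EQ, LT = -inf, 0.0, inf     # '>' as -inf, '=' as 0, '<' as +inf

def combine(a, b):
    # Combine two chain directions by addition; inf + (-inf) is NaN, i.e. incoherent.
    return a + b

print(combine(LT, LT))          # inf  -> still a '<' chain
print(combine(LT, EQ))          # inf  -> '<' absorbs '='
print(isnan(combine(LT, GT)))   # True -> mixing '<' and '>' is incoherent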

This is a very particular case

Such a parsing problem arises with any pair of operators of the form `{x} a {y}` and `{x} a {y} b {z}`, regardless of the types of x, y and z. And this is a particular case too; one can generalize the problem far more.

Particulars and patterns

Agreed. The wider suggestion is to avoid trying to fix these cases by being clever with parsing rules, and instead to think about what the meaning of the expressions should be and to use more informative types to support them.

There's no reason I can think of why this couldn't be supported by a language extension mechanism.

Perhaps heading off topic, but...

I prefer a staged semantics for this kind of thing. Rather than giving a single denotation for the expression which encodes both the meaning of the syntactic form and the value of the form applied to its arguments, I would have a value for the syntactic form which, when applied to proper arguments, produces a value of type Boolean.

Also, returning closer to topic, in response to the idea that this is just an instance of a more general pattern, can someone give an example of another instance of this pattern? The inequality chain shorthand seems to me to have some rather unique characteristics.

There's probably quite a few

There's probably quite a few things that follow the syntactic pattern if not the semantic.

The only scenario I can envision that shares the behavior is something like a declarative workflow where the ops bind to the parameters instead of consuming their results.

Can you explain how fields work?

In your snippet:

`{p}_{x}`
  where p is Point; x is Number;
  returns Number
  :
  field  -- That's how fields and variables are declared.

Why is x a Number and why is it in braces? I would expect that to accept things like "somePoint_3". How does that work? I would have expected:

`{p}_x`
  where p is Point
  returns Number
  :
  field  -- That's how fields and variables are declared.

BTW, is this implemented or are you the only compiler for this language at this moment?

Typos

You are correct. I have fixed this in my post.

And, yes, currently I'm the only compiler for the language.

It's been covered here before, but...

When announcing a new language idea or implementation, it helps a lot if you start with a clear declaration of your goals. What domain is your language intended for, or what problems is it intended to solve? What languages are currently being used for those domains/problems? How (and why) does your design differ?

It is impractical for people to evaluate a design (or design ideas) without knowing the context in which they will be applied. What works well for a language for numerical computation on GPUs may be very different from what works for building a rich GUI application.

Custom binding constructs

Preamble

In my language there are macros. Macros are functions which are executed without creating a separate scope (or, in other words, they are executed in the context of the caller).

Computations in my language do not change any external state, and the whole program's state can be reverted to a previously remembered point. If some external state must be changed or read, then the code using this external state is deferred... into the compiled executable. So the program

`main`:
  print 10!

will be compiled into the equivalent C program:

#include <stdio.h>

int main(int argc, char **argv)
{
    printf("3628800");
    return 0;
}

Ah, and "regexp xxx as arg" construct in arguments results in String:

`Let's meet {regexp [A-Z][a-z]* as name}`:
  print name.

The comment as such

I have thought out how to resolve the problem of custom binding constructs. Let's introduce a special type, "Variable". Instances of this type will contain the variable's name and all the type information for the values which can be bound to that variable. Also, let these instances be automatically cast to the values they are bound to.

Also let the user have special functions:

  • "`declare variable {name}` where name is String" which declares a variable with specified name and makes it accessible in current context.
  • "`f({var_1}, {var_2}, ..., {var_n}) = {expression}` where var_i is Variable" which returns a newly constructed function.
  • "`variable {name}` where name is String; returns Variable" which returns already declared variable.

Then let parsing be lazy: the parser first executes a function and only then tries to recognize it in the input stream. Example:

`({regexp \w+ as var1}, {regexp \w+ as var2}) => {expression}`  (1):
  declare variable var1;
  declare variable var2;
  f(var1, var2) = expression;

When the parser encounters the expression

(x, y) => x + y

it executes (1). The first statement says to declare the variable "var1". But "var1" is not declared yet, so the parser starts parsing the input until "var1" is declared. Thus it consumes "(x".

Then it declares variable "var2" and consumes ", y".

And after that it creates a function and tries to consume ") => x + y" from the input. But! The parser already knows about "x" and "y"! And it knows that they are variables! So it can parse the rest of the input stream without errors.

Also reversed binding constructs become available:

`{expression} where {regexp \w+ as name1}, {regexp \w+ as name2} are free variables`:
  return f(variable name1, variable name2) = expression.

macro `{regexp \w+ as var name}`:
  declare variable var name.

`main`:
  f := x + y where x, y are free variables;
  print f(10, 20).

Post Scriptum

I think all these tricks are solutions to particular cases only; a fundamentally different approach to parsing is needed. But I have run out of ideas.

And please tell me: where can I put together a team, so that I don't have to work on this language on my own?

Idea

I've got an idea. Let the programmer describe the parsing of the syntax himself:

<func>
  <syntax>(x, y, ...) => {expression(x, y, ...)}</syntax>  -- This will be a comment in this case.
  <parsing>
    -- And the parsing procedure is described in terms of the language itself!
    consume "(";
    arguments := new List of Variable;
    until (next characters are ")"):
      argument name := consume regexp "\w+";
      add (declare variable "#{argument name}") to arguments;
      consume ", " if present;
    consume " => "
    lambda body := parse next expression;
    return new Function(arguments, lambda body).
    -- ...or something like this.
  </parsing>
</func>

The only problem left is to design a convenient API to the parser.
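
To get a feel for what such an API might look like, here is a rough Python analogue: a tiny cursor object offering consume, consume_regex and a deliberately naive parse_expression, plus a user-written procedure for the `(x, y, ...) => expression` form. All names here are made up for illustration; this is not the language's real parser interface:

import re

class Cursor:
    # A minimal parser API: consume literals and regexes from a source string.
    def __init__(self, text):
        self.text, self.pos = text, 0

    def peek(self, s):
        return self.text.startswith(s, self.pos)

    def consume(self, s):
        if not self.peek(s):
            raise SyntaxError(f"expected {s!r} at position {self.pos}")
        self.pos += len(s)

    def consume_regex(self, pattern):
        m = re.compile(pattern).match(self.text, self.pos)
        if not m:
            raise SyntaxError(f"expected /{pattern}/ at position {self.pos}")
        self.pos = m.end()
        return m.group(0)

    def parse_expression(self):
        # Naive stand-in for "parse next expression": take the rest of the input.
        rest = self.text[self.pos:]
        self.pos = len(self.text)
        return rest.strip()

def parse_lambda(cur):
    # User-written parsing procedure for the form "(x, y, ...) => expression".
    cur.consume("(")
    params = []
    while not cur.peek(")"):
        params.append(cur.consume_regex(r"\w+"))   # "declare variable <name>"
        if cur.peek(", "):
            cur.consume(", ")
    cur.consume(")")
    cur.consume(" => ")
    body = cur.parse_expression()                  # body parsed with params in scope
    return params, body

print(parse_lambda(Cursor("(x, y) => x + y")))     # (['x', 'y'], 'x + y')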

Not new

This is, I think, not a particularly novel idea. The main issue with it is that even ordinary library developers will want to write (or at least interface with) parsers. If extensible syntax is the nugget your language is built around, shouldn't that be easy/common?

Still easy

Simple patterns are written easily:

`salary of {e}`
  where e is Employee
  :
  ...

Complex patterns are written with difficulty:

`direction of look of {character} in <image>{picture}</image>`
  parsing
  :
  ... -- Hard and intelligent code using
      -- artificial neural networks.

But they are still writable.

And the programmer may even write his own patterns of syntax description:

`{function name}({args}) = {expr};`
  parsing
  :
  function name := consume until "(";
  args := consume until ")";
  args := args.split(", ").map { |arg| declare variable arg; }
  consume " = ";
  body := parse next expression;
  consume ";"
  declare function (function name) with arguments (args) and body (body);
  -- ... or something like this.

Where does it end?

From the looks of this design, you're imagining that each construct gets to decide where it ends. That means that if you have, e.g., a function call form 'f(x,y,z)' and you encounter a use of it 'f(e,g,h)', the subform 'e' could hijack the parse and consume the subsequent commas, expressions and even the closing parenthesis. This is ugly to me, since I think it's the job of the language to specify and enforce syntactic conventions that will allow independently written parsers to play nicely together.

But here's a more objective problem: do you intend to allow forward references? In many languages you can write something like this:

let
      x = ... y ...
      y = ... x ...

If we generalize to x and y as syntactic forms, how can the y declaration specify the parser that determines where the previous use of y ends?

As for "hijacking" - yes,

As for "hijacking" - yes, and I have an example of the form with the similar problem: "(x, y, z) => x < y < z". When the "C# lambda" syntactic form consumes "=>" it then consumes "an expression". But it may consume both "x < y" and "x < y < z". And that's where GLR-like parsing comes in use: when function "parse next expression" is called the parser forks. And both alternatives are being parsed in each fork.

And the parser reports no errors only if exactly one fork survives.

As regards

let
      x = ... y ...
      y = ... x ...

Let's again turn to practice: a computer parser cannot be smarter than a human one. So it is up to the developer of the parsing procedure to apply restrictions to the syntactic form, or he will not be able to write the parsing procedure at all. And yes, the developer should first unambiguously split the text into separate declarations and only then parse them.

My task here is only to provide an API to the parser and its symbol table that is convenient enough.

Related thread

Extensible syntax has been discussed in the past few days in the thread The Value of Syntax?.