Good Language Design Principles for Scripting Languages

I work for a vendor of an application for the management of trades. We would like to extend the application with a simple scripting language for batch processing. The idea is that someone non-technical could use a script to schedule processor-intensive tasks, such as pricing a portfolio, to run overnight. What we have in mind is something very simple: just a few verb/noun pairs, e.g.

import "fixing.dat"
price "portfolio a"

The script would then be run by a scheduler.

However, I vaguely remember a discussion on LtU about Ant starting off like this with relatively simple objectives, but getting a bit bogged down because people inevitably used it for more complex scripting tasks. I think the designers of the language ended up saying they'd have been better off applying good language design principles up front. So my question is: what sort of good language design principles should we be applying to a little scripting language like this? Or would it be better to use an existing scripting language? Part of the application is written in Caml, so this would probably be the scripting language of choice, but I'm a bit worried it would be overkill.

Overkill

Writing your own script engine sounds like more overkill than using an existing script engine. My personal choice would be JavaScript or Ruby, but that also depends on your target audience and the availability of the script engines for the platform/programming language of your application.

Good question

"Extension languages" is one of my favorite topics, and sadly it's one in which there's been very little progress, even though the sad state of the art is pretty well known...

Anyway, you're completely right to observe that little languages like this (and like Ant) more often than not escape the bounds of their original designs and accrue all kinds of rather unpleasant features. I definitely agree with Sjoerd that you should use an existing engine if at all possible.

His specific suggestions are good, but of course I've also got my own opinion: I'd take a look at Lua. It's a nice little language which evolved from a configuration language into an elegant general-purpose scripting and extension language (sound familiar? The history is documented here). Since its niche has always been embedding into another application, it's pretty easy to work with and there are many syntactic niceties aimed at representing configuration data. There's a large community and a very good book.

Whatever you decide to go with, please do try to embed an existing interpreter if at all possible. It might be a bit more pain up front, and might be a little less fun and excitement, but I promise you'll be glad for it later.

OT: talking to C++?

A question for those folks who are into extension languages: what about the other way around? I'm desperately seeking something like a Scheme that can easily call C++. I've seen languages that can pretty much automagically call C without having to go through SWIG or something similarly laborious and painful. But C++ seems to be a tough nut to crack, and as far as I can see nobody has done it. (I wouldn't know how personally; I'm just hoping somebody does and already has. :-) The thought of having to use SWIG and maintain the bindings is not pleasant, but I suspect all the issues of name mangling, memory, and resource lifecycles add up to make it hard to keep simple.

Talking to C++

The main problem with talking to C++ is that to get benefits over calling procedural routines, you'd need to have objects within your scripting language. And that, in turn, means that you'd either have to adopt quite a bit of the C++ object model into your scripting language or restrict the types of C++ objects you could interface to.

Neither of these options shows much utility over simply calling C routines, and the implementation takes a lot more work (especially since each of the different C++ compilers has its own name-mangling scheme you need to adapt to for linking).

If you really want to talk to C++ objects from within your language, it's a lot easier to externalize the objects via C functions or use CORBA ORBs that both provide a common object model and bind to C++.
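
As a rough sketch of the "externalize via C functions" route (with OCaml standing in for the extension language here, and every name below invented for illustration), the scripting side only ever sees an opaque handle plus a few extern "C" wrapper functions compiled alongside the C++ code:

  (* Hypothetical bindings: the ml_portfolio_* names are extern "C"
     stubs that wrap a C++ Portfolio object behind an opaque handle. *)
  type portfolio                                    (* abstract handle *)
  external portfolio_open  : string -> portfolio = "ml_portfolio_open"
  external portfolio_price : portfolio -> float  = "ml_portfolio_price"
  external portfolio_close : portfolio -> unit   = "ml_portfolio_close"

The C++ object model never crosses the boundary: the scripting side deals purely in handles and plain functions, which is precisely the "not much utility over calling C routines" trade-off described above.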

Felix

I recommend taking a close look at Felix. It's a lot more like ML than anything else, but it integrates directly with the C++ object model, which seems to be what you're looking for.

Right!

I'd plumb forgotten about Felix, even though it is filed away somewhere in my mind. Many thanks for the reminder!

Lua

I second the recommendation for Lua. It's very small, very fast, very well documented, has an excellent C API, and is extremely flexible as a programming language for when you need the generality.

Lua-ML

And for the Caml part, there is Lua-ML, an OCaml implementation of Lua 2.5, as mentioned before.

But what if...

The suggestion of using an existing scripting language (especially one designed with embedding in mind) is, of course, the expected and reasonable answer. But what if there are reasons why you can't do this? For example, your users may lack even the basic programming skills required for using JavaScript etc. Another reason (close to my heart) is that your users may be uncomfortable with English (say they are Hebrew speakers), so you need a language that isn't based on English.

What then?

Domain specific library

In JavaScript, and even a bit more so in Ruby, you can write a domain-specific library that gives the code a very natural feel. With a few examples that a user without programming skills can copy and modify, you can go a long way. In fact, that is how a lot of websites are built. And JavaScript even supports Unicode identifiers, so you can create a Hebrew library if you want.

Unskilled users don't mind a bit of syntax. People have a natural ability to ignore the things they don't understand; that's how children learn to speak. Excel uses some syntax too, and that doesn't seem to scare anyone.

I think it is very important that a DSL is implemented as a library in a complete scripting language, because that will allow your unskilled users to grow into skilled users.

I think it is very important

I think it is very important that a DSL is implemented as a library in a complete scripting language

I wasn't arguing against this, of course. But I think you may be a bit too optimistic about users. They come in all shapes and sizes... Many of them will object to Excel.

If that is the case

Won't those users object to any form of text based configuration?

They will probably resist

They will probably resist it, yes. But I think it may in practice be possible to persuade them, if enough real use cases are presented; that is, provided the language is powerful and offers real flexibility.

But for this you may need a more capable language, raising the barrier to entry, right? A Catch-22 situation, I agree, but one that can be overcome, I hope.

As a personal anecdote

As a personal anecdote, I once demonstrated a prototype task-automation system using a library-based DSL running on Python. The audience was made up of "knowledge engineers", primarily librarians and other researchers. Their technical expertise ran from "knows how to edit text files" to "has seen some code in a few languages". Before the demonstration, several were asking why they should waste their time attending, as this technology would mean nothing to them. After the demonstration, they were almost unanimously enthusiastic about using the system immediately.

The key, I think, was a combination of a reasonably intuitive (and regular) syntax from Python, a pithy DSL that fit closely with the tasks at hand, and a demonstration that directly addressed their daily workload. If you can show someone that they can learn to use a tool to automate a task in less time than it takes to perform that task once, they'll usually be hooked.

Prolog

in that light, it has always seemed to me that Prolog should be a requirement of high school education. :-)

The key, I think, was a

The key, I think, was a combination of a reasonably intuitive (and regular) syntax from Python, a pithy DSL that fit closely with the tasks at hand, and a demonstration that directly addressed their daily workload.

That's, of course, quite a challenge...

What indeed?

Well, practically speaking, you can get part of the way there... The Lua team, for example, has made an effort to keep the interpreter code relatively simple and comprehensible. This fosters a thriving community of optional patches, but more to the point, makes it pretty simple to do things like change the native language of the keywords. (On the other hand, native language may be baked into PLs in a deeper structural way, as has been discussed here before, but this is probably outside the scope of this thread).

In any case, I think the real point you're bringing up is a little deeper than keyword internationalization, and I don't mean to be facile. I think that there's a huge amount of really interesting work to be done at the boundaries between configuration, scriptability and extensibility, and in another dimension between reflection, dynamic loading, and language embedding.

It would be great if there were more ways for programmers to make their software extensible, to avoid the tedium of FFIs, interpreter APIs, etc., and to focus on precisely what form this extensibility should take. Real progress here probably requires work on both the host language and the extension language and there are a lot of open questions about, for example, just how much the data and evaluation models of the two languages need to match, handling security, etc.

I'd love to see more work in this area.

DSL + General purpose language

Next week I will present EasyExtend at EuroPython 2007. It addresses some of these issues, in that one can define a domain model independently, conceptualize the syntax and semantics of a DSL accordingly, and extend the host language (Python in this case) using the DSL on demand. Note that the grammar of the DSL (specified in EBNF) is completely independent of the host language grammar. The only constraint is the power of the current top-down parser.

A good design practice in this particular context would be to design the DSL with conservative extension of the host language in mind. So instead of defining your own imperative loop, just take the existing one. Also avoid name conflicts with keywords: use load instead of import for data files. This is future-proof. Using a couple of new domain-specific operators like price is very unlikely to cause harm.

Camlp4

So my question is: what sort of good language design principles should we be applying to a little scripting language like this? Or would it be better to use an existing scripting language? Part of the application is written in Caml, so this would probably be the scripting language of choice, but I'm a bit worried it would be overkill.

Integrating a full-blown external scripting language like Lua, JavaScript, Scheme, etc. can be its own kind of overkill, particularly if your initial needs are so simple. If you're already using (O)Caml, an obvious choice here would be to expose the functionality you need in a mini-language using Camlp4. Of course, you could just let users use OCaml directly, but Camlp4 will allow you to restrict the functionality they have access to.

For good language design principles, one approach would be to more or less directly expose an OCaml subset, so that growing your language would just mean exposing more Caml features. It's also easy to offer a universal type so that your scripting language is dynamically typed, if you want it, but of course then you'll be defining a more custom language.
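
To make the suggestion concrete, here is a minimal sketch (not tied to Camlp4, and with the host functions load_fixings and price_portfolio as invented placeholders) of the verb/noun commands reduced to an OCaml datatype plus a universal result type:

  (* Sketch only: the verb/noun script is parsed into commands, and a
     universal value type keeps the scripting layer dynamically typed. *)
  type value =
    | Unit
    | Num of float
    | Str of string

  type command =
    | Import of string              (* import "fixing.dat"  *)
    | Price of string               (* price "portfolio a"  *)

  (* Placeholders standing in for the real application code. *)
  let load_fixings file = Printf.printf "loading %s\n" file
  let price_portfolio _name = 0.0

  let run = function
    | Import file -> load_fixings file; Unit
    | Price name -> Num (price_portfolio name)

Growing the language then means adding constructors (and, eventually, exposing more of OCaml itself), while the scheduler only ever calls run on a list of parsed commands.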

Previously On LtU

A comment on Graydon Hoare's One-Day Compilers presentation, implementing a simple DSL-to-C compiler using camlp4 and, of course, OCaml.

Implementing verb/noun combination in Camlp4

That's a great presentation, useful to get familiar with Camlp4. But it's worth noting that the job of exposing a limited subset of OCaml itself is quite a bit simpler than what the presentation describes. All you really need are the grammar rules for the bits of OCaml syntax you want to support, which maps quite easily to code that generates the OCaml AST for that syntax. Those rules can be adapted from the Camlp4 grammar for OCaml.

So for example, the rule for single-argument function application (given here without the necessary context) is as simple as:

[ e1 = expr; e2 = expr -> <:expr< $e1$ $e2$ >> ]

I.e. when two adjacent expressions (e1 and e2) are matched, an application expression is generated.

That takes care of the verb/noun combination mentioned in the topic; we're about 1/3rd done with the language! ;)

The resulting language, with appropriate configuration, will be usable from within the OCaml interactive loop, or can be translated to OCaml source, or compiled to bytecode, or native code, etc.

BeanShell?

Perhaps BeanShell is a good solution: the power of Java combined with the power of scripting.

DSSLs

Those interested in extensible languages may be interested in my current efforts to implement the long-planned Domain Specific Sub-Language (DSSL) facility of Felix.

A DSSL isn't the same as a standalone DSL; rather, it is an extension of an existing language, hence the "Sub" above. In Felix, you start with some grammar, and you can add new grammar productions grouped under some name like "SQL". Later you can open this package to make the syntax available. These syntax extensions are properly scoped.

Currently it isn't quite working, but almost the whole of the original hard-coded Felix grammar has now been put into the standard library.

The compiler is written in OCaml, and the technology I'm using to do this consists of the extensible GLR parser Dypgen and OCS Scheme, both of which are written in OCaml.

The user actions of the new grammar productions are written as strings containing a Scheme program, which is executed when a rule is reduced; the Scheme value it returns is then translated to a Felix AST term.

For example:


 satom := match sexpr with smatching+ endmatch =># 
    "`(ast_match (,_2 ,_4))";

extends the atomic expressions of the language with a matching construction. Here match, with, and endmatch are keywords, whilst sexpr and smatching are nonterminals. The _2 and _4 notation in the Scheme string refers to the values returned by the user actions when reducing those nonterminals. The returned s-expression is translated to a Felix AST term encoded with OCaml polymorphic variants.

There are two restrictions on this mechanism. First, the lexer is hard-coded, so the available tokens are predefined and can't be changed (although it is possible to define new keywords, you can't define, say, a new money token). This restriction will hopefully be lifted down the track.

The second restriction is that the final Felix AST created must be of the predefined static type, and no new processing such as type checking can be implemented. I hope to lift that restriction too after OCaml 3.11 is released, which will allow dynamic loading of native code.

The ability to define new syntax in such a flexible manner leads to many interesting issues I'm just discovering: usability, dependencies, and many other fun things are emerging as problems to consider.

Nice to see these ideas are

Nice to see these ideas are emerging in parallel in completely different projects (see my reference to EasyExtend above). Feels like a renaissance of grammars + parsers to me. Suddenly transforming languages with more complex grammars than Lisp (or XML) becomes a practical issue again.

The ability to define new syntax in such a flexible manner leads to many interesting issues I'm just discovering: usability, dependencies, and many other fun things are emerging as problems to consider.

I subscribe to that.

I'm also working on a parsing-related project

I would invite you to check out my work-in-progress: a parsing engine designed to be language-agnostic and to improve on the Yacc/Bison/ANTLR approach:

http://wiki.reverberate.org/index.php?title=ParsingEngine

In the cited reference Josh

In the cited reference Josh writes:

keeping actions out of grammar files, so the grammars can be reused

This idea is simultaneously desirable and untenable.

It's desirable because if you have a compiler using a grammar, you probably also want a document generator with quite different user actions for the same grammar, or you may want to emit XML.

On the other hand, it is untenable because the user actions are intimately tied to the detailed structure of the grammar, and separating the actions out forces the introduction of an additional layer of complex coupling, which is likely to break if you change the grammar. I have already abandoned the Felix document generator because I couldn't keep the generator actions in sync with the grammar, precisely because they were physically separated.

Perhaps a better solution is to minimise, but not eliminate, the work done in user actions: generate an AST which can then be post-processed by a variety of processors. The AST acts as the coupling, but is designed to be slightly more robust and higher-level than the grammar. XML is sometimes chosen as the textual representation of such an AST, and has the advantage that a DTD can be specified, allowing third-party tools to verify conformance.

Still, I am seriously thinking of a concrete syntax which allows a record of user actions for each production, and of parameterising the parser or parser generator with a selector, so one file can build multiple parsers for the same grammar. This keeps the actions and the productions they process textually close, and in particular means the association is anonymous rather than requiring indirect linkage via an invented name; but it also means adding a new interpretation is invasive.
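
A rough OCaml sketch of that idea (the names here are invented), with each production carrying a record of actions and a trivial selector choosing the interpretation:

  (* Sketch: one record of user actions per production; the parser
     generator is instantiated with a projection picking which
     interpretation to run on reduction. *)
  type ('ast, 'doc) actions = {
    compile : 'ast;                 (* builds an AST node       *)
    document : 'doc;                (* emits documentation text *)
  }

  (* Actions for an "application" production, two interpretations
     side by side in the same file: *)
  let apply_actions = {
    compile = (fun e1 e2 -> `Apply (e1, e2));
    document = (fun d1 d2 -> d1 ^ " " ^ d2);
  }

  (* Selectors used to build the two parsers from one grammar file. *)
  let pick_compile p = p.compile
  let pick_document p = p.document

Adding a third interpretation means adding a field to every such record, which is exactly the invasiveness mentioned above.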

Another idea is to forcibly stratify the grammar into layers of sequences and alternatives, which has the effect of forcing you to name every sequence with a distinct nonterminal. In that case an AST can be built of products and sums using the nonterminals as canonical names of the sum constructors and record projections, so there is no need to invent any names.

For example this:


  nt = a b | c d

would have to be rewritten as:


  nt => nt1 | nt2
  nt1 := a b
  nt2 := c d

where I used distinct operators to denote alternatives and sequences.
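
In OCaml terms (a sketch, with the leaf nonterminals left abstract), the stratified form maps directly onto sums of products whose constructor and field names are taken from the nonterminals, so no names need to be invented:

  (* Leaf nonterminals left abstract for this sketch. *)
  type a
  type b
  type c
  type d

  (* nt1 := a b   and   nt2 := c d   become products ... *)
  type nt1 = { nt1_a : a; nt1_b : b }
  type nt2 = { nt2_c : c; nt2_d : d }

  (* ... and nt => nt1 | nt2 becomes a sum whose constructors are
     named after the nonterminals. *)
  type nt = Nt1 of nt1 | Nt2 of nt2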