## Extensible Term Language

I'm currently working on an open source project whose
goal is to create a language definition framework
that can be used as a textual DSL construction kit.
The framework is currently named Extensible Term
Language (ETL). This language definition framework
is very similar in architecture to XML. The framework
has just reached its first public version.

* There is a language definition language that is defined using
the framework itself (in this respect it is more like XML Schema or
Relax NG than DTD). This is a basic dog-food test for
such a framework.

* It works with plain text.
- Non-ETL-aware editors can work with languages defined using
ETL.
- There is no special hidden markup.
- It is possible to have and edit incorrect text. Even if
the syntax changes (for example, some keyword is renamed),
the source can be fixed using normal text manipulation
tools.

* It allows for agile definition of the underlying model, the language,
and instance sources.

* The syntax has an underlying document object model.

* There may be many different parser implementations and many
parser models, such as AST, DOM, push, or pull parsers.

* The language definition framework specifies syntax and its mapping to
the model rather than the semantics of the language. It is possible to
build semantics-aware tools, but they should live above the language,
as is now the case with XML.

* There are no built-in transformation facilities, but it is possible
to define such facilities using means above the framework. They
might work at the AST level or at more detailed levels (for example,
there is a tool that transforms a source file to HTML based on the
grammar definition).

* The language defines a common lexical layer and a common phrase level.

* Like XML, it allows creating reusable language modules. These
language modules can be exchanged between tools. There are a few
samples of such reuse in the package.

However, there are also differences from XML:

* ETL syntax is believed (by me) to be much more usable than XML's.
It is possible to define traditional-looking programming
languages using it. See the samples in the referenced package (for
example, there is a grammar for a Java-like language named EJ).

* One must have a grammar to derive the underlying object model from
source code. However, such a grammar may be created independently
(in that case the object model will differ from the original
intention of the author). In XML, the grammar is used mostly for
validation and for specifying the syntax of text values, and the object
model is self-evident from the source.

The project is still in the pre-alpha stage. There is a working grammar
definition language, and a few extensions are planned for it.

There is a ready-to-use parser for situations where the
grammar is static, such as command-line tools (extensions to the parser
to make it more suitable for dynamic environments like Eclipse are
already planned, and it is more or less known what needs to be done). The
parser is of the pull kind, and it is possible to build AST or DOM parsers
on top of it. For example, there are ready-to-use AST parsers that
build a tree of JavaBeans, as well as AST models generated
using the Eclipse Modeling Framework. The parser itself uses an EMF AST
during compilation of a grammar to executable form.
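The pull-parser layering described above can be sketched in Python. All names here are illustrative, not the actual ETL API: the point is only that a tree builder can be a thin layer that consumes a stream of parser events.

```python
# Hypothetical sketch of a pull-style parser interface: the caller drives
# parsing by requesting events one at a time, and an AST builder is
# layered on top. Event names are made up for this example.
from dataclasses import dataclass
from typing import Iterator, Union

@dataclass
class StartObject:      # parser reports the start of a model object
    name: str

@dataclass
class Value:            # a leaf value (literal, identifier, ...)
    text: str

@dataclass
class EndObject:        # end of the current model object
    pass

Event = Union[StartObject, Value, EndObject]

def build_tree(events: Iterator[Event]):
    """Consume pull events and build a nested (name, children) AST."""
    stack = [("root", [])]
    for ev in events:
        if isinstance(ev, StartObject):
            node = (ev.name, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif isinstance(ev, Value):
            stack[-1][1].append(ev.text)
        else:  # EndObject: close the current object
            stack.pop()
    return stack[0][1][0]

events = [StartObject("Plus"), Value("1"), Value("2"), EndObject()]
print(build_tree(iter(events)))  # ('Plus', ['1', '2'])
```

A DOM-style parser would be the degenerate case that always builds the whole tree; the pull interface lets the application stop early or skip subtrees.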

The current version can be downloaded from http://sourceforge.net/project/showfiles.php?group_id=153075&package_id=178546&release_id=391153

Please look at the file doc/readme.html for more details about
the package. The file gives references and some explanations
of the examples. There is also a document that describes the motivation
for the language.



### I am lazy

You really should put something at http://etl.sf.net/ (at least readme.html). I hate having to download and unzip a package just to read a README, and I suspect others do too. The fact that it's on SF.net makes it even more work.

### Web link: project documentation and generated XMLs

OK. I have put the documentation from the package and the XML forms generated from the samples on the site.

The same documentation is included in the package.

### I've looked over your

I've looked over your Problem Statement. Your list of problems with approach 3 is true in practice, but you should note that there are no theoretical constraints against a BNF-based LL parser generator that supports error recovery, incremental parsing, run-time grammar modification, and the other things you mention. Anyway, you should mention TXL and explain how your project is going to be better. From my recollection it already solves most of the problems with parser generators that you have mentioned. There are also many other compiler frameworks, ASF+SDF to name one, which don't exactly fit in the "parser generator" category.

Anyway, remember that syntax is the easy part of language design, and parsing is the easy part of its implementation. That's not to say that trying to improve parsers is pointless; it's just something you should keep in mind.

### ETL and ASF+SDF

I have looked at ASF+SDF again. They have made good progress since I last looked.

This is more interesting than TXL in some sense. The syntax definitions are more modular, and expressions are finally almost supported in the grammar. However, there are still problems with expression grammar reuse. Expression priorities have to be specified explicitly using the "<" operator. This might be very inconvenient for a grammar that declares a lot of operators, like C or Java. Also, right now I do not see how to implement the Java conditional operator ?:. There are also no samples of abstract grammars, and the documentation does not mention them.

Reuse of closed syntax modules looks possible. The syntax even seems to allow adding new rules for existing productions. However, I have not seen this in the samples, so possibly it is just a missed opportunity in the semantics. It also does not look possible to replace definitions. ETL allows defining extension points in the grammar, and it is possible to replace existing definitions when including other grammars (see the CommonControlFlow.g.etl and EJ.g.etl samples); whether SDF offers this is not obvious.

Also, I get the feeling that they try to couple syntax and semantics together too much. This is probably good for a compiler framework, but not so good for a language definition framework. To support just parsing, a lot of things not related to parsing have to be implemented.

The tool processing model looks quite closed, judging from the documentation. At least in all the samples they try to do end-to-end processing of the sources. The API seems to allow interaction with the external world, but it looks more complex than ANTLR's. DSLs are usually small parts of larger systems, and a DSL module usually interacts with many different components (at least in my experience with XML DSLs). The current ASF+SDF implementation seems to consider itself the center of the system. ETL parsers, like XML parsers, are specialized single-purpose components of the application that can be integrated with other components through minimal interfaces. I think this increases the possibilities for reuse.

It is also quite strange, but I am missing samples of large grammars, like C, Pascal, etc. The framework seems to allow them, but for some reason the samples are missing.

Interestingly, SDF has moved in the opposite direction from ETL. ETL has added an additional layer between the syntax and lexical layers: the phrase layer. SGLR has removed the lexical layer from the parser. This increases the possibilities for a language zoo.

So my view is that ASF+SDF is the more complete and generic technology, while ETL is simpler and more reusable. ETL also seems to have a better syntax extensibility model, though it looks like SDF might have done better even without a change in syntax. Generally, I think this is because the task in ETL has been consciously limited, and that makes it possible to do what it does better.

As a side note, the project seems to have officially ended, and their DSL effort seems to have ended even earlier. However, Bugzilla indicates that there is still some work in progress.

### Real-Life use of SDF

I can't speak for ASF, but that's beside your point anyway. We in the Stratego group use and enjoy the power offered by the SGLR parser which hides behind SDF.

A group at EPITA has used SDF and Stratego to construct a C99-compliant parser which does disambiguation based on type inferencing. They are now working on ANSI C++.

Interestingly, they used SDF to extend SDF itself with a small DSL that adds attribute grammars. The result is ESDF.

In our group, we have a fully compliant Java parser, for both 1.4 and 1.5 grammars.

Given the parser technology, dealing with ambiguous grammars is laughably easy compared to, say, LL(k)-restricted systems, but that is not to say that constructing grammars for real languages is trivial. It requires forethought and heaploads of testing.

The killer app for SDF, however, is embedded DSLs. Given a grammar for a general-purpose language (GPL), it is easy to embed a small or medium-sized DSL in the GPL. For an example, see JavaSwul.

It is, in fact, also possible to embed a GPL in a GPL; see JavaJava.

### TXL and ETL

There are a few points I want to draw attention to:

1. The project is concerned with syntax only.
2. The set of grammars supported by ETL is more limited than that supported by BNF parsers. By limiting the class of languages, new interesting things can be done. ETL is a higher-level language than LL grammar definition languages. It is like C compared with assembler: you can do much more in assembler, but C has greater usability.

Note that I have read the TXL docs only briefly, so I may still be missing something.

Now I will proceed to your points:

1. TXL and ETL compete only at the syntax level. ETL-based languages might make use of the TXL pipeline further along the processing chain.
2. Syntax is easy only until you try to compare it with XML syntax definition. The quality of the toolset and of the grammar definition languages for XML is much greater than for LL or LR parsers, and the syntax definitions are portable between parsers.
3. The goal is not to replace LL or LR tools (except in the sense that C has pushed assembler into more niche tasks). The goal is to create a toolset for working with human-readable and human-writable languages, and that toolset should have usability properties similar to the XML toolset's. So the goal is to drive XML out of the segment where XML's verbosity gets in the way. Note that there are a lot of DSLs that use XML only because of the quality of the toolset and the portability of grammars. Easy examples are WS-BPEL and numerous mapping languages like the Hibernate mapping language.
4. Error recovery is more usable in ETL because it is generic and one does not have to write code to implement it. There is a phrase syntax (which limits the set of available languages), and error recovery is based on it. The grammar Fallbacks.g.etl does not contain any special programming to support error recovery. You can see error recovery in action in the sample NonEmptyFallbacks.test.etl.
5. Higher-level constructs make expression definition simpler. Actually, the expression foundation has been borrowed from Prolog; the only major addition is composite operators.
6. Higher-level constructs make extensibility more natural. Compare the TXL extensibility model with the EJ sample. First there is reuse of incomplete abstract expression and statement grammars. Then there is a custom AsyncEJ grammar that adds just a few operators and statements. Try to implement these scenarios in TXL with a grammar of the same volume. Browse the HTML files at: http://etl.sourceforge.net/etl-java-0_2_0/xmlout/ej/grammars/
7. ETL allows the creation of reusable language modules. There are a lot of reusable grammars for XML, but please show me a tool that does clean reuse of some arithmetic language module. Note that I'm not talking about the internal implementation; I'm talking about the syntax.
8. There are greater possibilities for reuse inside a single grammar. For example, the grammar definition for the grammar language itself reuses different operator sets in different contexts. Operators for composite syntax are not available in the context of simple operator definitions, but both contexts reuse some constructs from an abstract context they both include. The EJ grammar from the web site also has examples of operator reuse inside a grammar: operators from the type expression context are reused by the generic expression context.
9. Also note that the HTML files referenced in this message are built by tools that have not been customized for a specific grammar. There are other sources on the website transformed by a customized stylesheet; for example, HelloWorld.ej.etl is highlighted by an XSLT stylesheet that marks defining occurrences of identifiers in italics.
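The Prolog-style operator foundation mentioned in point 5 can be sketched as a small precedence-climbing parser driven by an operator table. The operators, priorities, and fixities below are made up for illustration; following Prolog's convention, a lower priority number binds tighter, yfx is left-associative, xfy is right-associative, and xfx is non-associative.

```python
# Illustrative operator table in the Prolog style: (priority, fixity).
OPS = {
    "=": (700, "xfx"),   # non-associative
    "+": (500, "yfx"),   # left-associative
    "*": (400, "yfx"),   # left-associative, binds tighter than +
    "^": (200, "xfy"),   # right-associative
}

def parse(tokens, max_prio=1200):
    """Parse a flat token list into a nested tuple using the table.
    Operands are bare atoms; parentheses are omitted from this sketch."""
    left = tokens.pop(0)
    while tokens and tokens[0] in OPS:
        prio, fixity = OPS[tokens[0]]
        if prio > max_prio:
            break                      # operator too loose for this context
        op = tokens.pop(0)
        # the right argument's allowed priority depends on fixity:
        # xfy admits the same priority (right recursion), yfx/xfx do not
        right = parse(tokens, prio if fixity == "xfy" else prio - 1)
        left = (op, left, right)
        if fixity == "xfx":
            break                      # non-associative: stop after one use
    return left

print(parse(["a", "+", "b", "+", "c"]))  # ('+', ('+', 'a', 'b'), 'c')
print(parse(["a", "^", "b", "^", "c"]))  # ('^', 'a', ('^', 'b', 'c'))
```

Because the whole expression grammar is just this table, adding an operator is a data change, not a rewrite of hand-coded recursive-descent rules; that is the property the point above is getting at.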

### On XML and syntax vs. semantics

Yes. Allow me to express an opinion, not so much on ETL as on the whole idea of taking XML as a starting point to improve on.

First of all, it's not very hard to improve on XML, speaking in technical terms. XML's technical deficiencies are all very well known. Hell, if all you want is to "create a toolset for working with human-readable and human-writable languages", then SGML is already quite an improvement over its successor.

However, the good thing about XML is that it's a standard. I don't see ETL or anything else becoming a new standard to replace XML.

Also, I got the impression from your comments like

Also, I get the feeling that they try to couple syntax and semantics together too much. This is probably good for a compiler framework, but not so good for a language definition framework. To support just parsing, a lot of things not related to parsing have to be implemented.

that you consider "language definition" to have nothing to do with semantics. Let me reiterate: syntax is the easy part of language design, and parsing is the easy part of its implementation. That holds true even for DSLs. A language doesn't have to be Turing-complete to have semantics.

All that being said, I believe there is room for a DSL-definition framework that allows a more human-friendly syntax than XML but is less generic than context-free grammars. SGML, for example, is still preferred over XML by many organizations in the publishing area.

### On XML and syntax vs. semantics

My note about doing too much was about ASF+SDF. And I completely agree with you that a language does not have to be Turing-complete to have semantics. After all, the ETL grammar definition language itself is not Turing-complete.

On XML: XML syntax suits document markup quite well. I have written the documents for the project in DocBook. It is more or less suitable for data exchange; however, its verbosity and its document-oriented features start to bite there. For programs it is almost completely unsuitable. Even XSLT cheats and uses the XPath expression language to be usable.

The strongest point of XML is tool support. This is why it is used in places where it is actually unsuitable from the surface syntax point of view. And there are features of XML that enable such tool support. What I have tried to do is create a language definition framework whose properties likewise enable a similar tool ecosystem.

I think that program transformation and compilation are well-solved tasks, and there is a lot of research in the area. On the other hand, syntax definition techniques have not improved for ages. LL(1)/LR grammars are the latest adopted improvement. Prolog operator syntax was a very bright idea, but no one has copied it. We are still in the age of production/token.

Parsing is an easy task for a knowledgeable developer. But even for a knowledgeable developer it is still an error-prone effort. As an exercise, try to add a couple of operators with new precedence levels to an existing grammar like Java's. Now, if you have completed this task, try to add these operators while keeping the old grammar intact, without copying text from it into the new grammar. ETL allows this. ANTLR/Yacc will not, at least not in a natural way. From the look of it, neither TXL nor SDF supports this cleanly.
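A rough way to picture the extension scenario above: if operator definitions are data rather than hand-written parsing code, a derived grammar can add a new precedence level without editing the base definition at all. This dict-based sketch only mirrors the pattern; real ETL grammars are of course far richer than a table, and the names here are invented.

```python
# Hypothetical "base grammar": operator -> (priority, associativity).
BASE_JAVA_LIKE = {
    "+": (500, "left"),
    "*": (400, "left"),
}

# The "derived grammar" includes the base untouched and adds a new
# operator at a precedence level between * and +.
EXTENDED = {**BASE_JAVA_LIKE, "<+>": (450, "left")}

# The base definition is reused, not copied or modified.
assert BASE_JAVA_LIKE == {"+": (500, "left"), "*": (400, "left")}
print(sorted(EXTENDED))  # ['*', '+', '<+>']
```

With a hand-coded recursive-descent parser, by contrast, inserting a precedence level means rewriting the chain of mutually recursive rule functions, which is exactly the error-prone edit the paragraph above describes.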

ETL packages that knowledge and makes it more accessible to a wider audience. It also takes common language development patterns, like adding statements and operators and reusing definitions inside a grammar and from external grammars, and makes these patterns explicit in the language rather than implicit, as they are right now in most language definition frameworks.

As a side note, the parser transforms the input grammar to LL(1) form internally and then compiles it to state machines. It then parses the input source using the resulting state-machine representation. So as a low-level parsing technology it is nothing new. All that is new are the higher-level syntax definition constructs applied to human-writable/readable syntaxes. These constructs have been inspired by Prolog, Schema, Dylan, SGML/XML, etc.
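For readers unfamiliar with the underlying technique, a table-driven LL(1) parser of the kind described above can be sketched for the toy grammar S -> 'a' S | 'b'. This is a generic illustration of prediction-table parsing, not ETL's actual compiled form.

```python
# LL(1) prediction table: (nonterminal, lookahead token) -> production.
TABLE = {
    ("S", "a"): ["a", "S"],
    ("S", "b"): ["b"],
}

def accepts(tokens):
    """Run the prediction table over the token list, stack-machine style."""
    stack = ["S"]
    pos = 0
    while stack:
        top = stack.pop()
        if top.islower():                     # terminal: must match input
            if pos < len(tokens) and tokens[pos] == top:
                pos += 1
            else:
                return False
        else:                                 # nonterminal: consult the table
            lookahead = tokens[pos] if pos < len(tokens) else ""
            rhs = TABLE.get((top, lookahead))
            if rhs is None:
                return False
            stack.extend(reversed(rhs))       # push production right-to-left
    return pos == len(tokens)

print(accepts(["a", "a", "b"]))  # True
print(accepts(["a", "a"]))       # False
```

Once the table is built, parsing is a mechanical loop over it, which is why compiling a grammar down to such a state-machine form is standard practice.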

So your phrase looks analogous to saying that arithmetic and control flow are the easy part of program design, so we do not need high-level languages like Fortran. There are still a lot of tasks where the cost of writing a DSL syntax parser exceeds the cost of processing. Currently XML is used for such DSLs. ETL offers an alternative for such scenarios.

ETL has many characteristics that might make it suitable for standardization. Among other things, it decouples syntax and semantics, it can be implemented independently, it supports common usage scenarios, it has an extensibility model, and it is relatively lightweight.

Have you looked at the samples that I have provided? I believe that the resulting languages are much more readable and writable than SGML in the domain of statements and expressions. I also believe that they are more usable for mere mortals than S-expression-based DSLs (end users of DSLs are unlikely to have parenthesis-matching brainware installed). All the grammars in my previous reply are written in an ETL-based language. The HelloWorld sample below is also written in an ETL-based language (you can also see this sample in automatically generated syntax-highlighted form at this link).

    doctype "../grammars/EJ.g.etl";
    package test;
    /// Classical "Hello, World!" program.
    class public HelloWorld {
        /// Application entry point
        /// @param args application arguments
        @[SampleAttribute]
        to static public void main(array[String] args) {
            System.out.println("Hello, World!");
        };
    };


For example, ETL is targeted to cover applications like those listed below, among other things.

• RELAX NG compact syntax.
• The ASN.1 grammar. Many people have complained that the major problem with implementing this standard is writing the syntax generator.
• ANTLR grammars. Also yacc, bison, and the rest of the crowd.
• WS-BPEL (this thing might even be Turing-complete).

### Compare SSS

Alistair Turnbull's Semi-Structured Syntax may be of interest here; it's fairly easy to get the hang of, and has been used to express the syntax of one or two little languages.

### Re: SSS

Having started reading the spec [pdf], I find it interesting to see, in my humble opinion, that while valid points are made about the drawbacks of XML, SSS goes ahead and introduces several new problems. The fact that large sections of the spec are dedicated to specifying how to do indentation, while simultaneously stating that "The white space doesn't mean anything, but it has to be there", indicates to me that SSS could stand to go through some heavy usability testing and revision before anybody else would really want to adopt it.

I guess here's the point I'm making: anything you create, even when based on avoiding the mistakes other people made, is going to have its own mistakes! Sometimes it is worth sticking with the devil you know. I have yet to see anything that is sufficiently better than XML to make me believe people would switch en masse.

### XML has its advantages

Note that XML has its good points in the area of document exchange; it is quite usable there. It has a fine balance of good and bad points in the area of data exchange (S-expressions with a good schema language would have been better, but nothing of the kind appeared before XML took the niche). In the data area, some complex solutions are being proposed that do not significantly affect the data exchange tools.

I think it is unlikely that XML will be driven out of areas where it is good or acceptable, so there will be no switch en masse any time soon. However, there are areas where XML is unacceptable as a surface syntax and is pushed there only because of tool support. In those areas there are chances for migration to better technologies, if they become available.

### SSS is interesting in that

SSS is interesting in the sense that it tries to take on the same problem.

There are some similar high-level decisions:

• The lexical level is fixed. It is even more fixed than ETL's; for example, there is a completely separate syntax for keywords. I have been at this decision point too, but I chose another way. While it looks tempting, I decided not to go this way in my project, as it limits the set of available languages.

• The phrase level is fixed, and it seems to be used for error recovery.

There are also some advantages of SSS over ETL:

• SSS is much more lightweight at runtime.
• Simple grammars are simpler to work with. ETL brings a bit more weight with its mapping constructs like object and let.

On the other hand, ETL does not carry its weight for nothing:

• Extensibility and reuse model. SSS simply does not have one, but it is a major reason why XML is so popular. XML standards are built upon each other: WS-Security uses XML Digital Signatures and XML Encryption, DocBook 5 uses XLink, and SOAP specifies a simple envelope with a few extension points (headers, body) that are extended in generally unknown ways.

• Expressions and statements. SSS is still a BNF language, so it cannot have a usable expression and statement model. This is no problem for small languages, but it disables extensibility. There are also problems with operator associativity: only right-associative operators look expressible (so a+b+c = a+(b+c)). ETL borrows its expression model from Prolog and is much more flexible in associativity options.

• Document object model. It is hard to say without more detailed analysis, but it looks like it would be simpler to integrate ETL with IDEs because the reparsing model is cleaner. However, neither ETL nor SSS has a published reparsing model.

So smaller tasks will favor SSS; bigger ones, or ones where extensions or reuse are possible or even required, favor ETL. ETL also has an advantage in expression handling: there is more to type, but it is easier to maintain. The only way of reuse in SSS that I see is copy/paste. And there is always the problem that tasks tend to grow (PHP, for example).
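The associativity point above can be made concrete with a small sketch: a right-recursive BNF rule such as expr ::= term '+' expr | term can only yield a+(b+c), while a left fold over the same tokens yields the conventional (a+b)+c. Both parsers below are toy illustrations over a flat token list.

```python
def parse_right(tokens):
    """Right-recursive rule: expr ::= term '+' expr | term."""
    head = tokens[0]
    if len(tokens) > 1 and tokens[1] == "+":
        return ("+", head, parse_right(tokens[2:]))
    return head

def parse_left(tokens):
    """Left fold over operands, giving conventional left associativity."""
    tree = tokens[0]
    for term in tokens[2::2]:
        tree = ("+", tree, term)
    return tree

toks = ["a", "+", "b", "+", "c"]
print(parse_right(toks))  # ('+', 'a', ('+', 'b', 'c'))
print(parse_left(toks))   # ('+', ('+', 'a', 'b'), 'c')
```

For + the two trees happen to evaluate the same, but for a non-commutative operator like subtraction the right-recursive shape gives the wrong answer, which is why a grammar formalism restricted to right associativity is a real limitation.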