Tim Bray: Don’t Invent XML Languages

The X in XML stands for “Extensible”; one big selling point is that you can invent your own XML languages to help you solve your own problems. But I’ve become convinced, over the last couple of years, that you shouldn’t. Unless you really have to. This piece explains why. And, there’s a companion piece entitled On XML Language Design, in case you do really have to.

Ouch.

Top-down vs. bottom-up

The analysis is probably right, but the message is wrong. Apparently, XML encourages a master-planned, top-down approach to language design. This is very obvious from his section on Software Pain. ("First you have to do this, then you have to do that, then you have to do yet another thing, before you can even get to the work that you actually want to do.") There are some good essays on why master plans don't work, especially by Richard Gabriel. See Mob Software, for example.

Discouraging developers from developing new languages - and especially encouraging managers, who shouldn't be making technical decisions in the first place, to shoot down developers' suggestions without thinking about the real issues - is definitely the wrong message. The real issue is this: if creating new languages - the primary goal of XML - is such a bad idea, then the only valid conclusion is that XML has failed!

Compare this to the notion of bottom-up design, as explained for example in Programming Bottom-Up.

Several problems with XML; IT industry on acid trip.

XML has several problems. It is slow to parse, it can result in big files, it only describes what the data are and not how to handle them, etc. But it is a step in the right direction: we need a way to make programs interact based on a common protocol. In other words, we have a problem (interaction of programs) and the correct approach (a common protocol), but the wrong solution (XML).

Maybe if every computer came with LISP (or something similar), computers could use that as a protocol instead of XML or any other protocol. If you think about it, there are lots of protocols invented to solve the same problems (or problems of the same nature).

Take html for example; any HTML document can be represented with LISP lists. In fact, if we had LISP data sent over the network, we would have less data to send over (as a LISP program can be sent as a list of tokens); we wouldn't have to invent javascript, dhtml or ajax, as LISP data are both code and data at the same time. We wouldn't have to use SQL, since database schemas could be represented by LISP lists. We wouldn't have to invent SOAP, since LISP data could be used for sending requests over the network; we wouldn't have to invent OpenDoc or document formats, as LISP lists are more than sufficient to describe any format. We wouldn't have postscript, as postscript is nothing more than a bunch of instructions ala xml/LISP. We wouldn't have command line interpreters, since LISP can also play that role. We wouldn't need COM or CORBA. And on top of that, development could be bottom-to-top instead of top-to-bottom, since LISP development is usually interactive.
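The HTML-as-Lisp-lists claim is easy to sketch. Here is a minimal illustration in Python (nested lists standing in for s-expressions); the serializer and the sample fragment are invented for illustration, not taken from any real system:

```python
def to_markup(node):
    """Render a nested-list tree such as ['p', 'Hello ', ['b', 'world']]
    back into angle-bracket markup."""
    if isinstance(node, str):
        return node                      # plain text node
    tag, *children = node
    inner = "".join(to_markup(c) for c in children)
    return f"<{tag}>{inner}</{tag}>"

doc = ["p", "Hello ", ["b", "world"]]
print(to_markup(doc))  # → <p>Hello <b>world</b></p>
```

The nested list carries the same tree structure as the markup, which is the whole point of the comparison: the two notations are interconvertible by a few lines of code.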

And on top of all these things, we also have managers that are clueless about technical details, and they insist on using XML because it will look good on the company's resume...

Lisp is no better

Lisp also has a complicated syntax (quote, backquote, lots of lexical token types), and it is not much less verbose than XML. It has the same lack of inherent semantics that XML has - in fact its data-representation semantics are weaker, since it doesn't have an element/attribute/content distinction. It lacks standards like namespaces and XPath that allow you to express more complex structures than simple trees.

Lisp data is only code if there is a suitable set of function definitions, and these would have to be standardized for interoperability. The same is true for all of the other "just use Lisp instead" things you list above. Without these standards there is no way Lisp could support the kind of software ecology that has grown up around XML. As a concrete example (going back to the origin of it all), how would your ideal Lisp browser display a random bit of Lisp as a web page?

"Lisp data is only code if th

"Lisp data is only code if there is a suitable set of function definitions, and these would have to be standardized for interoperability."

Yes, but well, there are standardised functions, just as in any programming language.

And I guess a browser could display random Lisp a bit like this. But I don't know if it's optimal, as I haven't looked at it much.

Lisp is better

If you take Common Lisp or ISLISP, you have an attribute/value distinction (property lists and keyword arguments). Scheme also has association lists. Common Lisp has namespaces (its package system). I don't know enough about XPath to be able to comment on that.

Common Lisp is an ANSI standard, ISLISP is an ISO standard, Scheme has an IEEE standard.

Lisp was mentioned as an example.

I agree that Lisp has a somewhat complicated syntax, but the fact that the syntax is programmable more than makes up for the lack of data-representation semantics. As for standardization, it's a matter of decisions.

I do not understand how your question of 'how to show a random bit of Lisp as a web page' actually has any meaning. The point of Lisp is that it has a syntax that can be extended, which means that an XML-like DSL could easily be specified, as well as extended with more features if the need arises. The real advantage here, though, would not be the DSL, but the ability to validate/interpret/process it without any tools other than the interpreter.

XML is not S-Expressions

The general tone of the preceding post seems to be waving a Lisp hammer in a world of nails, and some of the arguments, such as "We wouldn't have to use SQL, since database schemas could be represented by LISP lists", just can't inspire any kind of reasoned reply. But let's stick to the central argument: this little chestnut about XML being some kind of brain-damaged version of sexps gets thrown around every time the subject of XML comes up, so it's worth seeing the opposing viewpoint:

http://www.prescod.net/xml/sexprs.html

Specifically, while sexps may do a great job at structuring hierarchical data, they're lousy at markup. I can't agree with Tim Bray's argument either, but he only seems to be misguided from a political viewpoint, not a technical one.

XML is worse than s-expressions

If you want markup, you can portably add it to Common Lisp - see Lisp Markup Languages. See also the links on that page to markup languages in other Lisp dialects.

Philip Wadler has a paper where he briefly compares XML and s-expressions. From The Essence of XML: "XML is touted as an external format for representing data. This is not a hard problem. All we require are two properties:

  • Self-describing: From the external representation one should be able to derive the corresponding internal representation.
  • Round-tripping: If one converts from an internal representation to the external representation and back again, the new internal representation should equal the old.
Lisp S-expressions, for example, possess these properties. XML has neither property. It is not always self-describing, since the internal format corresponding to an external XML description depends crucially on the XML Schema that is used for validation (for instance, to tell whether data is an integer or a string). And it is not always round-tripping, since some pathological Schemas lack this property (for instance, if there is a type union of integers and strings). So the essence of XML is this: the problem it solves is not hard, and it does not solve the problem well."
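Wadler's round-tripping failure is easy to demonstrate concretely. The following Python sketch (my own illustration, not from the paper) shows that once an integer and a string are serialized as element text, the external form no longer determines the internal one without a schema:

```python
import xml.etree.ElementTree as ET

def to_xml(value):
    """Serialize a value the way XML must: as element text."""
    el = ET.Element("value")
    el.text = str(value)
    return ET.tostring(el, encoding="unicode")

print(to_xml(42))    # <value>42</value>
print(to_xml("42"))  # <value>42</value> -- identical external form

# Parsing back always yields a string; without a schema there is no way
# to recover whether the original was the integer 42 or the string "42".
parsed = ET.fromstring(to_xml(42)).text
print(type(parsed))  # <class 'str'>
```

This is exactly the int/string union case Wadler names: the external representation is self-describing only relative to a schema.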

Specifically, while sexps may

Specifically, while sexps may do a great job at structuring hierarchical data, they're lousy at markup.

This is probably true and one reason why SGML was invented. SGML is great at markup and lousy at representing hierarchical data. It's quite complementary to sexprs.

Now look at XML: It sucks at both tasks. That makes it universal, at least that's what the industry is trying to tell us.

The difference between engineering and mathematics?

XML has that engineering quality of striking a balance between several opposite requirements. This gives it a huge edge over both Lisp and SGML in the markup world. Making good trade-offs is the soul of engineering.

"just can't inspire any kind of reasoned reply"???

Without wanting to start a language war: can you back your claim up? SQL is a language for expressing database operations... a programming language like Lisp can easily be used to provide a custom DSL like SQL for database operations.

As for the article about s-exprs, let's analyse it a little:

"There is no standard syntax for attributes across Lisp-like languages"

That's not a criterion for language quality. It's a matter of circumstances and politics. If Lisp vendors decided they wanted a standard syntax for attributes, they could make one. The point is that the mechanisms of their language allow it.

"To me, the XML version looks easier to read"

That's subjective, as the author says; but if the XML document is 10 pages long, it is just as unreadable as Lisp.

"The XML one defaults to treating random characters as text, not as markup"

So? Big deal. How often are there "random characters" anyway? Lisp is a programming language; string literals are enclosed in double quotes. While XML can stand for representing information, it cannot do anything else.

"The XML one does not use standard human-punctuation characters as markup."

Well, the characters are special: they are the symbols for 'less than' and 'greater than' respectively. Show an XML document to a mathematician and he will have a hard time digesting things like "<p>paragraph</p>".

"Redundancy Is Good"

No one stops you from using an extra word at the end in your DSL for this purpose.

"So in my opinion, XML's syntax is wildly better than S-expressions as a language to integrate the worlds of documentation and data."

I fail to see the 'wildly better' thing, and the author simply misses the point: the real power of Lisp is the ability to invent any DSL needed for a specific task, as well as code and data being integrated.

"DTDs, RELAX and XML Schema define constraints on individual instances of XML documents. This is a necessary feature for web services."

A Lisp-like environment is much more open to checking constraints, since it can be interactively programmed to do so. Extending a web server with new functionality is much easier with an environment like Lisp than with XML.

"XPointer allows change-resistant addressing of particular nodes in the parse tree ("infoset") in URIs."
"XQuery allows the expression of sophisticated queries on XML documents."
"XSLT allows the declarative, pattern-based transformation of XML documents."
"CSS and XSL allow XML to be presented to human readers with formatting."

All of these things can be done in a Lisp-like environment very easily.

"You cannot evaluate the hype around XML without incorporating all of these technologies into the evaluation. Cumulatively there are decades of person-effort embodied in those specifications!"

The effort would be much less if something like Lisp was used.

"The central idea of the XML family of standards is to separate code from data."

Which never actually happens with XML: most programs, if not all, use XML for communication with other environments, converting XML from/to a set of native objects. And two applications can share the same XML information, but if they need to manage it in exactly the same way, they need to share the code too.

"The central idea of Lisp is that code and data are the same and should be represented the same. The Lisp community's idea of "Schema" would likely be "Lisp program". The Lisp community's idea of "addressing language" would likely be "Lisp program." The Lisp community's idea of "query language" would likely be "Lisp program.""

Good, at least the author understands a few things.

"Unfortunately this response ignores the Principle of Least Power"

I don't see how that applies to a Lisp-like environment.

What the author cleverly sidesteps is just how powerful a system like Lisp is. It can be procedural, functional, object-oriented, purely declarative, mathematical, etc. I am not praising Lisp; I am praising the ideas of macros, extensible syntax, data as code and vice versa, etc. Lisp has its flaws, and there are various implementations around, each with its own problems, but there are incredible ideas behind it, not properly exploited by IT.

Everything's a string!

"The XML one defaults to treating random characters as text, not as markup"

So? Big deal. How often are there "random characters" anyway? Lisp is a programming language; string literals are enclosed in double quotes. While XML can stand for representing information, it cannot do anything else.

Wouldn't you say that it is this property of XML (its (dis)ability to slurp up everything that isn't explicitly markup as strings) that forces one to "define constraints on individual instances of XML documents" using things like "DTDs, RELAX and XML Schema"? Lisp has a type system; it doesn't treat everything as a string by default. You can declare types in Lisp without resorting to external verifiers. This is good for storing and transmitting data. The fact that XML itself is so bad at defining and structuring data is evident in all of the external documents and tools needed to verify that an XML document is valid. However, I think that many in industry (especially those who don't have to deal with the XML Schemas directly, i.e. managers, etc.) like the idea of having a separate document that defines the types and structure of XML documents, even though such a system is less flexible than specifying types and structure in the document itself. I suppose it gives people a sense of security: we have this document that defines our XML right here, so we know our XML is valid.
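To make the contrast concrete, here is a small Python sketch (a Python literal standing in for typed Lisp data, since both carry types in the surface syntax; the element names are invented for illustration): the XML reading needs a schema to recover types, while the typed literal does not.

```python
import ast
import xml.etree.ElementTree as ET

# Reading a record from XML: every leaf comes back as a string, and only
# an external schema can say that <x> was meant to be an integer.
record = ET.fromstring("<point><x>1</x><y>2.5</y></point>")
x = record.find("x").text
print(repr(x), type(x).__name__)            # '1' str

# Reading a typed literal (the s-expression analogue): the types survive
# the external text form with no external verifier.
point = ast.literal_eval("[1, 2.5, 'label']")
print([type(v).__name__ for v in point])    # ['int', 'float', 'str']
```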

>> "There is no standard synt

> "There is no standard syntax for attributes across Lisp-like languages"
> That's not a criterion for language quality.

In theory, no, but in practice it matters a lot.
The reason XML matters is that it has one unique syntax, not many, which of course helps synchronisation.
Of course, they dropped the ball with XML Schema (and IMHO XSL sucks, but...)

As for '<>' versus '()': if one takes the XML view that everything is text, then '()' is at a disadvantage, as parentheses are quite common in text (just look at this web page), so they increase the use of escape characters, which lowers readability.
Maybe LISP should use [] or {} instead :-)

The problem is not XML

I've read both the articles, and Tim Bray has some excellent points to make. But none of them is really specific to XML. For example: don't invent a new language if you don't have to. That's common sense whether or not the new language is an XML language. Designing languages IS difficult. That doesn't depend on XML either. As a matter of fact, I didn't find a single reason NOT to use XML in the above articles. His points were about language design in general.

That's even worse :-)

That's even worse :-)

It's good!

It means that inventing languages is hard, which justifies a dedicated weblog! He should have said: Don't invent languages, unless you read Lambda the Ultimate.

:-)

:-)

Amen to that!

But Tim Bray's site is worth a visit all the same. Check out the flower photos.

Designing programming languages

Whenever you write a library/API, you are designing a new language, i.e., a family of operators that are meant to be used together in a certain way. Are you suggesting that we shouldn't program if we don't have to? ;)

Jargon

A family of operators is not a language, not even a dialect. It is at most a jargon (or gibberish ;)

Kay

Jargon

Consider a stream abstraction with operators "open", "read" and "close". "Read" cannot be called unless "open" has been called first, and not after "close" has been called. Furthermore, it's a good idea to always "close" streams.

This is not just a bunch of operators thrown together: they are to be used in a certain way - in a sense, they have to obey a "grammar". I would already consider this to be closer to a language.

I would claim that most APIs have such non-trivial interdependencies between operators.
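The open/read/close "grammar" above can be sketched as a runtime check. This hypothetical Python Stream (names and behavior invented for illustration) rejects ill-ordered call sequences the way a parser rejects ill-formed sentences:

```python
class Stream:
    """A toy stream whose operators must obey an open / read* / close grammar."""

    def __init__(self):
        self._state = "new"

    def open(self):
        if self._state != "new":
            raise RuntimeError("open() may only start the sequence")
        self._state = "open"

    def read(self):
        if self._state != "open":
            raise RuntimeError("read() outside an open/close pair")
        return "data"

    def close(self):
        if self._state != "open":
            raise RuntimeError("close() requires a prior open()")
        self._state = "closed"

s = Stream()
s.open()
print(s.read())        # the grammatical order works
s.close()

t = Stream()
try:
    t.read()           # the "sentence" starts with the wrong word
except RuntimeError as e:
    print("rejected:", e)
```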

Not sure I agree

There is nothing in the English language that enforces that I use the following three conversational phrases in a particular order:

  • "Hello"
  • "Nice weather we're having"
  • "Goodbye"
Well, from an API perspective, the enforcement of the rule that says "read" cannot be called unless "open" has been called first, etc., is extra-linguistic. That is, there's nothing in the syntax or structure of the language that enforces this order beyond a runtime error (e.g. saying "goodbye" first and then saying "hello" is legal from an English standpoint, but confusing nonetheless).

Enforcement

It doesn't need to be enforced. "Runtime errors" are sufficient, so to speak. ;) (Languages that have dynamic type systems are still languages, right?)

Consider "We weather nice're having." You would get a bad mark for this in school, right?

Furthermore, I vaguely recall reading about an approach that makes it statically checkable that open, read and close are called in the right order, but I don't remember where I read about it. As another example, design by contract is, though not static, at least an approach that claims to help detect these problems earlier. Etc.

It's languages all the way down

ITLS! I.e., this notion that there's a separation between "programming" and "designing languages" is completely false. Of course, there's certainly a spectrum in the quality of the resulting languages.

You are right!

This is a big problem in Tim Bray's arguments! He says: don't invent new XML languages, but use microformats. But all his arguments apply to microformats as well.

The guy is right.

I couldn't agree more with John Mitchell. Programming is mostly language creation. All these efforts (AOP, LOP, etc.) and every API designed are mostly language-creation tasks. And since language creation is the main task, the best way to "program" is to work with a tool that allows "language creation"... i.e. a system as open as Lisp.

Language Oriented Programming

The link leads to an article about Language Oriented Programming. It's funny that it is mentioned as the "next programming paradigm" when it is over 50 years old. That's why I said "IT is on an acid trip."

Don't design a database, either!

I'm really confused by the articles. The only explanation I have is that there's an unwritten assumption that he really means "don't invent a new language and try to push it down everybody's throat, i.e., abolish the W3C", but then there is the last "To the Managers Out There" paragraph, which is ridiculous and, as Pascal Costanza said, really dangerous.

Let's compare XML design with database design:

  • Neither Easy Nor Fun - definitely.
  • Pass/Fail Ratio - I don't really know what he means by the "does it matter" test. No XML language has established world peace yet, and no database has either. So I guess they all fail.
  • Software Pain - Oh yes. There's table definitions, constraints, triggers, query optimisations, and of course applications. There's no such thing as a standardized parser, even.
  • Network Law, Opportunity Cost - Even worse than XML.
  • The Big Five - You should definitely check if you can buy a database product that already does what you need.
  • To the Managers Out There · The next time one of your technical superstars comes into the room and says “We gotta design a database for X”, make them prove they can’t buy one. And if they can prove it, sigh deeply and budget a couple of years’ delay, and a few thousand more engineering hours.

So databases are even worse than XML languages. And you could make a similar list for any other kind of business application. What Tim Bray is saying is that no software deserves to be developed.

The next 600 data languages

Funny he mentions 600 data languages, and on the agenda for tomorrow's POPL session is a talk entitled The Next 700 Data Description Languages. Wish I could be there!

XML is useful, but it is not a language.

1. The author is confusing language with data format.

XML is not a language, nor are its subsets like XHTML and Atom. They are data formats.

2. The issues facing someone developing a language are very different from those facing someone developing a data format. It should never take thousands of hours to develop a data format, especially one that is based on a pre-existing framework like XML.

3. The author has totally ignored internal data formats. Internal data formats, whether or not they are based on XML, are easy. I have personally created several in the last couple of weeks, all of which were successful.

4. External data formats, those used outside the developer's company, are another issue entirely.

In my opinion, they should be developed by the standards body for the industry in question. For example, MISMO (Mortgage Industry Standards Maintenance Organization) controls the XML specs for the mortgage industry. No "selling" is required because the people who need to use it are the ones who developed it.

5. The author is assuming the language predates the software. In the real world, the software usually exists first. The problem is, there is a lot of software that already exists and none of it is talking. Thus a standards group has reason to design a common format so the programs can share data.

6. Every data format should be specific to the data it contains. You cannot use any of the "languages" mentioned by the author to hold mortgage applications in a meaningful way. You need a very detailed data definition, including ranges, units, and real-world meanings for every field. This is time consuming, but would have to be done even if you used one of the "languages" cited by the author.

To clarify a few things:

XML is not a language, nor are its subsets like XHTML and Atom. They are data formats

XML isn't a "language" in the programming-language sense; it really isn't a "data format" either. Rather, it's a specification for creating text-based data representations, all of which operate on a common substrate. Those text-based data representations may or may not be languages - XSLT is generally considered to be a programming language by most definitions. OTOH, an XML application for describing a purchase order would not be. XHTML is complicated and interesting enough that I consider it to be a fine example of a DSL.

Then what is a data format?

If XML is not a data format, then what is? Does that mean CSV is not a data format? How about R12?

In my mind a data format is a known way of representing data. It is not necessarily rigorously defined with meaning of every value known, though that level of depth is appreciated.

One of my major objections to the article is the term "language", which on its own has no useful meaning. It doesn't convey the underlying common issues that a class of languages shares. If you say X is a "written language" or a "programming language", then I immediately know what it is used for. If the former, then I won't ask what hardware it is used on. If the latter, I won't ask what countries use it.

If you want to call a self-describing data format a markup language, so be it. But it shouldn't be called simply a "language", especially when that could give the connotation that it is a programming language, and thus has similar issues.

As for your comments on XSLT and XHTML, I agree.

One final note: writing new programming languages isn't that hard either. I have worked with a company that created its own programming language for running a genetic algorithm engine in about a week. Granted, it was domain-specific.

Eh?

One final note, writing new programming languages isn’t that hard either.

Well, implementing a programming language might not be that hard, but designing one certainly is. Otherwise, how do you explain the fact that many (most?) programming languages are severely broken in one respect or another? Designing a programming language typically means making tradeoff decisions here and there, and although a decision may seem right at the time, it may turn out to be very wrong. A good example is C arrays. When C was created, the tradeoff between speed and safety was typically biased in favor of speed; hence the unsafe but very fast C array. Unfortunately, that tradeoff ended up being a very bad decision, for two reasons:

  1. Computers soon became fast enough (depending on your definition of soon and fast enough, of course) that such a decision was no longer warranted (except in the most critical of applications).
  2. Although the cost of the decision is hard to estimate, the damage is very clear: many pieces of malicious code have taken advantage of the buffer overflow, a direct result of unsafe C arrays. The cost of this damage could be in the millions, or even billions, of US dollars.

Fortunately for the modern programming language designer, there is a lot of past research and experience regarding the design of programming languages. Also, there are forums such as LTU that have many bright individuals (excluding myself of course; maybe some day I'll be among those who understand this stuff ;) who are very knowledgeable of programming language theory. This makes the job of designing a programming language easier, as there are many people who know the tradeoffs and have experience as to what the better choice is.

Otherwise how do you explain

Otherwise how do you explain the fact that many (most?)--snip--

I think the word you're looking for is "all".

Also... I dispute that there's a "better" choice for every dilemma in PL design. Choices seem to combine in a way that has non-linear effects on the value of the resulting language.

Design decisions

"I dispute that there's a "better" choice for every dilemma in PL design. Choices seem to combine in a way that has non-linear effects on the value of the resulting language."

That's why it's better to have a language toolbox that allows you to create your own language for the task at hand.

Yes.

Also... I dispute that there's a "better" choice for every dilemma in PL design. Choices seem to combine in a way that has non-linear effects on the value of the resulting language.

Yes, that is true. In fact, I would argue one of the most difficult aspects of language design is deciding what tradeoffs to make for your particular circumstances. I suppose that is why they are called tradeoffs in the first place: you're making some sort of compromise.

DSL vs. general purpose.

The key word in his post was "domain specific". Most programming languages suck because they have to balance a wide variety of tradeoffs; domain-specific languages have a much narrower requirement, and hence don't need to make so many tradeoffs. I've written a DSL in a day that works perfectly well for my purposes, cuts my job by three quarters, and has already saved me significantly in maintenance work. I wouldn't think of using it for general programming (in fact, I can't - it's tied to the specific framework and task it was made for, and isn't really Turing-complete), but it gets the job done fine.

First-hand experience is confusing me

I would like to accept Tim Bray's arguments. In fact, for IT-oriented applications they may (or may not) make sense. For the kinds of software development I've been doing, custom XML description schemata have been pretty nice.

Concrete example: The X protocol description language we've developed for XCB. This language was originally written as an M4 meta-language, and Jamey Sharp and I were pretty happy with that. But no one else would work on it with us. :-) Besides, M4 is hard to debug. So Jamey and Josh Triplett designed an XML representation of X protocol requests, and an application to generate C code from it using XSLT. (XSLT, BTW, which Bray doesn't really mention at all, is one good reason to represent things as XML: it's a pretty clean way to do data transformations.) We've since built other applications to process the same description, including a protocol analyzer plugin and some documentation stuff (in progress). Other XCB contributors have successfully written descriptions of protocol extensions. It certainly hasn't taken us thousands (or even hundreds) of hours, and it certainly seems to be working pretty well for us.
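A toy version of that pipeline might look as follows. These element and attribute names are invented for illustration, not XCB's actual schema, and the transform below is a plain tree walk in Python rather than XSLT:

```python
import xml.etree.ElementTree as ET

# A made-up protocol-request description in the spirit of XCB's XML.
desc = """
<request name="CreateWindow">
  <field type="uint32_t" name="wid"/>
  <field type="uint16_t" name="width"/>
</request>
"""

req = ET.fromstring(desc)
args = ", ".join(f'{f.get("type")} {f.get("name")}'
                 for f in req.findall("field"))
prototype = f'void {req.get("name")}({args});'
print(prototype)  # → void CreateWindow(uint32_t wid, uint16_t width);
```

Once the description exists, the same tree can feed a C code generator, a protocol analyzer, or documentation tooling, which is exactly the multi-consumer payoff described above.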

So why do we avoid Bray's critique? I don't know, but I'll make some guesses:

  • We know what we're doing. We all have experience designing and implementing programming languages and data descriptions. Josh and Jamey have a lot of clue about XSLT.
  • Our work is local to our project. We aren't trying to interoperate with anybody: just using XML as a convenient data representation syntax with nice tools available for it.
  • We prototyped the schema first. The XML schema pretty directly mimicked the M4 description, which meant that we knew what we wanted to do when designing it.

I'm not sure what our experience (on other projects as well) says about new XML deployment in general; regardless, I'm certainly not giving up on representing data with new XML schemata any time soon.

(BTW, I agree completely with Bray that XML descriptions should always have explicit, automatically validatable schemata. In my experience, the RELAX NG meta-language greatly eases this task.)

Experience

"We know what we're doing. We all have experience designing and implementing programming languages and data descriptions. Josh and Jamey have a lot of clue about XSLT."

I think that's an important statement. You only gain experience in doing something by actually doing it.