Martin Fowler on Language Workbenches and DSLs

I thought this would be interesting, if only because of the author: Martin Fowler, of UML and XP fame, on

Language Workbenches: The Killer-App for Domain Specific Languages?

and

Generating Code for DSLs

Never really took the guy to be a language guru. His books are OK though.

[fixed second link]

Advanced Perl example

You can get a lot of the first link from chapters 8-10 of The Art of Unix Programming by Eric Raymond. You may also be interested in this article.

The second link looks like a C#-ified lift of chapter 17 of Advanced Perl by Sriram Srinivasan (1997).

I hope you find these links useful.

Cool

Nice stuff, enjoyed the second link a lot.

Having read both the articles there is one thing I am wondering about. What are they using to typeset their articles? Both articles look like they have been processed. SGML or TeX? Can anyone recommend an easy TeX-like macro language (preferably not some TeX2Html) for web publishing?

Skribe

For web publishing, maybe you would be interested in Skribe.

LyX

LyX is not exactly a macro language, but still pretty cool.

LyX and Literate Programming

What's especially interesting about LyX to me is its integrated support for literate programming via noweb. Check it out.

Great

I was thinking more in the direction of Skribe since I tried DocBook once (and fled screaming, thus scaring my neighbours ;0) and didn't want to go down the road of TeX for web publishing since I felt I needed a more lightweight approach. (Thought of building my own publishing system, but, uhm, decided against it. That took some coding nights though.)

In the end, the only functionality I feel I want from TeX is (a) simple macros and (b) the automatic generation of tables, links, and bibliographies. So I guessed something between a simple macro-like language and SGML would be best.

I'll try them both.

[Update: I didn't read the other threads. Maybe I'll be stubborn and try to do it myself in JavaScript ;-)]

Textual input and Language Workbenches

I thought that one of the interesting parts of Martin's articles was the subtext (no pun intended) that DSLs of the sort he's talking about are not well suited to being expressed in traditional textual forms. I personally agree with the majority of the articles' contents, which are the best explanation I have seen of the subject. However, I wrote up some of my thoughts on whether text has a role to play in DSLs in a blog entry, and why people might wish to see the back of textual inputs.

In subsequent private correspondence with Martin, it transpires that he doesn't share the enthusiasm of some of the language workbench people for working towards the eradication of textual inputs. However, I'm interested that no one else seems to have really picked up on this particular part of the language workbench concept - it would indirectly force a major change in the way that one could develop software.

Ambiguous about it too

When introduced to UML, a substantial part of a big engineering department voiced the following opinion: "We don't use hieroglyphs anymore in the western world since we broke with Egypt." [As told by a colleague] Given the current state of the art, I find it hard to disagree; at least in the application domain of general purpose languages, text is just too powerful.

However, I feel some complexity is better managed pictorially. Looking at DB and relational modelling tools there still is a lot happening there; they seem to shape up to expectations. (Even if there is always text behind the surface, and rightly so, I think.) An example where a graphical language does substantially increase programmer productivity?

Personally, even if I still program in Vim, I really do think that graphical languages, even general purpose ones, will break through eventually; but I don't expect a revolution within, say, ten years. (But yeah, the world also didn't need more than five computers, right?)

As I see it, one major show-stopper for graphical languages at the moment is that I don't know one language where graphical source code can be kept for more than ten, or twenty, years without rotting. [With the exception of spreadsheets?] Do we have any reason to believe that UML will still be the-thing-to-do in fifteen years? This stuff just takes a lot of time.

I definitely see a place for

I definitely see a place for diagrammatic languages too. After all, that's a substantial part of what my day job pays me to research ;) I should note, though, that Martin wasn't referring directly to diagrammatic languages. The JetBrains stuff inhabits the previously largely uncharted territory between textual and diagrammatic language.

Referring to your post, personally I have always believed that the somewhat apocalyptic vision one sometimes hears about diagrammatic languages completely elbowing out textual languages seems rather unlikely. Diagrammatic languages have many uses, but they're probably not suited to everything. And, of course, we have a few lines of legacy textual code to consider too. Working out what works best in each domain, and how to get them working well with one another, is likely to be one of the most interesting pieces of work over the next few years IMHO.

You are right to point out the problems of storing models. This mailing list post documents a few run-ins I had with XMI a while back. [XMI is the so-called "standard" for interchanging UML models.]

Text vs. pictures

I am firmly in the text camp, but I agree that this doesn't really have to be either/or, and that only time will tell.

Personally, I think that if one must go to diagrams, it is better to have an underlying textual language, that can be understood by humans reading the text, and use diagramming techniques to make understanding/manipulation easier.

Also: one should distinguish between programming in the large (i.e., using UML models for architectures, top level design etc.) and programming in the small (i.e., algorithms). The case for diagramming the former is much stronger.

Graphical languages are better for small programs.

The case for diagramming [programming-in-the-large] is much stronger [than for diagramming programming-in-the-small].

I think just the opposite. One of the biggest problems with extant graphical languages is that they are inefficient in terms of space ("screen real estate"). The amount of information (real information, mind you---not intuition or warm fuzzy feelings) you can get out of a page of graphs is much less than you get out of a page of text, even if you're not using mathematics. Whenever I see one of those "software layer" diagrams, or those toolchain diagrams (you know, with a little computery thing connected to a database/cylinder, which is connected to another computery thing), I just skip it because I find them next to useless. I know that UML, etc. are somewhat different from this, but, really, the difference is pretty minor.

(A problem with UML-type things is that they purport to give you a broad graphical overview of a system, but really the important details are all captured outside the system, for example in the text adorning the diagram. If the diagram notation really carried any useful information, you could erase all that text and replace it with unique numbers and the diagram would still be useful. As it is, the low-level details, which are by and large outside the diagram, can easily subvert the impression you get from the diagram itself: the diagram is potentially treacherous.)

Part of the reason for the space inefficiency is that all graphical languages seem to be graph-based, and projecting a graph onto a 2D grid in a dense way makes it unreadable. Also, because you're using a projection of a graph instead of the graph itself, there are more ways to present the same graph, and it is difficult to see if two graphs are the same (in any sense of the word). And if a formalism makes it difficult to see if two things are the same, it better have both 1) a complete set of rules for showing that two things are equivalent and 2) some other feature which makes up for this nuisance. UML, and indeed all of the "large-scale" graphical languages I've seen, have neither.
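To make the "same graph, different drawing" problem concrete, here is a minimal Python sketch, not part of the original post. The function name `same_graph` and the example graphs are illustrative assumptions. It checks by brute force whether two small undirected graphs are identical up to a renaming of nodes, which is exactly the comparison a reader has to do by eye when two layouts differ.

```python
from itertools import permutations


def same_graph(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force isomorphism check; only sensible for very small graphs."""
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    target = {frozenset(edge) for edge in edges_b}
    # Try every way of renaming the nodes of graph A to nodes of graph B.
    for perm in permutations(nodes_b):
        renaming = dict(zip(nodes_a, perm))
        renamed = {frozenset((renaming[u], renaming[v])) for u, v in edges_a}
        if renamed == target:
            return True
    return False


# The same 4-cycle written down in two different orders.
square = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
crossed = [(1, 3), (3, 2), (2, 4), (4, 1)]
# A triangle with a pendant node: also four edges, but a different graph.
paw = [(1, 2), (2, 3), (3, 1), (3, 4)]

print(same_graph("abcd", square, [1, 2, 3, 4], crossed))
print(same_graph("abcd", square, [1, 2, 3, 4], paw))
```

The search tries every permutation, so its cost grows factorially with the number of nodes; that is fine for a toy, and it hints at why eyeballing two dense layouts is hard.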

(Maybe what people ought to be looking into is graphical languages which are lattice-based, so they can better exploit the geometry of a screen or rectangular region on paper. So you would have a program that looks more like a bitmap or cell automaton than a bunch of circles, squares and connecting line segments. But I have my doubts about this too.)

On the other hand, I know from experience using diagrams in category theory that for small problems they can be quite useful, even though I still find it a pain to typeset them.

it's important to understand

it's important to understand that behind any decent uml tool there's a model of the software. the diagrams are various views of that model and the tool guarantees that they are consistent in various ways.

each diagram shows only some of the information, and some details are hidden away, but there's still a lot of "code equivalent" in the model. when you generate the code, there's a lot there.

the obvious advantage of this approach is that different diagrams show different aspects of the design. that a particular diagram is simpler than a page of text is a *feature*, not a bug.
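As a rough illustration of "one model, several views", here is a minimal Python sketch. It is not how any particular UML tool works; `ClassModel`, `MODEL`, and the two view functions are hypothetical names invented for this example. Both views are computed from the same underlying data, which is why they cannot drift out of sync.

```python
from dataclasses import dataclass, field


@dataclass
class ClassModel:
    """One class in the underlying model (hypothetical structure)."""
    name: str
    bases: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)  # attribute name -> type name


MODEL = [
    ClassModel("Customer", attributes={"name": "str", "orders": "Order"}),
    ClassModel("Order", attributes={"total": "float"}),
    ClassModel("RushOrder", bases=["Order"]),
]


def inheritance_view(model):
    """View 1: only the generalisation arrows."""
    return [(c.name, base) for c in model for base in c.bases]


def association_view(model):
    """View 2: only attributes whose type is another class in the model."""
    names = {c.name for c in model}
    return [
        (c.name, attr, type_name)
        for c in model
        for attr, type_name in c.attributes.items()
        if type_name in names
    ]


print(inheritance_view(MODEL))
print(association_view(MODEL))
```

Each view deliberately drops information (the inheritance view says nothing about attributes), which matches the point that a simpler diagram is a feature of the approach.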

one way to see this is to reverse engineer code that was not developed with uml. you get to grasp the basic relationships between classes much more quickly than you do by reading the code and you can easily spot various errors - with a little experience you can develop a visual aesthetic for what good designs look like.

having said all that, i'm still not convinced uml is the best way to develop code, even for large teams that need to communicate across different levels, using the less-dynamic oo languages. however, if anyone is reading this and hasn't tried it, i'd suggest giving it a serious go. it takes time to learn, but it has some advantages - things that perhaps aren't obvious, even to very smart computer scientists, without some experience.

visual aesthetic

'with a little experience you can develop a visual aesthetic for what good designs look like.'
I bet you the same visual aesthetic effect could be achieved by using a sufficiently advanced source code grapher.
That is to say that the good designs are not necessarily caused by their being made in UML, but rather that you have been trained to recognize patterns in good designs when made with UML.
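For what a very small "source code grapher" might look like, here is a hedged Python sketch using the standard `ast` module. `SOURCE`, `inheritance_edges`, and `to_dot` are names made up for this example, and it only handles plain `class A(B):` definitions. It emits Graphviz DOT text, which a separate tool would have to lay out.

```python
import ast

SOURCE = '''
class Shape: pass
class Circle(Shape): pass
class Square(Shape): pass
'''


def inheritance_edges(source):
    """Return (subclass, base) pairs for simple `class A(B):` definitions."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            for base in node.bases:
                if isinstance(base, ast.Name):  # skip dotted or computed bases
                    edges.append((node.name, base.id))
    return edges


def to_dot(edges):
    """Render the edges as Graphviz DOT text."""
    lines = ["digraph classes {"]
    lines += [f'  "{sub}" -> "{base}";' for sub, base in edges]
    lines.append("}")
    return "\n".join(lines)


print(to_dot(inheritance_edges(SOURCE)))
```

The picture comes from the text here, not the other way round, which is the scenario the comment above has in mind.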

Aesthetics ?

One should be very careful about recognizing patterns, too.

Models like UML work at such a high level of abstraction that unless you know precisely what is not in the diagrams, you cannot compare any two, imho.

One thing that surprises me is the amount of pure text in hardware design documents (especially digital hw). And these guys already have visual tools. If there's going to be a successful visual language, my money is on a language that replaces Verilog.

Information density

The amount of information [...] you can get out of a page of graphs is much less than you get out of a page of text

Unless, of course, if your concrete text syntax is XML...
<gd&r>

> One of the biggest problems

> One of the biggest problems with extant graphical languages is that they are
> inefficient in terms of space ("screen real estate"). The amount of
> information (real information, mind you---not intuition or warm fuzzy
> feelings) you can get out of a page of graphs is much less than you get out
> of a page of text, even if you're not using mathematics.

This depends entirely on the graphical language in question. There's no lore of the universe that says "graphical languages shall be more space inefficient". As a simple example of an extant graphical language, if you take a page worth of UML class diagram, and its equivalent in text, the text will typically take several pages - because, conventionally, we tend to use lots of newlines and blank space to split up the text. Of course, some graphical languages will be more space inefficient, but that's not inherent in the paradigm.

graphical languages seem to be graph-based

Typically yes. Text tends to be naturally represented by trees; graphical languages tend to be richer in certain respects (e.g. associations in UML), and generally are most naturally represented as graphs. [As a side note, this richness does not necessarily come without other costs though. Associations, for example, can be painful for tool implementers.]

> it is difficult to see if two graphs are the same (in any sense of the
> word). And if a formalism makes it difficult to see if two things are the
> same, it better have both 1) a complete set of rules for showing that two
> things are equivalent and 2) some other feature which makes up for this
> nuisance. UML, and indeed all of the "large-scale" graphical languages I've
> seen, have neither.

This is irrelevant to the discussion. A graphical language might be formal, but most aren't, and I can't imagine any good reason why they should be. Thus whether graphs can be proved equivalent is a red herring. When the day comes when you can prove two C programs are equivalent, then I'll start worrying about proving graphs equivalent. [And yes, I am well aware that the day in question can't come :)]

Graphics versus text

This depends entirely on the graphical language in question.

And the textual language. And some textual languages, like mathematics, are extremely concise. Take for example a datatype of lists. In a UML-like language, I need to draw three boxes: one for List, one for Cons and one for Nil, and some inheritance arrows. In text, I can just write, for example, mu x. 1 + Int * x [fixed]. The UML version, at a legible size, is going to be much bigger (and it still doesn't mean quite the same thing).
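For readers who don't know the notation, here is one possible reading of mu x. 1 + Int * x spelled out as a minimal Python sketch. The class names `Nil` and `Cons`, the alias `IntList`, and the helper `length` are illustrative assumptions, not something from the original comment. The "1" summand becomes a class with no fields, and the "Int * x" summand becomes a pair of an integer and another list.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Nil:
    """The '1' summand: the empty list carries no data."""


@dataclass(frozen=True)
class Cons:
    """The 'Int * x' summand: a head and the rest of the list."""
    head: int
    tail: "IntList"


# The recursion (the 'mu x') shows up as IntList referring to itself via Cons.
IntList = Union[Nil, Cons]


def length(xs: IntList) -> int:
    """Count the Cons cells before reaching Nil."""
    count = 0
    while isinstance(xs, Cons):
        count += 1
        xs = xs.tail
    return count


example = Cons(1, Cons(2, Nil()))
print(length(example))
```

Even this textual encoding is far longer than the one-line formula, which rather supports the point about the conciseness of mathematical notation.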

There's no lore of the universe that says "graphical languages shall be more space inefficient".

How do you know there isn't such a law? I can imagine an argument along the following line. Let's assume that a graphical language must be represented on a matrix of pixels, where every pixel is black or white. Similarly, let's assume a textual language must be represented on a matrix of characters. There are more possible characters per cell than possible pixel values (just two), so you can store more information in a grid of a given size: so text has a higher information density.

Now let's assume graphical languages can use more colors of pixels, as many as a textual language, say 256. Now we have about the same information density. But I think that programs represented this way would just look like noise. There is not much point in choosing graphics over text here, since they're isomorphic.

I think that when we talk about a graphical language, we're talking about something more complex, where you can distinguish continuous features that span more than one pixel in the grid: things like lines, boxes, etc. But now one logical element, say a line, takes up a lot more space than before. Information density is going to decrease, probably even if you increase resolution to the point of illegibility.

Of course, you can argue along similar lines for text: the "spanning features" might be keywords, for example. But my point is that it isn't obvious that graphics beats text from an information-theoretic perspective.
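The arithmetic behind this argument can be sketched in a few lines of Python. This follows the simplification used above of counting grid cells rather than physical area; the function names and the example numbers (a 10x10 grid, a line covering 10 cells with 4 possible styles) are assumptions chosen purely for illustration.

```python
import math


def bits_per_grid(rows, cols, symbols_per_cell):
    """Upper bound, in bits, on what a grid of independent cells can encode."""
    return rows * cols * math.log2(symbols_per_cell)


def bits_per_cell_of_feature(cells_covered, alternatives):
    """Bits per cell when one feature spans many cells but encodes one choice."""
    return math.log2(alternatives) / cells_covered


mono_pixels = bits_per_grid(10, 10, 2)       # black/white pixels: 1 bit per cell
characters = bits_per_grid(10, 10, 256)      # 256 possible characters: 8 bits per cell
colour_pixels = bits_per_grid(10, 10, 256)   # 256 colours: same bound as characters

# Hypothetical "spanning feature": a line covering 10 cells, chosen from 4 styles.
line_density = bits_per_cell_of_feature(10, 4)

print(mono_pixels, characters, colour_pixels, line_density)
```

The same helper applies to text: a keyword spanning several character cells also lowers the bits carried per cell, which is why the comparison is not obviously one-sided.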

Text tends to be naturally represented by trees; graphical languages tend to be richer in certain respects (e.g. associations in UML)

"Text", to me, means a sequence. If text has a tree structure, then it is because of the text's semantics. Associations in UML have nothing to do with it being graphical; they are part of the semantics. Syntactically, a UML association is just a line with a doodad at one or both ends; that it means "association" is a semantic matter. I can pick any symbol I like and interpret it as an association. Conversely, I can describe a graph textually: that is a matter of syntax.
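As a small illustration of describing a graph textually, here is a Python sketch that reads an edge list written as plain text. The `A -- B` line syntax, the example node names, and the function `parse_graph` are invented for this example.

```python
TEXT = """
Customer -- Order
Order -- Product
"""


def parse_graph(text):
    """Parse lines of the form 'A -- B' into an undirected adjacency map."""
    adjacency = {}
    for line in text.strip().splitlines():
        left, right = (part.strip() for part in line.split("--"))
        adjacency.setdefault(left, set()).add(right)
        adjacency.setdefault(right, set()).add(left)
    return adjacency


graph = parse_graph(TEXT)
print(sorted(graph["Order"]))
```

The text is a flat sequence of characters; the graph structure only appears once the parser assigns meaning to `--`.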

[Decidability of equivalence] is irrelevant to the discussion... When the day comes when you can prove two C programs are equivalent, then I'll start worrying about proving graphs equivalent.

It may be irrelevant to the discussion you want to have, but not to the one I am having. The reason I don't program in C is precisely because I can't prove C programs equivalent; in Haskell, I can, at least for small programs.

A graphical language might be formal, but most aren't, and I can't imagine any good reason why they should be.

No? I find this a rather unusual statement from someone whose publications page lists several papers about model transformations. What is the point of "transforming" something which has no formal description? Are the transforms informal also? Or am I misunderstanding, and the models in question are indeed formal, yet not graphical?

To me it seems clear that, if a graphical language is intended to describe formal objects, namely programs, then it is desirable that the language be formal as well. I cannot imagine why this would not occur to you.

Text != Graphics?

Where do we draw the line between "text" and "graphics"? Both are visual representations in which different combinations of symbols are used to infer meanings. Is "text" different because it uses only "characters"? Or because it is "linear" (i.e. sequential)? How then does one account for the variety of mathematical notation that uses unconventional "characters" and nonlinear arrangements of symbols (a trivial example being a typeset integral equation)? Is a Z schema (which uses all sorts of exotic symbols, and boxes and lines to delineate different elements of the specification) a textual or graphic representation? How about UML that includes OCL?


The "information density" of a particular notation is as much a function of the semantics as it is of the space taken up by individual symbols. Consider the relative information density of an assembler program and an equivalent C program (assuming one could prove "equivalence" :-) - the C program has a semantics that works at a much higher level of abstraction, and can convey significantly more information in a smaller space than can assembler. That is the essence of "high-level" languages (and DSLs). In some ways it is the essence of programming - creating powerful, expressive abstractions that allow one to precisely and concisely describe some aspect of a problem/solution.


IMHO the issue isn't so much "graphical" vs "textual": the key distinction to make is between systems of representation that have a precise and formal semantics (e.g. mathematical notations or boolean gates), and those which have an informally defined (or undefined) semantics (e.g. pseudo-code or UML).

> I think that when we talk a

I think that when we talk about a graphical language, we're talking about something more complex

Not necessarily. Sometimes, yes. But sometimes, no. There's no reason why diagrammatic languages can't be incredibly simple. And some are.

>> A graphical language might be formal, but most aren't, and I can't imagine
>> any good reason why they should be.
> No? I find this rather unusual statement from someone whose publications
> page lists several papers about model transformations. What is the point of
> "transforming" something which has no formal description? Are the transforms
> informal also? Or am I misunderstanding, and the models in question are
> indeed formal, yet not graphical.
>
> To me it seems clear that, if a graphical language is intended to describe
> formal objects, namely programs, then it is desirable that the language be
> formal as well. I cannot imagine why this would not occur to you.

This might be heresy, but the sort of total formality that you're talking about has had almost zero impact on the way things work in the real world. Probably the most common transformation most people on LtU do involves compilers. As you imply in your message, proving anything other than toy Haskell problems is impossible. No one's ever proved GHC correct - and no one ever will. It's just not possible. Does this mean that GHC isn't good and useful? I think a lot of Haskell fans would argue otherwise. [Would I be correct in thinking that GHC spits out C as an intermediate format? If true, maybe you should care about the equivalence of C programs, because you might be using them].

So, nope, the inputs aren't necessarily formally defined, and nor are the transformations. But they're still useful. And that's what counts. I think we tend to forget that formality is a means to an end. Because our current formal mechanisms are often very weak - sometimes just because we don't know how to do better yet, but sometimes for fundamental reasons - as soon as you restrict yourself to that class of things, you basically rule out doing many of the useful, real world things that people want to do. The conclusion I have reached over the years is that it is not worth ruling out useful things just to satisfy what, in the harsh light of day, is really just an esoteric academic itch.

Such blatantly informal arguments! ;-)

[The dialectic rationalist in me comments:]

Is there anyone here who can define formal? Math is hardly formal. Maybe logic is formal? I once read John Macfarlane on formality, but yeah....

Btw, if anything, C programs look pretty formal to me...

[Damned, it's not like I actually want to have an opinion on stuff like this. ;-)]

[Hmpf, should add that I agree with the arguments of Laurence]

Well

Much of maths can be formalized using first order logic. And then there's philosophy of maths...

UML defines how its diagrams are structured, but as far as I understand Frank, it leaves it at just that (intentionally) without defining equality, etc. This is similar to defining a new set of numbers but not any algebraic operations on them. You can now have a matrix of these new numbers, but its usefulness is debatable.

Thus?

Much of maths can be formalized using first order logic. [snap]

And subsequently isn't, with good reason.

UML defines how its diagrams are structured, but as far as I understand Frank, it leaves it at just that (intentionally) without defining equality, etc. This is similar to defining a new set of numbers but not any algebraic operations on them. You can now have a matrix of these new numbers, but its usefulness is debatable.

Uh, I would say elements and operations are mostly defined informally in some natural language. Which is exactly what most mathematicians do. In my opinion, in that respect UML and normal usage of math are pretty similar. Both being informally defined languages.

So who knows what formal is? Again, the closest we have to formal math is symbolic reasoning/computation. Well, if formality is related to rigor, I guess most ASCII source code is more amenable to formal interpretation than a large number of Greek symbols on paper. (At least you will find more machines that agree on the correct interpretation of the former than peer referees who agree on the latter).

It's massive

In my opinion having at hand a formal definition for something means being able to check if a given object matches this definition in a methodical way.

UML is actually formally defined by OMG.
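To illustrate what "checking in a methodical way" could mean for a tiny class model, here is a Python sketch of a well-formedness checker. The two rules (every base class is declared; inheritance is acyclic) and the function name `well_formed` are assumptions made for this example; this is not the OMG metamodel.

```python
def well_formed(classes):
    """classes maps a class name to the list of its base-class names."""
    # Rule 1 (assumed for this sketch): every base must itself be declared.
    for bases in classes.values():
        if any(base not in classes for base in bases):
            return False

    # Rule 2 (assumed for this sketch): inheritance must be acyclic.
    state = {}  # name -> "visiting" while on the current path, "done" afterwards

    def visit(name):
        if state.get(name) == "done":
            return True
        if state.get(name) == "visiting":
            return False  # we came back to a class on the current path: a cycle
        state[name] = "visiting"
        ok = all(visit(base) for base in classes[name])
        state[name] = "done"
        return ok

    return all(visit(name) for name in classes)


print(well_formed({"Order": [], "RushOrder": ["Order"]}))
print(well_formed({"A": ["B"], "B": ["A"]}))
```

A checker like this decides membership ("is this a legal model?") mechanically, but it says nothing about when two legal models are equivalent.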

"Formal" eh? It's defined in

"Formal" eh? It's defined in UML, and it's not at all clear what that buys us. I can't really see what the valid operations that would preserve a given property are, for example. Has anyone ever used the formal model for anything other than checking well-formedness of models?

Yeah

UML is actually formally defined by OMG.

I know that, and I also know that most of industry's initial reaction five years ago was somewhere along the lines of "Why in God's name would you want to do that?" I don't have very strong opinions about it either way.

Yeppers

There needs to be a formal definition if a tool is ever going to work on UML diagrams or assist us in any sense. The definition may not be as formal as we are used to since it uses UML itself instead of a metalanguage as posted above. But its creators must think that the definition is unambiguous.

And I would guess that it is possible to sit down and make it even more formal. However, that's not the point, nor would it help. You can show that a given C++ program with nonsensical variable names and no comments performs quicksort and it is equivalent to or better than another C++ program that performs, say, bubblesort.

Can you say that a group of C++ functions that have meaningful names, arguments and comments is better structured for a certain task than another group of C++ functions (without actually filling in the function bodies)?

It strikes me that I can make a very small part of my formal arguments by mentioning the call graph and the length and types of the arguments, but anything formal or informal past that point will have to mention the plain English text. And this is not because C++ is informally defined (C++ compilers exist); it's just that the text within the comment block or the function/class name is formally undefined, and this analysis depends entirely on this kind of information.
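The split between the machine-checkable part of a signature and the free text can be sketched as follows. The discussion above is about C++; this sketch uses Python purely for illustration, and the stub functions `partition` and `quicksort` and the helper `formal_summary` are hypothetical. The standard `inspect.signature` call recovers the number and types of the arguments, while the name and docstring remain opaque strings to the analysis.

```python
import inspect


def partition(items: list, pivot: int) -> tuple:
    """Split items into those below the pivot and the rest."""


def quicksort(items: list) -> list:
    """Sort items by recursive partitioning."""


def formal_summary(func):
    """Collect only what a tool can check without reading English."""
    sig = inspect.signature(func)
    return {
        "arity": len(sig.parameters),
        "param_types": [p.annotation.__name__ for p in sig.parameters.values()],
        "return_type": sig.return_annotation.__name__,
    }


print(formal_summary(partition))
print(formal_summary(quicksort))
```

Renaming `quicksort` to `f` and deleting its docstring would leave this summary unchanged, even though a human reviewer would lose nearly everything they relied on.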

Heresy

This might be heresy, but the sort of total formality that you're talking about has had almost zero impact on the way things work in the real world.

I didn't realize I was talking to an antiformalist. I don't want to get into this discussion yet again. But just since you mention it:

As you imply in your message, proving anything other than toy Haskell problems is impossible. No one's ever proved GHC correct - and no one ever will. It's just not possible. Does this mean that GHC isn't good and useful? I think a lot of Haskell fans would argue otherwise.

If you are willing to grant that Haskell is useful, then you ought to be convinced of the usefulness of formality. Though Haskell itself lacks a formal definition, languages like Haskell, SML and OCaml (not to mention their compilers) would and could never have been created if the designers did not have a good understanding of formal methods. That holds even if you set aside the fact that ML was developed precisely to do formal work (theorem proving), and that Lisp was inspired by a calculus Church developed for studying logic.

Just because proving the correctness of large programs is (currently) infeasible does not mean that formal methods cannot inform, guide and have a significant impact on such programs. That is like saying that, just because we do not have an ideal, atomic-level description of some system, mathematics is useless for solving physical problems. Tell that to Newton. And to the rest of the scientific community.

You criticize formalists for being fetishistic. But you are the one who's insisting on an all-or-nothing approach, either all formal or all informal.

It's true that if, say, I can prove a core language type-safe then that result does not necessarily extend to an extended language. But I can make a much more convincing informal argument that the entire language is safe by doing that bit formally, and then arguing informally why it ought to extend to the whole language. Such an argument is certainly no worse than an argument that is completely informal, and it satisfies what ought to be a key criterion for any scientific investigation, namely that the results are as objective as possible and make use of the best possible tools.

In other words, compared with an informal approach, a formalist has extra tools at his disposal, and formal methods need not be limiting.

Noise

Now let's assume graphical languages can use more colors of pixels, as many as a textual language, say 256. Now we have about the same information density. But I think that programs represented this way would just look like noise. There is not much point in choosing graphics over text here, since they're isomorphic.

You mean like this? ;-)

    

Yeah, well, 'Hello World' in the Piet language. With a funny side-comment on a language by some E. Meyer.

An analogy to physics

I think a good analogy here would be to physics or engineering: in modeling a dynamic system, it is not uncommon to prepare a simple diagram of the system which shows the basic entities involved, and illustrates their relationships. However, those diagrams are meant mostly as an aid to the reader. The real analytical heavy lifting is always done in textual (i.e. mathematical) form. Even the relationships between entities, which may be represented in diagram form, are formally defined within the equations. I think that diagramming languages can serve a similar purpose in the software world - a handy visual aid, but not the definitive representation of the system.


Having said all of that, it's worth noting that some concepts (especially the more complex ones) just don't appear to lend themselves well to graphical representation. At that point the diagrams become even more like cartoons (or disappear completely), and the math becomes far more dense. As Frank has already noted, it's not clear that graphical notations (especially the existing ones) can adequately represent many of the structures that we encounter in the software world.

A number of exceptions...

1. Physicists routinely use Feynman diagrams to do the analytical heavy lifting; people really don't think of them as DeWitt indices.

2. Category theorists routinely use diagrams in preference to purely equational descriptions of commuting compositions.

3. AI researchers use graphical representations of Bayesian belief networks to describe conditional independence relations.

From these examples, I infer that where diagrams really win as a notation is when a) you can infer something interesting from the absence of a line connecting objects, and b) the graph is sparse. This lets you turn an O(N**2) equational description into an O(N) diagram.
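The O(N**2) versus O(N) claim can be checked with a small count. In this Python sketch, the "equational" description is assumed to need one statement per unordered pair of objects (related or not), while the diagram only draws the lines that are present. The function names and the 10-node chain are illustrative choices.

```python
from itertools import combinations


def equational_size(nodes):
    """One statement per unordered pair: 'related' or 'not related'."""
    return len(list(combinations(nodes, 2)))


def diagram_size(edges):
    """A diagram draws only the edges that exist; absence is implicit."""
    return len(edges)


nodes = list(range(10))
chain = [(i, i + 1) for i in range(9)]  # a sparse graph: a simple chain

print(equational_size(nodes), diagram_size(chain))
```

For 10 nodes that is 45 pairwise statements against 9 drawn lines; the gap widens as N grows, but only while the graph stays sparse.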

Also, as a UI matter, graphical objects with more than about 50 elements become unusable.

In an environment where open

In an environment where open source is building up more and more momentum, it becomes more and more valuable to have as large a readership as possible. Therefore, rather than a physics paper, I think a magazine layout would be more interesting for the reader. Or perhaps a website.

This would use stylesheets to derive a layout and perhaps the diagrams you suggest based on the program text. All of this should/would be customisable for each person in a reader not unlike Emacs but with graphics capabilities.

DrScheme does something very rudimentary in this direction by overlaying arrows to definitions when you mouse over identifiers.

Even more rudimentary are syntax highlighters. Doxygen is also worth noting (though it's read-only).
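The data behind identifier-to-definition arrows like those mentioned above can be derived from program text quite cheaply. The following Python sketch is an assumption-laden toy, not how DrScheme does it: it uses the standard `ast` module, only tracks top-level style `def` names, and ignores scoping. `SOURCE` and `binding_arrows` are names invented for the example.

```python
import ast

SOURCE = """def area(r):
    return 3 * r * r

print(area(2))
"""


def binding_arrows(source):
    """Return (use_line, name, definition_line) for uses of defined functions."""
    tree = ast.parse(source)
    definitions = {
        node.name: node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }
    arrows = []
    for node in ast.walk(tree):
        is_use = isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
        if is_use and node.id in definitions:
            arrows.append((node.lineno, node.id, definitions[node.id]))
    return arrows


print(binding_arrows(SOURCE))
```

A reader application could draw each tuple as an arrow from the use line to the definition line; how to lay that out is a separate, stylesheet-like concern.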