Lambda the Ultimate

XML Is Too Hard For Programmers
started 3/18/2003; 1:11:52 PM - last post 3/23/2003; 3:22:03 PM
Dan Shappir - XML Is Too Hard For Programmers
3/18/2003; 1:11:52 PM (reads: 3259, responses: 33)
XML Is Too Hard For Programmers
via Scripting News

One response has been a suggestion that we need a language whose semantics and native data model are optimized for XML. That premise is silly on the face of it

Certainly a statement I can live with. Tim Bray gives two technical reasons for this; I can think of one general reason: it's contrary to XML's intent as a universal data exchange format.

The rest of the article explains why XML is too hard for programmers, but reads more like why XML is too hard for Perl programmers.
Posted to xml by Dan Shappir on 3/18/03; 1:49:19 PM

Dominic Fox - Re: XML Is Too Hard For Programmers
3/18/2003; 2:25:34 PM (reads: 2634, responses: 0)

Much of the discussion seems to centre around the difficulty of applying regexps to an XML document, viewed from the regexp-engine's perspective as a stream of characters. This is a "natural" way for Perl programmers to attack the problem, as it works very well for many of the other things Perl is used for, but it comes unstuck when you have to reckon with namespaces and other ways in which an actual instance of an XML document might deviate from the canonical representation of the "same" document.

The "traditional" alternatives are i) build a complete representation of the document in memory, then traverse the tree structure via either the DOM or XPath (too clumsy, resource-intensive and cumbersome, according to Bray and some of the others he quotes), and ii) use the event-driven model provided by SAX and implement callbacks (counterintuitive and awkward, especially one gathers in Perl).

I don't see why a sequence of SAX events couldn't be treated the same way as a stream of characters, and processed by a regexp-like engine that matched on event characteristics and data rather than character values. This would enable single-pass processing without having to build a document model, but with better "impedance matching" between the type of data being processed and the pattern engine. In other words, instead of passing each event to a handler as it fired, you would serialize it and place it in a buffer to be picked up by the regexper.

I believe there's also work afoot to define a subset of XPath that will work with SAX (e.g. without needing to be able to 'see into the future' for descendant nodes that would identify the current node as a match), which would enable SAX-based XSLT transformation (with a subset of XSLT using this subset of XPath). This subset of XPath could also be used for pattern matching in a single-pass context, where you just want to grab all values that match a particular set of criteria out of a stream of XML data on the fly.
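A rough sketch of the event-regexp idea in Python (the one-character event codes and the mini-document are my own invention): each SAX event is serialized into a buffer of event codes with its payload kept in a parallel list, and an ordinary regexp then matches on event structure rather than on raw character data.

```python
import re
import xml.sax

class EventBuffer(xml.sax.ContentHandler):
    """Serialize SAX events: one code character per event, payloads alongside."""
    def __init__(self):
        super().__init__()
        self.codes = []     # 'S' = startElement, 'E' = endElement, 'T' = text
        self.payloads = []  # tag name or text content for each event

    def startElement(self, name, attrs):
        self.codes.append("S"); self.payloads.append(name)

    def endElement(self, name):
        self.codes.append("E"); self.payloads.append(name)

    def characters(self, content):
        if content.strip():
            self.codes.append("T"); self.payloads.append(content)

buf = EventBuffer()
xml.sax.parseString(b"<a><b>hi</b><c/></a>", buf)
stream = "".join(buf.codes)       # one character per event, e.g. "SSTESEE"

# Structural pattern "an element that directly contains text": start, text, end.
for m in re.finditer(r"STE", stream):
    print(buf.payloads[m.start()], buf.payloads[m.start() + 1])   # → b hi
```

The regexp engine never sees angle brackets or namespace prefixes; it only sees the event stream the parser has already normalized.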

Sergey Goldgaber - Re: XML Is Too Hard For Programmers
3/18/2003; 4:12:10 PM (reads: 2606, responses: 1)
"XDuce ("transduce") is a typed programming language that is specifically designed for processing XML data. One can read an XML document as an XDuce value, extract information from it or convert it to another format, and write out the result value as an XML document. Since XDuce is statically typed, XDuce programs never yield run-time type errors and the resulting XML documents always conform specified types."

http://xduce.sourceforge.net/

And it's written in OCaml :)

Paul Snively - Re: XML Is Too Hard For Programmers
3/18/2003; 4:29:52 PM (reads: 2587, responses: 0)
I read the essay, and was struck by the repeated assertions that the "callback approach" to XML stream processing was awkward, clumsy, felt unnatural, etc. I don't understand this at all. To me, that says more about the languages/libraries being used than about XML. What about SXML/SXPATH? Is that unnatural? The XML support in Haskell?

Callbacks are only unnatural if you're not familiar with a functional style of programming, IMHO.

Patrick Logan - Re: XML Is Too Hard For Programmers
3/18/2003; 8:36:16 PM (reads: 2536, responses: 0)
I am not sure why XML should be incorporated into a general purpose programming language. XML is just an information transfer syntax.

Beyond XML per se, I wrote a response to this article about what I'd like to see general-purpose language designs aim for.

Chui Tey - Re: XML Is Too Hard For Programmers
3/18/2003; 8:50:55 PM (reads: 2566, responses: 0)
1. Like it or not, programmers have to deal with XML regularly (it's every other day for me now), and having some kind of native support can help ease the pain.

2. Microsoft, for example, can create classes from an XML schema, so that node selection is checked at compile time: a typo like orders.order[0].itme fails to build, whereas selectNodes("orders.order[0].itme") fails only at run time.

3. The XML DOM is extremely verbose and takes a lot of finger-typing.

4. Holding the DOM in memory is not an issue in my case, as I find that most of the XML I'm dealing with is gratuitous XML, as in simple configuration files or data integration projects, where the documents are small.

5. What I'd like to see is something like embedded XSLT (similar to embedded SQL), where I can switch over to XSLT to do some quick transforms and then continue with my program flow.
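For the "quick transform, then carry on" use case: Python's standard library has no XSLT engine (lxml's etree.XSLT is one option), but a few lines of ElementTree often do the job inline. The document shape below is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document.
orders = ET.fromstring("""
<orders>
  <order id="1"><item>widget</item><qty>3</qty></order>
  <order id="2"><item>gadget</item><qty>5</qty></order>
</orders>""")

# The "transform": build a flat summary document from the orders.
summary = ET.Element("summary")
for order in orders.findall("order"):
    line = ET.SubElement(summary, "line")
    line.text = f'{order.findtext("item")} x{order.findtext("qty")}'

print(ET.tostring(summary, encoding="unicode"))
# ...and program flow continues here with `summary` in hand.
```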

Patrick Logan - Re: XML Is Too Hard For Programmers
3/18/2003; 10:01:09 PM (reads: 2612, responses: 0)
Like it or not, programmers have to deal with XML regularly (it's every other day for me now), and having some kind of native support can help ease the pain.

Agreed that programmers have to deal with XML and require better support for that.

I'm not sure what should be made "native". We're still learning about XML and related standards like XSLT, XPath, XML Schema, Relax NG, etc.

This is another situation where an extensible language like Lisp or Scheme, or a dynamic language like Smalltalk, can be applied without overcommitting the core language designers to decisions that make sense today but perhaps not tomorrow.

Ehud Lamm - Re: XML Is Too Hard For Programmers
3/18/2003; 11:49:21 PM (reads: 2585, responses: 0)
XDuce and friends were discussed here many times. Check the archives.

Dan Shappir - Re: XML Is Too Hard For Programmers
3/19/2003; 2:24:34 AM (reads: 2477, responses: 0)
Since I arrived at this link via Scripting News rather than via Slashdot, I failed to mention that Tim Bray was one of the co-authors of the original XML 1.0 specification.

Using regexps to parse XML has several problems besides namespaces. For example, if the XML is coming in over a socket (as it is very likely to do these days), tag names can be broken across packets. Yes, regexps can handle this, but it makes things even more complex.

In general, regexps are not the ideal method for doing parsing, for performance reasons if nothing else.

Creating classes from an XML schema is certainly appealing, and Microsoft also provides the inverse route of serializing objects into XML. However, this approach also has its downsides. Mainly, as I believe Don Box has pointed out, there are some mismatches between the XML Infoset and PL data structures. Adam Bosworth, in an article referenced by Tim Bray, also touches on this issue.

Adam Bosworth also comments on the baroqueness and verbosity of using the DOM (as does Chui Tey). However, using the DOM does have its advantages. Much of its verbosity stems from the fact that it's a cross-PL, cross-platform standard. If you are using MSXML you can even pass DOM objects between PLs and processes (using COM). This of course makes no difference to someone just using Perl.

Many programmers do find SAX awkward, even backward. It may be that the callback approach comes naturally to functional programmers but, as we know, most programmers aren't. BTW, using SAX should also come naturally if you are familiar with interfaces and components.

In .NET Microsoft has come up with XmlReader, which they term "a pull model vs. the SAX push model" (check out a comparison to SAX). Microsoft uses this reader throughout .NET.
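As a rough Python analogue of the pull model: ElementTree's iterparse hands back events only when the application asks for the next one, rather than invoking registered callbacks. The element names here are made up.

```python
import io
import xml.etree.ElementTree as ET

# Pull loop: the application draws each event at its own pace, instead of
# being driven by callbacks the parser pushes into.
doc = io.StringIO("<orders><order id='1'/><order id='2'/></orders>")
ids = []
for event, elem in ET.iterparse(doc, events=("start",)):
    if elem.tag == "order":
        ids.append(elem.get("id"))
print(ids)   # → ['1', '2']
```

The control flow stays in the application's hands, which is the property Microsoft's comparison emphasizes.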

Dan Shappir - Re: XML Is Too Hard For Programmers
3/19/2003; 2:51:41 AM (reads: 2472, responses: 0)
A personal anecdote: I performed a code review of a server application written in Java. The application used XML over sockets to communicate with its clients (something like simplified SOAP). The XML parsing and generation was done using the Xerces DOM implementation.

After profiling the application, I replaced much of the XML parsing and generation code with simple string operations. The resulting code was a bit more brittle (it would require modification for almost any change in the XML schema), but performance improved five-fold.

(I also made sure to concentrate the XML code into a few dedicated functions so that modification due to schema changes would be localized).
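A sketch of the trade-off described above (the element name is hypothetical): brittle, schema-specific string slicing, but concentrated in one dedicated function so that a schema change touches only this code.

```python
def extract_status(message: str) -> str:
    # Brittle by design: assumes exactly one <status> element with no
    # attributes or namespace prefix -- the very assumptions that make
    # this faster than a full DOM parse.
    start = message.find("<status>") + len("<status>")
    end = message.find("</status>", start)
    return message[start:end]

print(extract_status("<msg><status>ok</status></msg>"))   # → ok
```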

Dominic Fox - Re: XML Is Too Hard For Programmers
3/19/2003; 3:04:52 AM (reads: 2449, responses: 1)

Simply binding SAX events to handler functions isn't that tricky, whether you're coding Scheme, Java or Python. I shouldn't think it's that hard in Perl either (I haven't tried). What's more difficult is managing switching between different handlers based on context; once you get beyond a trivial level of complexity, you start having to build FSM-type frameworks to keep on top of it. This isn't hard, exactly, but it's less straightforward than you'd want for many purposes.

By way of illustration: suppose we have a Controller object and a bunch of objects implementing IContentHandler and IState. Events are picked up by the Controller object and passed to the current State object, which can choose either to handle the content passed to it, have the Controller push it on to a stack and switch to another state (when we go down a level into a nested tag), or pop the previous state off the stack and switch to that (when we encounter an EndElement event). The DTD / Schema can be enforced by each State object checking the events passed to it and objecting to any content which is invalid for the current context.
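A minimal Python sketch of that Controller/State arrangement (class and element names are mine): each state lists the child elements valid in its context, and the controller keeps the stack, pushing on startElement and popping on endElement.

```python
import xml.sax

titles = []   # collected output, for demonstration

class State:
    transitions = {}                      # valid child tag -> state class
    def on_text(self, text):
        pass

class TitleState(State):
    def on_text(self, text):
        titles.append(text)

class BookState(State):
    transitions = {"title": TitleState}

class RootState(State):
    transitions = {"book": BookState}

class Controller(xml.sax.ContentHandler):
    def __init__(self, root_state):
        super().__init__()
        self.stack = [root_state]
    def startElement(self, name, attrs):
        state_cls = self.stack[-1].transitions.get(name)
        if state_cls is None:             # poor man's DTD enforcement
            raise ValueError(f"<{name}> not valid in this context")
        self.stack.append(state_cls())    # descend: push the child's state
    def endElement(self, name):
        self.stack.pop()                  # ascend: pop back to the parent
    def characters(self, content):
        if content.strip():
            self.stack[-1].on_text(content.strip())

xml.sax.parseString(b"<book><title>Lambda</title></book>", Controller(RootState()))
print(titles)   # → ['Lambda']
```

Even for this toy vocabulary the transition tables are pure boilerplate, which is the part that ought to be generated from the DTD.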

I've built an apparatus like this manually a couple of times, which is often enough to realize that it ought to be possible to assemble one automatically from the DTD; I believe .Net already has a tool to do this, and there's a Python library I keep meaning to look into that's meant to do the same thing.

The lesson I draw from this is that certain kinds of XML-related work are too cumbersome for programmers to have to do them repeatedly by hand, but that the answer to this isn't necessarily either simplification of XML or the creation of custom languages (although either might be a valid route for certain purposes) but better ways of encapsulating and abstracting away from the boilerplate-and-busywork part of the task.

Ehud Lamm - Re: XML Is Too Hard For Programmers
3/19/2003; 3:35:32 AM (reads: 2466, responses: 0)
better ways of encapsulating and abstracting away from the boilerplate-and-busywork part of the task

As you mentioned, there are tools for various languages that try to assist with this task.

However, I'd be interested in hearing what you think is missing from the abstraction facilities in, say, C++ or Java, for encapsulating the mechanisms you are thinking about.

Dominic Fox - Re: XML Is Too Hard For Programmers
3/19/2003; 4:51:56 AM (reads: 2426, responses: 2)

In this context I was thinking of code that writes code: automatic class generation tools, a la .Net, for instance. For all I know there may be libraries available in C++ and Java for doing this (as I mentioned, there is certainly a Python DTD-to-native-classes tool); there's nothing I know of in the standard Java libraries, though (and C++ I don't use much at all). My point was that writing FSMs to manage context-switching for a SAX-based parser isn't a task on which human ingenuity ought to be repeatedly expended - it should prove to be fairly automatable. The human programmer would come in at the end and specialise the machine-generated code to perform the desired tasks, either by modifying it directly or by using a mechanism such as inheritance plus overriding to fill in the blanks (as in the GoF Template pattern).

So far as useful language mechanisms are concerned, the most useful (that C++ and Java lack) are those that would support continuations, so that the "push" mechanism of SAX could be used by a "pull"-based parser.

I'm not well up on continuations myself, so this is likely to be a fairly fuzzy account, but basically it goes like this: when you invoke a SAX parser, and handle events by registering functions for callbacks, the flow of execution goes back and forth between the parser and the callback functions. What's needed for a "pull" mechanism is the ability to interrupt this flow inside one of the callback functions and jump out to an external context, which then in its own time allows the callback function to resume (and return control to the parser). From the perspective of this external context, it only allows the parser to resume when it wants a new event to process, so it can behave as if it were obtaining data on request rather than being driven by events.

Yield in Python and Ruby would be one mechanism for this; from my so far limited reading on FP and continuations, continuation passing would be another. The problem with Yield is that I'm not sure how you would yield to the caller of the function that kicked off the SAX parser from inside the function that was handling the SAX events: yield in Ruby typically yields to a block passed in to the function, and unless you could pass this block to the parser and get it to pass it along to the callback function it's not obvious how the callback function would access it.

One of my colleagues suggested earlier that placing the data you wanted to pass back in a synchronized and shared location and yielding to a separate thread would work (and work for Java and C++ as well); but this seems a little heavyweight for the purpose.
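That thread-based suggestion can be sketched directly in Python (the document and queue size are arbitrary): the SAX callbacks run in a worker thread and place each event in a synchronized queue, and the main thread pulls events at its own pace, turning push into pull.

```python
import queue
import threading
import xml.sax

class QueueingHandler(xml.sax.ContentHandler):
    """Callbacks don't process anything; they just enqueue events."""
    def __init__(self, q):
        super().__init__()
        self.q = q
    def startElement(self, name, attrs):
        self.q.put(("start", name))
    def endElement(self, name):
        self.q.put(("end", name))

events = queue.Queue(maxsize=8)      # small buffer: the parser blocks when full

def parse():
    xml.sax.parseString(b"<a><b/><c/></a>", QueueingHandler(events))
    events.put(None)                 # sentinel: parsing is finished

worker = threading.Thread(target=parse)
worker.start()

pulled = []
while (event := events.get()) is not None:
    pulled.append(event)             # the consumer asks for each event
worker.join()
print(pulled)
```

It does work for Java and C++ too, as the colleague said, though a thread per parse is indeed heavyweight next to a coroutine or continuation.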

Ehud Lamm - Re: XML Is Too Hard For Programmers
3/19/2003; 5:29:03 AM (reads: 2452, responses: 0)
It's ok, others here are pretty much continuation mavens.

So basically what you want is continuations and a macro system.

All together now: Scheme!!!

Frank Atanassow - Re: XML Is Too Hard For Programmers
3/19/2003; 7:17:18 AM (reads: 2375, responses: 0)
I have two points to make.

First, I think the perceived difference between SAX-like event-style APIs and DOM-like tree-style APIs is mostly irrelevant vis-a-vis performance. The reason is that it simply isn't the case that event-style APIs automatically give you better performance than tree-style ones for large documents.

If an XML-processing application needs to keep track of context information, then memory use will be bounded proportional to the size of the input document, whether or not you write it in an event-style framework. The only difference is that in an event-style framework, the programmer needs to program and manage his own stack of context information whereas with a tree-style framework he can co-opt the stack where function activation frames are stored.

If the XML-processing application doesn't need to keep track of context information, then you can probably get basically the same performance as an event-style API from a tree-style API by using an implementation which reads the document in lazily rather than all at once (i.e., keeping it in memory). I admit I'm not sure how you could arrange it so that the context (XML document minus the subtree you are processing) could be freed as early as possible, but maybe this could be done with Huet's Zipper somehow.

So, my point is that the performance isn't really a function of the API, but rather of the algorithm which is using the API and the API's implementation.

Second, why is it acceptable for XML document writers to dump all their data willy-nilly in one file of several gigabytes when it is unacceptable for programmers to dump all their program source in one file? Programming language compiler writers typically assume that programmers will be smart enough to modularize their programs, so they write their compilers on the assumption that a source file will fit in memory all at once and try to support separate compilation. Is there some compelling reason why the same should not be expected of XML document writers?

Yeah, yeah, I know, they aren't programmers, so we want to be as nice to them as possible. But any non-trivial XML application is going to make essential use of the tree structure, which means that, as I explained above, regardless of what API you use, it will need memory proportional to the size of the input document. There just is no solution for this problem: the only thing you can do is require document writers to modularize so that the problem can be broken into separately computable parts.

The way to do this is to create interfaces for your documents and make sure they are respected by the XML processor. For example, if you are formatting a book, modularize it by breaking it into chapters, which are separately processible, and make sure that when you process a chapter you return information like page numbers, TOCs and indexed words which is needed for global processing. Then you can forget the rest of the chapter contents immediately, and consequently you only need to keep one chapter in memory at a time. Then you have another, global processing step (analogous to linking in PLs) which ties the chapters together.
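That chapter-at-a-time discipline can be sketched with a lazy parse in Python (element names are illustrative): iterparse visits each chapter's subtree in turn, the per-chapter summary is retained, and the subtree itself is then freed.

```python
import io
import xml.etree.ElementTree as ET

book = io.StringIO(
    "<book>"
    "<chapter><p>one</p><p>two</p></chapter>"
    "<chapter><p>three</p></chapter>"
    "</book>")

paragraph_counts = []                    # the "page numbers / TOC" analogue
for event, elem in ET.iterparse(book, events=("end",)):
    if elem.tag == "chapter":
        paragraph_counts.append(len(elem.findall("p")))
        elem.clear()                     # forget the chapter's contents now
print(paragraph_counts)   # → [2, 1]
```

Only one chapter's subtree is ever fully in memory; the retained summary plays the role of the interface between the per-chapter and global steps.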

Dominic Fox - Re: XML Is Too Hard For Programmers
3/19/2003; 8:29:07 AM (reads: 2366, responses: 0)
If an XML-processing application needs to keep track of context information, then memory use will be bounded proportional to the size of the input document, whether or not you write it in an event-style framework. The only difference is that in an event-style framework, the programmer needs to program and manage his own stack of context information whereas with a tree-style framework he can co-opt the stack where function activation frames are stored.

There are scenarios in which we might be able to discard context information as we go along, for instance if we only wish to know the "path" to the current node (its ancestry: seventh child of the seventh child of the root element, for instance) and what elements it "inherits" from. In that case, memory usage is bounded proportional to the maximum depth to which elements are nested in a document, rather than by the overall length of the document itself. This would be a candidate for either a SAX-based model or perhaps, as you suggest, a "lazy" DOM.

If I'm interested in all <foo> elements which occur as sub-elements of a <bar> element, then I don't need (all of) a DOM (at once) but I do need some (in this case very trivial) context tracking. There are obviously more complex cases. On the occasions when I've wanted to do this kind of thing, it's been for purposes of serializing an object hierarchy where objects in a container typically don't know (or need to know) about their siblings; the idea here is to go from SAX events straight to custom object model, without having to build a DOM in the interim. So there are some common cases where it does matter what you use, from a performance / resource usage perspective.
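The <foo>-under-<bar> case can be sketched as a Python SAX handler whose only state is a stack of open element names, so memory is bounded by nesting depth rather than document length (the sample document is mine):

```python
import xml.sax

class FooUnderBar(xml.sax.ContentHandler):
    """Count <foo> elements occurring anywhere inside a <bar> element."""
    def __init__(self):
        super().__init__()
        self.path = []        # ancestry of the current node: depth-bounded
        self.matches = 0
    def startElement(self, name, attrs):
        if name == "foo" and "bar" in self.path:
            self.matches += 1
        self.path.append(name)
    def endElement(self, name):
        self.path.pop()

h = FooUnderBar()
xml.sax.parseString(b"<r><bar><foo/><x><foo/></x></bar><foo/></r>", h)
print(h.matches)   # → 2 (the last <foo> is outside any <bar>)
```

No DOM is built; the stack grows and shrinks with the nesting, never with the number of elements.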

paul jensen - Re: XML Is Too Hard For Programmers
3/19/2003; 10:34:41 AM (reads: 2301, responses: 0)
Omnimark (http://www.omnimark.com) was a beautiful language for working with structured text in general, and SGML and XML specifically. It was especially nice for doing the sort of thing Mr. Bray appears to be doing: processing very large datasets, picking out only the elements of interest, and letting the rest "pass through" the program.

It was available for free while I worked there, but alas it is no longer. Too bad...

James Hague - Re: XML Is Too Hard For Programmers
3/19/2003; 11:12:02 AM (reads: 2288, responses: 0)
REBOL has html/xml tags as a type. You can load data and parse it into tags and text with a single command.

Oleg - Re: XML Is Too Hard For Programmers
3/19/2003; 1:23:10 PM (reads: 2252, responses: 0)
Perhaps the following paper will be of relevance to this discussion:

This paper demonstrates how a higher-level, declarative view of XML parsing as folding over XML documents has helped to design and implement a better XML parser. By better we mean a full-featured, algorithmically optimal, pure-functional parser, which can act as a stream processor. By better we mean an efficient SAX parser that is easy to use, a parser that does not burden an application with the maintenance of a global state across several callbacks, a parser that eliminates classes of possible application errors.

This paper describes such better XML parser, SSAX. We demonstrate that SSAX is a better parser by comparing it with several XML parsers written in various (functional) languages, as well as with the reference XML parser Expat. In the experience of the author the declarative approach has greatly helped in the development of SSAX. We argue that the more expressive, reliable and easier to use application interface is the outcome of implementing the parsing engine as an enhanced tree fold combinator, which fully captures the control pattern of the depth-first tree traversal.

PADL02. LNCS volume 2257. The paper is available from Springer-Verlag http://link.springer.de/link/service/series/0558/tocs/t2257.htm and also from http://pobox.com/~oleg/ftp/papers/XML-parsing.ps.gz

Incidentally, the article "On parent pointers in SXML trees" mentioned recently on this log shows several advanced examples of XML SAX-style parsing (custom parsers that maintain node dictionaries and parent links as they parse the document).
http://pobox.com/~oleg/ftp/Scheme/parent-pointers.scm
http://pobox.com/~oleg/ftp/Scheme/parent-pointers.txt

Frank Atanassow - Re: XML Is Too Hard For Programmers
3/20/2003; 7:43:06 AM (reads: 2119, responses: 1)
I realized after I went home yesterday that there is a big error in my post above: the space complexity of an XML processor will generally be bounded by the log of the number of elements in the input document, rather than linear, because the structure is a tree. OTOH, the number of elements is not a very good measure of document size and also, since XML is text, you cannot access a sibling node without traversing all the descendants of a node, so maybe the real complexity is somewhere in between linear and log-like.

There are scenarios in which we might be able to discard context information as we go along, for instance if we only wish to know the "path" to the current node (its ancestry: seventh child of the seventh child of the root element, for instance) and what elements it "inherits" from. In that case, memory usage is bounded proportional to the maximum depth to which elements are nested in a document, rather than by the overall length of the document itself.

That is the sort of context info I'm talking about; the depth is on average the log of the number of elements.

But, yeah, there are certainly simple applications where you can ignore the tree structure. But, for those applications, XML isn't really much of a win, is it?

BTW, I think it is rather funny that Bray calls the idea of an XML-specific programming language stupid and then goes on to suggest extending Perl with XML-specific features.

Ehud Lamm - Re: XML Is Too Hard For Programmers
3/20/2003; 11:09:30 AM (reads: 2126, responses: 0)
BTW, I think it is rather funny that Bray calls the idea of an XML-specific programming language stupid and then goes on to suggest extending Perl with XML-specific features.

I am not sure this is as stupid as it sounds. Isn't that the difference between DSLs and DSELs, i.e., stratified design?

Frank Atanassow - Re: XML Is Too Hard For Programmers
3/20/2003; 11:18:58 AM (reads: 2051, responses: 1)
I think the idea behind DSELs is to exploit existing features of programming languages, ones which have general application, not to extend a language with specific support for some buzzword technology.

Dominic Fox - Re: XML Is Too Hard For Programmers
3/20/2003; 12:16:06 PM (reads: 2030, responses: 0)

I was trying to work out what I thought was wrong with Frank's statement that the depth of an XML document tree was on average the log of the number of elements.

Possibly nothing at all, but it did strike me that an XML document of element-nesting-depth n can have n plus any number of elements in it, which means in effect that if the stack you have to maintain for a SAX parser is only deep in proportion to the maximum depth of nested elements in the document, then the document can be any length you like without the parser's resource usage increasing.

In the case of Topic Maps, for instance, most of the complexity in the document takes the form of cross-references between nodes via xlink:href. The elements are not nested deeply at all; yet there may be thousands of topics in the document. If you want to know which topic element is the grandparent of a given <baseNameString> element, then you need a stack a few items deep to keep track of the state during SAX-based parsing, but the stack doesn't need to expand at all to accommodate larger documents.

In other words, while it's true that as the maximum depth of nested elements in a document increases the overall number of elements in the document increases, the reverse is not true and a parser's resource utilization may still be bounded proportional to the document's maximum depth without being bounded in direct proportion to the overall size of the document.

Ehud Lamm - Re: XML Is Too Hard For Programmers
3/20/2003; 12:29:58 PM (reads: 2089, responses: 0)
Sure, you exploit existing features, and you extend the language with constructs that are helpful for your chosen domain.

So obviously you want to start from a solid base language (which, perhaps, disqualifies Perl) and you should extend the language with features that are indeed relevant (which, by definition, excludes buzzword technology).

Notice I didn't say I agree with Tim Bray, only that the basic notion is not entirely without substance.

Noel Welsh - Re: XML Is Too Hard For Programmers
3/21/2003; 1:22:36 AM (reads: 2049, responses: 0)

My point was that writing FSMs to manage context-switching for a SAX-based parser isn't a task on which human ingenuity ought to be repeatedly expended...So far as useful language mechanisms are concerned, the most useful (that C++ and Java lack) are those that would support continuations

I agree that SAX forces one into writing these painful FSMs, but I don't think continuations are necessary to solve the problem. The SSAX parser is, in my experience, a far better solution than the SAX parser system. Oleg has already pointed out his paper on SSAX. For a different look at it you might find our tutorial on SSAX useful.

Consider an XML tree. There are two directions we may want to pass information in: down the tree to child nodes, or up the tree to parent nodes. Assume that each time we visit a node we call a function. The natural way to pass information down the tree is to use arguments to functions (assuming we're visiting the nodes in a left-to-right depth-first manner). The natural way to pass information up is to use the normal stack (ie function return values). SAX allows neither, SSAX allows both. Hence programming using SSAX is very natural.
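The down-via-arguments, up-via-return-values pattern can be illustrated with a plain tree fold in Python (over an ElementTree rather than SSAX's Scheme interface): here `depth` flows down through an argument, and the deepest nesting level flows back up through return values, with no mutable handler state.

```python
import xml.etree.ElementTree as ET

def fold(elem, depth):
    # Information down: each child sees depth + 1 as an argument.
    child_results = [fold(child, depth + 1) for child in elem]
    # Information up: return this subtree's deepest level to the caller.
    return max([depth] + child_results)

tree = ET.fromstring("<a><b><c/></b><d/></a>")
print(fold(tree, 1))   # → 3 (deepest nesting: a > b > c)
```

With SAX, computing the same value requires a counter mutated across startElement/endElement callbacks; the fold keeps both flows of information in the function signature.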

In my humble opinion XML isn't too hard for programmers; the tools available in Perl are too primitive.

Frank Atanassow - Re: XML Is Too Hard For Programmers
3/21/2003; 4:17:56 AM (reads: 1986, responses: 0)
I was trying to work out what I thought was wrong with Frank's statement that the depth of an XML document tree was on average the log of the number of elements.

Possibly nothing at all, but it did strike me that an XML document of element-nesting-depth n can have n plus any number of elements in it, which means in effect that if the stack you have to maintain for a SAX parser is only deep in proportion to the maximum depth of nested elements in the document, then the document can be any length you like without the parser's resource usage increasing.

Sure, if you let the branching factor vary, then a tree of any depth (greater than one) can have any number of nodes, so we assume the average branching factor is k: if the average node has k children, then the average tree of n nodes will have depth log_k(n).
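A quick numeric check of that claim, assuming a complete k-ary tree: with branching factor k and depth d there are (k^d - 1)/(k - 1) nodes, so the depth stays close to log_k of the node count.

```python
import math

def nodes_in_complete_tree(k, depth):
    # Sum of a geometric series: 1 + k + k^2 + ... + k^(depth-1).
    return (k**depth - 1) // (k - 1)

k, depth = 3, 5
n = nodes_in_complete_tree(k, depth)
print(n, math.log(n, k))   # the depth (5) tracks log_3(n)
```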

In the case of Topic Maps, for instance, most of the complexity in the document takes the form of cross-references between nodes via xlink:href. The elements are not nested deeply at all; yet there may be thousands of topics in the document. If you want to know which topic element is the grandparent of a given <baseNameString> element, then you need a stack a few items deep to keep track of the state during SAX-based parsing, but the stack doesn't need to expand at all to accommodate larger documents.

Sure, but the stack depth is irrelevant for the memory requirements of that application because the xlink:href stuff will be stored in the heap. In effect, you have just moved the complexity to a different place. If topic maps encoded at least part of the cross reference graph via nesting, say by forcing forward edges in a depth-first traversal to be encoded that way and backward and cross edges as xlink:href, then you would see a rise in stack space requirements again, though of course it depends how you do your traversal. (Caveat: I dunno anything about topic maps; maybe they already work like that.)

This is just proving what I said, though: the memory requirements depend on the application, not how you process the syntax. (Of course, you could engineer an XML vocabulary which requires lots of extraneous nesting and artificially pump the stack space requirements, but that's pathological.)

In other words, while it's true that as the maximum depth of nested elements in a document increases the overall number of elements in the document increases,

No, there is no relationship there unless you fix the branching factor.

Dominic Fox - Re: XML Is Too Hard For Programmers
3/21/2003; 4:59:46 AM (reads: 1972, responses: 0)

It may be folly to cling on to this, but a couple of quick points:

  • If you increase the maximum depth of nested elements in a document by one, the overall number of elements in the document must increase by at least one (since you have to add an element somewhere where there was previously a leaf node in order for the depth to increase). The reverse is not true.
  • If your application uses the DOM, then memory requirements depend on the document size and the efficiency of the DOM; there's no scope to use anything less than all of what the DOM supplies. If your application doesn't need everything, then processing the syntax a different way can reduce memory usage for certain kinds of document. SAX (for example) gives the wily application programmer some discretion about resource utilization, which in a lot of cases can be used to good effect. In other words, only if you're not building the entire tree in memory as a first step in processing is it true that the memory requirements depend on the application and not on how you process the syntax.

Dimitre Novatchev - Re: XML Is Too Hard For Programmers
3/21/2003; 6:55:13 AM (reads: 1956, responses: 0)
I already informed the author in a personal communication that he missed a considerable group of programmers, who are happy to deal with XML -- the language they use is XSLT.

Frank Atanassow - Re: XML Is Too Hard For Programmers
3/23/2003; 10:00:15 AM (reads: 1862, responses: 2)
If you increase the maximum depth of nested elements in a document by one, the overall number of elements in the document must increase by at least one (since you have to add an element somewhere where there was previously a leaf node in order for the depth to increase).

Nope. Here is a tree of depth 2 with three nodes: a(b,c). Here is a tree of depth 3 with the same three nodes: a(b(c)).

The problem with your logic is that you are unclear on what you mean by "increase the maximum depth" and, I suspect, your implicit choice of definition already assumes adding an element somewhere.
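The counterexample can be checked mechanically. A quick Python sketch, encoding trees as nested (label, children) pairs (the encoding is mine, purely illustrative):

```python
def count(tree):
    """Total number of nodes in the tree."""
    label, children = tree
    return 1 + sum(count(c) for c in children)

def depth(tree):
    """Length of the longest root-to-leaf path."""
    label, children = tree
    return 1 + max((depth(c) for c in children), default=0)

flat = ("a", [("b", []), ("c", [])])   # a(b,c)
deep = ("a", [("b", [("c", [])])])     # a(b(c))

print(count(flat), depth(flat))  # 3 2
print(count(deep), depth(deep))  # 3 3 -- deeper, same node count
```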

If your application uses the DOM, then memory requirements depend on the document size and the efficiency of the DOM; there's no scope to use anything less than all of what the DOM supplies. If your application doesn't need everything, then processing the syntax a different way can reduce memory usage for certain kinds of document.

Then the DOM is overspecified. If the DOM, for example, allows random access to nodes or traversing from child to parent or something, then it is not modeling a tree anymore but a graph, and resource requirements will naturally increase. Another way to think about it is that there is some core part of the DOM which is the real API plus some utility functions, and a lot of crud which is part of the application.

A good API is one that imposes the least requirements on its implementations. If the API is for modeling a tree, then what it needs to do is allow the client to write all the computable functions which can process a tree. (This is exactly the role of fold combinators in a functional language.) If it allows specifying any more functions, then it is doing more than it needs to do, and consequently you can factor it into a core part and a non-core part.
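A rough illustration of the fold-combinator point in Python (the combinator and sample tree are mine, not from any XML API): one fold is enough to express any bottom-up tree computation, so it is a candidate for the "core part" of a tree API.

```python
def fold(tree, f):
    """Fold combinator for rose trees: apply f bottom-up as
    f(label, list_of_child_results)."""
    label, children = tree
    return f(label, [fold(c, f) for c in children])

tree = ("a", [("b", [("c", [])]), ("d", [])])

# Different applications are just different choices of f over one core API:
size   = fold(tree, lambda lbl, rs: 1 + sum(rs))
height = fold(tree, lambda lbl, rs: 1 + max(rs, default=0))
markup = fold(tree, lambda lbl, rs: "<%s>%s</%s>" % (lbl, "".join(rs), lbl))
print(size, height, markup)  # 4 3 <a><b><c></c></b><d></d></a>
```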

But, anyway, I think I understand the source of our disagreement.

For you, an "XML document" is just a piece of data, and you equip it with operations which make it into a tree, a graph (DOM) or a sequence (of events---SAX) depending on the situation.

For me, an "XML document" is always a tree. That is, I define it to be a data structure equipped with a set of operations which make it into a tree. If I want to treat it as a graph, I will transform it into one, but then it is no longer an XML document to me; it's a "DOM document" or something. Conversely, if I want to treat it as a sequence for efficiency reasons, I will avoid building the tree in memory by treating the input file as a simple text file, and I don't think of it as an XML file then.

So my assumption that most XML applications exploit the tree structure is arguably predicated on the fact that I would never use XML to encode something which doesn't have some tree structure that I want to exploit.

I admit that your view, that an XML document does not uniquely define the operations available to XML processors, is probably more in line with how the XML community, and in particular Tim Bray, regards XML.

But for me it's problematic because it makes XML files less portable. An XML author has to know what structural information in the document will be available to processors; if he doesn't then the semantics of the document become unclear.

For example, I think some versions of the DOM make available the comment text of an XML document to the processor. This is a fact which the person writing an XML file has to know: can I encode information in comments, or not? Personally, I would never do so, because my model is that the XML document defines a tree of elements (and attributes), and I assume that stuff like comments is not part of the semantics. I can make this assumption because I have a particular class of processors in mind when I write my document.

There are lots of problems like this with XML, problems which arise because it is ambiguous what information the processor has access to. Another example is, does the processor have access to the order in which attributes appear? If so, then I can encode information using that ordering.
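The comment ambiguity is easy to demonstrate with two parsers from Python's standard library (my own example document): a DOM-style parser hands the comment to the processor, while ElementTree's default parser silently drops it, so a document that encodes information in comments means different things to different processors.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

doc = "<root><!-- secret --><child/></root>"

# minidom (DOM-style) exposes the comment as a child node:
dom = xml.dom.minidom.parseString(doc)
kinds = [n.nodeType for n in dom.documentElement.childNodes]
has_comment = xml.dom.minidom.Node.COMMENT_NODE in kinds
print(has_comment)  # True

# ElementTree's default parser discards it entirely:
root = ET.fromstring(doc)
print(len(root))  # 1 -- only <child/> survives
```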

Dominic Fox - Re: XML Is Too Hard For Programmers  blueArrow
3/23/2003; 12:10:09 PM (reads: 1845, responses: 0)

Thank you for this, Frank: you have successfully reduced our disagreement to a matter of semantics, and shown clearly how mine differ from yours. Yours, incidentally, are precise in some areas where mine are not; usefully so, I think.

Chui Tey - Re: XML Is Too Hard For Programmers  blueArrow
3/23/2003; 3:22:03 PM (reads: 1842, responses: 0)
JavaScript and Perl are two languages that have native support for regexes, which is why I believe people such as Tim Bray find regexes the perfect hammer for the XML nail.

This is why I believe languages should offer some kind of XML support beyond the DOM. One example might be C# attributes that handle SAX/XPath using callbacks with pull-DOM support, e.g.

[SaxHandler("/Root/Tag1/Tag2")]
public void handleTag2(Node context) {
    ... do DOM processing here ...
}

Languages like Perl might have Regexes which grok XML like:

while (<INPUTFILE> =~ XML("/Root/Tag1/Tag2")) { $_->childNodes ... }
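Something close to this hypothetical operator can be sketched today with Python's xml.etree.ElementTree.iterparse (the stream_match helper is my own illustration, not a standard API): stream the document and yield each element whose simple absolute path matches, without building the whole tree up front.

```python
import io
import xml.etree.ElementTree as ET

def stream_match(source, pattern):
    """Yield elements whose absolute path (e.g. '/Root/Tag1/Tag2')
    matches pattern; each is yielded once its subtree is fully parsed."""
    stack = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if "/" + "/".join(stack) == pattern:
                yield elem
            stack.pop()

doc = io.StringIO("<Root><Tag1><Tag2>x</Tag2><Tag2>y</Tag2></Tag1></Root>")
print([e.text for e in stream_match(doc, "/Root/Tag1/Tag2")])  # ['x', 'y']
```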

I'm embarrassed to say I'm too lazy and busy to learn XSLT extension functions. Will writing these help glue XSLT with my class libraries?

Dan Shappir - Re: XML Is Too Hard For Programmers  blueArrow
3/25/2003; 12:36:35 PM (reads: 1710, responses: 1)
While all of Frank's statements are true, as usual, I take a slightly different view of the issue.

Then the DOM is overspecified. If the DOM, for example, allows random access to nodes or traversing from child to parent or something, then it is not modeling a tree anymore, it is modeling a graph, and resource requirements will naturally increase.

Indeed, but this is intentional. The DOM was designed to be the high-level API that provides all the functionality you might need in manipulating the InfoSet. Being able to move up the tree is just one example of this. Another is the ability to use an XPath query to generate a collection of nodes you can traverse as a list.

As a result, a programmer who sets out to manipulate the InfoSet in any way can go for the DOM, because she knows that whatever her future needs, the DOM will provide the required functionality. There is obviously a price to pay in terms of efficiency with this approach, but in most practical cases I don't think it is such an issue (despite the fact that I have presented such an example myself).

If, after profiling, you come to the conclusion that the DOM is just too expensive, you can go for a lighter representation. There are numerous DOM alternatives around, and you can always create one yourself, working over SAX for example.
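One common lighter-than-DOM pattern, sketched in Python (the sample document is mine): parse incrementally and discard each subtree as soon as it has been processed, so memory stays bounded no matter how long the document is.

```python
import io
import xml.etree.ElementTree as ET

doc = io.StringIO("<log><rec v='1'/><rec v='2'/><rec v='3'/></log>")

total = 0
for event, elem in ET.iterparse(doc, events=("end",)):
    if elem.tag == "rec":
        total += int(elem.get("v"))
        elem.clear()  # drop the processed subtree; memory stays bounded
print(total)  # 6
```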

But for me it's problematic because it makes XML files less portable. An XML author has to know what structural information in the document will be available to processors; if he doesn't then the semantics of the document become unclear.

Exactly, but in most cases IMO this is a non-issue. I believe that in the real world the exact structural information is known to the processors because the XML syntax is defined per project. That is, most projects do not have to deal with general XML or schemas. Instead, a concrete XML syntax is defined and that XML is used throughout.

An interesting question, as a result, is why use XML at all? If interop may not be an issue, why deal with all the angle brackets? Why not just use a proprietary format that may be both simpler and more efficient? There are several reasons:

1. XML Hype - "yeah boss, we use industry standard XML in our project". Despite the fact that the XML syntax is totally proprietary and nobody bothered to create a DTD or Schema for it.

2. Net friendly - XML can go anywhere and be read by anything. So by using XML you avoid being locked in somebody's trunk.

3. Hopes of future interop - even though the syntax is proprietary for now, there is still hope for the future using tools such as XSLT.

4. Free parser - yes, despite the fact that parsing has been around for years, people are wary of creating parsers. In fact, if you look at Tim Bray's original issue you'll see that it stems from his decision to roll his own parser, and to do it using regexps.

Frank Atanassow - Re: XML Is Too Hard For Programmers  blueArrow
3/26/2003; 4:40:29 AM (reads: 1736, responses: 0)
If, after profiling, you come to the conclusion that the DOM is just too expensive, you can go for a lighter representation. There are numerous DOM alternatives around, and you can always create one yourself, working over SAX for example.

But in choosing an alternative, you have to rewrite all the code which uses the DOM whereas, if it were factored into a core and a non-core part, you would only have to rewrite the part which uses the non-core part.

Exactly, but in most cases IMO this is a non-issue.

Are you invoking the 80/20 rule?

First, if it is a non-issue because everybody somehow magically has the same model of "intended semantics" in mind (above and beyond the syntax), say that a document is a tree or a graph or whatever, then surely there is no problem making that fact explicit in the standard. Such an explicit semantics would cover 80% of the intended uses. And yet people like Bray insist that there is nothing beyond the syntax, apparently because they are afraid XML will be inapplicable for the remaining 20% of applications. So apparently it is not a non-issue for them.

As it stands, AFAIK there is no API except the Haskell XML Toolbox which does not tighten up the semantics to something more than the syntax. The result is that there are no portable XML applications written with those APIs at all, because they all make assumptions about the document semantics which are not supported by the XML standard. Instead of XML applications, we have portable DOM applications, SAX applications, etc...!

Second, it is not the case that everybody agrees on the intended semantics. I've been working on a model of XML Schema and, since the syntactic model is way too low-level and uninteresting, I've been trying to use the so-called "intended semantics". In some cases (for example, interleaving), it has been unclear to me, and so I've been asking my colleagues what they think the semantics "ought to be" and, you know what, they don't agree!

Without a clear, well-defined and effective semantics for Schema, you cannot prove Schema applications correct. You may feel that this is not very important, because no one ever proves Real Life applications correct and this is largely true. But part of the reason no one ever does it is precisely because it becomes impossible when there is no good semantics for the APIs and protocols that an application depends on. So it's a self-fulfilling prophecy!

You have to cut the loop somewhere.

Anyway, any system which is ambiguous or difficult to reason about is a breeding ground for buggy applications. How many bugs occur because two programmers interpret the semantics of an API differently? How many bugs occur because the newest version of a library violates some implicit invariant of a former version?