Towards the best collection traversal interface

This has been discussed before on LtU, but I don't think you can add new messages to threads in the "classic" section. I came across this presentation while I was contemplating ways to generalise a SAX-style interface to a parser I was writing and realised that I'd practically written a fold already. The API proposal is excellent and I think would make a good fit for some collections I've been thinking of writing for Tcl (I'd adapt the fold to be able to optionally make use of Tcl's event loop for doing traversals in the background, and use Tcl's break/continue/error exceptions rather than an explicit continue/don't continue indicator in the result). However, I thought I'd just check to see whether there is any new research on this that I should be aware of. Also, I'm struggling somewhat with the Haskell code demonstrating the enumeration->cursor transform without call/cc. Specifically, I'm wondering if it would be possible to translate the example into a language like Tcl which is strict and has relatively weak support for functional programming techniques? I think I could probably get my head around the Haskell enough to try and make a cheap+dirty imperative version, but any ideas on good ways to go about it would be useful.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Another article

Perhaps you might find the following message
http://www.eros-os.org/pipermail/e-lang/2004-March/009643.html

helpful. I have never met its author, btw. As to the interface, I have been using it for database access for more than three years, even in production work. It does seem to work.

Thanks

From that article and your Haskell, I've managed to develop a Tcl version. [Edit: Moved code to the Tcl wiki, and fixed many bugs...]

Well, it all seems to work for lists. However, the current version is terribly inefficient in Tcl. Now to see if I can optimise it, but it's going to take a lot of work to be a practical technique in Tcl (on my laptop the last comparison shows the stream/fold version takes 998.12 microseconds per iteration vs 6.38 for the standard lrange, i.e. ~156X longer).

Complaints

I disagree with the author of that paper on a lot of points.

For one thing, Ruby doesn't have Icon-like generators.

For another, to say that callbacks are faster than iterators, using Java System.arraycopy() as an example, is... well, ridiculous.

And there are a few things I just don't understand because I don't have the background. In Haskell, why not just return the desired list? Too slow? I'm extremely new to Haskell, and one of the things I've been enjoying is how generators can be expressed in Haskell simply as functions that return lists, lazily evaluated.

Re: complaints

For one thing, Ruby doesn't have Icon-like generators.

Perhaps not (I don't know Ruby). It seems Ruby uses enumerator functions much like those described in the article. This discussion even hints at a continuation based implementation of an enumerator->iterator conversion, which I guess is similar to the Scheme one.

For another, to say that callbacks are faster than iterators, using Java System.arraycopy() as an example, is... well, ridiculous.

Not callbacks, as such, but enumerator functions. The point of enumerator functions is that they are whole-collection operations which encapsulate a complete traversal. System.arraycopy() is an example of such a whole-collection enumerator function in that it encapsulates a traversal of an array performing a copy. Because of this encapsulation, it is therefore possible to employ a specialised efficient implementation (e.g. using memcpy()), without changing the interface.

However, I take your point that it doesn't really demonstrate why a fold is more efficient than an iterator, as it is hard to imagine how implementing arraycopy() in terms of fold() would be any faster than implementing it in terms of an iterator. Indeed arraycopy() is an example of having to traverse two collections at once (source and destination) which is tricky for a fold. In general, though, I do think folds provide more opportunities for optimisation than iterators.

There are other advantages to enumerator functions than just speed, such as generally working at a higher level of abstraction, and not exposing any state to clients. Streams/lazy sequences also share some of these benefits, and can be a good choice in certain cases. For instance, the problems with combining enumerations over multiple containers seem to translate into general problems with composition of enumerations, whereas with streams this is not the case. For instance, consider code like the following:

check [parse [lex $spec]] [parse [lex $source]]

If lex/parse etc all return streams (or iterators) then this is straight-forward enough. However, it is hard to see how to implement this as a fold. You could define a foldl2 for iterating over two collections of the same type, e.g. for lists:

foldl2 :: (a -> b -> c -> a) -> a -> [b] -> [c] -> a
foldl2 f z [] []         = z
foldl2 f z (x:xs) (y:ys) = foldl2 f (f z x y) xs ys

But then each collection has to define a foldl, foldl2, foldl3, etc up to some arbitrary limit (or a variadic version, if your language allows), and it still doesn't address what to do if the collections are of different types (or even of different lengths). So, there is still clearly an occassional need for an iterator-style interface (in the form of streams, preferably), but I'd still agree with the central point of the article that a left-fold enumerator function is the best default, especially if you can convert between such a function and a stream easily.

Re: complaints

jorend wrote:
to say that callbacks are faster than iterators, using Java System.arraycopy() as an example, is... well, ridiculous.
I think you may have taken the statement out of context. Here is the full context:
Cursors are simply less efficient. A cursor must maintain the state of the traversal. That state may be invalid. Each operation on a cursor, therefore, must first verify that the cursor is in the valid state. The cost of the checks quickly adds up.

This issue of the inefficiency of cursors is well known. For example, it is highly advisable to use "native" operations to move large sections of an array (ArrayCopy in Java) rather than copy element-by-element. The cost of an array bound check on each access to an element is one of the considerations.

Enumerators can do more than just eliminate redundant bound checks. Enumerators, if implemented as staged functions (e.g., C++ templates) can tile or unroll loops and thus partially or completely remove the traversal iteration overhead [Veldhuizen]. The perfect encapsulation of the traversal state makes enumerators particularly adaptable for multi-stage programming.

The cost of an array bound check does add up. Avoiding reifying the traversal state is beneficial. Enumerators make optimizations such as staging easier. Incidentally, could we avoid terms like 'iterator' in the present discussion? jorend wrote:
In Haskell, why not just return the desired list? Too slow?
First of all, if the elements of a collection are produced as the result of a computation in a strict monad (such as IO), we must compute all the elements first before we can return them as a list. For example, a database interface, to return the (lazy) list of selected rows, must first fetch all of the rows. All of the rows must be present in memory before their processing can begin.

Well, there is a ``workaround'', lazy IO, effected by hGetContents, which, in turn, relies on unsafeInterleaveIO. The latter is called unsafe for a reason. There are many messages on the Haskell mailing list about the pitfalls of lazy IO (e.g., handling of IO errors becomes impossible). See, for example,
http://users.aber.ac.uk/afc/stricthaskell.html#semiclosed
Lazy IO has also the problem of the resource management (timely disposal of allocated resources), which is discussed at length in the LL3 article.