Let's fix spreadsheets

I wonder: why is it that spreadsheets appear to be ignored by real programmers?

Is it because spreadsheets are the equivalent of spaghetti code? Is it because spreadsheets are hacks, put together by non-savvy 'business' people?

The horror!

Yes, I admit. I also very much believe all this to be true. Spreadsheets are sloppy programs. Hence, we programmers don't want to bothered by them. At the same time, research shows that almost 5% of the spreadsheets contain serious errors, causing multi-million dollars of losses.

That figure should freak you out, if you happen to own a bank account. I know firsthand that our banks still run some of their critical businesses on broken spreadsheets. The same spreadsheets we programmers tend to ignore.

Sure, there are attempts to lift spreadsheets into the realm of functional programming, with Reactive Functional Programming (FRP) as the most recent development. But FRP doesn't feel right. As a programmer? May be yes. But I don't think the 'poor' people will understand. FRP is just to distant from their beloved spreadsheets.

Can we educate people to learn FRP? I don't believe we should. What I believe is that we must teach them appropriate 'programmer values':

Modularity
Expressivity
Repeatability
Testability
Control
Performance
Versioning
Continuous builds
Production runs
No crashes please

None of these can currently be attributed to legacy spreadsheets. Ultimately, that's why I think legacy spreadsheets are bound to be a failure. We need more programming, without getting in the way of 'getting things done'.

So that's our challenge: to enhance spreadsheets, so that programmers and business people will meet each other in cells. It is obvious that I've accepted this challenge: to develop a new powerful spreadsheet paradigm. A paradigm for everyone to like.

Edit: a small edit

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

RDP

I'm working on a programming model that could readily become the next generation of spreadsheets. It is called Reactive Demand Programming. It has received some recent discussion here, and one consequence of that discussion had me recently write of its relationship to spreadsheets in my implementation Readme.

RDP can address most of your concerns: modularity, open composition, integration with external resources (databases, wikis, search engines, art assets, etc.), potentially even of streaming data (e.g. video). RDP is designed to support live programming of real-time open distributed systems. RDP does address security (via object capability patterns) which can prevent many bugs. RDP is orthogonal to type safety, but compatible with it.

Open composition is addressed by supporting bidirectional dataflow - which allows publishing into a shared space - and elimination of local state and even of internal stateless resource concepts. The latter ensures all spaces are sharable.

(I also believe that FRP "doesn't feel right", largely because it is unsuitable for open composition.)

To address type-safety in spreadsheets, I would make a representation of type part of the identifier for each cell. (I use similar techniques for resource locators in Sirea, and will do the same for external state.) Schema separation (similar to SQL) is feasible, but I would consider instead pushing the schema to the

For versioning of spreadsheets, I would strongly consider abandoning explicit versioning in favor of implicit versioning on the time axis. For human-modified spreadsheets, it's actually quite feasible to keep every version all the way back to the blank spreadsheet (leveraging diffs and checkpoints), but if we want to operate in bounded space we can use an exponential decay of history model.

I developed RDP not for spreadsheets, but was inspired considerably from an idea of zoomable, mashable live documents and command & control HCIs. Use of stateless stable models can replace many common uses of state and would, I think, be extremely useful for planning and control systems, for artful software and configurations.

With a little extra support for widgets to influence local evaluation and ultimately external resources (stateless and stateful) spreadsheets can become a very powerful, general basis for application programming and live, interactive documents. No local state loops are needed. Without local state, cooperative work becomes much closer to default.

Key operational point in RDP: ubiquity of protocol timeouts

Any scalable system has a scalable strategy for dealing with breakage. Whole-system versioning isn't all that scalable, because it requires a lot of duplication and communication to distribute consistent versions to everyone. The bigger the system gets, the greater the communications burden per node gets to distribute a version. And usually at precisely the moments when communications are problematic if version rollbacks are an error strategy.

The strategy with RDP is that remote resources, protocols, etc, don't simply lock. Or at least, not for long. If a protocol is interrupted in the middle, the server just drops it; the "demand" represented by the other party's use of the protocol goes away and the server, after a short wait, just frees up the resources for use on something else.

What this means, if I understand it, is that "consistency" is local consistency. You've control of local resources and local data and locally you know what they were at any point in time. Someone who observes the system from a different locality may see events having happened in a different order due to speed-of-light delays or choking routers with a backlog delay or whatever, but they also have a consistent local picture of the local resources that they control.

Is this a "correct" understanding?

Consistency in RDP

Disruption in a distributed system is generally non-deterministic. But that doesn't mean we need to abandon every useful consistency property: if distributed parties can agree on the logical instant of disruption, they can achieve a consistent view of the disruption - a "breaks cleanly" guarantee. In RDP, the logical instant would be continuously negotiated by the heartbeat protocol (on a per-session basis).

For performance and prediction, it is useful to speculatively evaluate well ahead of the negotiated instant. Potentially, that requires rolling back a little and performing some rework after disruption. RDP is structured in a manner that ensures the rework necessary can be precisely computed. RDP's temporal semantics are inspired from time warp protocols - a technique that harks back to 1985.

RDP cannot guarantee strong global consistency. But many forms of weaker global consistency are easily achieved: snapshot consistency between threads or processes (which protects against all malign glitches) is easily achieved by batching updates; eventual consistency for straggling updates (albeit with only a small window before negotiated disruption) is easily achieved for state resources. (I don't bother with long-term eventual consistency because it's useless beyond a certain point.)

Someone who observes the system from a different locality may see events having happened in a different order

There are no events in RDP. You could observe states changing in different order, if you are different latencies from them. However, latency properties are expressed formally and logically in RDP, by use of a `bdelay` behavior to model all those speed-of-light delays. A choking router, if it has any observable impact, must be formally modeled in RDP by use of an external stateful resource. Local computations achieve a consistent view of not only how local resources appear this instant, but also a consistent view of how remote resources appeared some (formal, logical, compositional) number of nanoseconds ago.

Preferably the `bdelay` is on the conservative side so you end up buffering future updates (instead of suffering straggling updates). However, RDP can tolerate an occasional bdelay underestimate via the small eventual consistency window. Either way, varying latency in the system does not introduce non-determinism or global inconsistency. It is not difficult to reason about how different observer-code will see the same states.

With latency modeled logically, disruption is the only cause of inconsistency in a distributed RDP system. If latency varies outside the tolerance windows, that is modeled as disruption. And I described above how disruption is handled.

It would not be difficult to develop and enforce a sort of temporal type-safety to guarantee latency-valid distributed programs.

(Aside: Logical synchronization - achieved by matching logical latencies - has several wonderful performance and robustness advantages compared to the more traditional stateful (lock, semaphore, barrier) synchronization.)

Prior work

One thing you definitely need to look at is Lotus Improv (http://en.wikipedia.org/wiki/Lotus_Improv). In some ways it's what a spreadsheet would look like if it were redesigned by a computer scientist from 1992: first-class axis definitions completely decoupled from presentation, clever abstraction and aggregation primitives (including APL-like auto-lifting), logical indexing (including semantically relative indexing) instead of fixed navigation, and a bunch of other good stuff that I don't recall right now. It's pretty much exactly what you would want from a spreadsheet as a grid computation engine, as opposed to a half-assed database and report format, which is the spreadsheet use-case that actually won here in our fallen world.

Existing Efforts

One might also look at David Pollak's visi.io, which obtained some interest in the recent StrangeLoop.

more past work

I would echo Dave's comment to look at past work (there's been a lot), both commercial and research. Lotus Improv is the big example, as he mentioned. Many years ago I asked a friend who worked at Lotus about its reception, and his comment was that in general it worked very well, but when one wanted to create an exception for the general formula for a particular cell in an array, it was very difficult.

Other things to look at (Google for more references that what I give below):

Trapeze, which did away with a fixed grid and allowed you to draw arrays and give each a formula. At the time I thought of it as visual APL.

Spreadsheet 2000, similar to Trapeze, with nesting.

Quantrix, an existing product that continues and expands on the ideas of Improv.

This topic has been discussed several times before on Lambda-the-Ultimate, including this thread, which mentions some of the same things as above.

I did some research work (not really published) on this around 1989-1992, mostly trying to build an automated back-of-the-envelope calculator for casual calculations. You could enter values and formulas on a blank sheet, and rearrange them freely. I was particularly interested in making the formulas visible at all times. I think hidden formulas is one of the big issues with spreadsheets. The lack of visibility and the usual way of seeing only one formula at a time promotes errors. The other big problem is the awkwardness of assigning meaningful names to cells and blocks of cells.

You might also look up the research work of Bonnie Nardi, done around the same.

Nice

That's beautiful. For many years I've had a vague fluffy idea of how I would like to redesign a spreadsheet, after reading through the links I can see that it has been done and would be a clone of Trapeze. Halfway between a spreadsheet and a live programming language.

The spread data structure

I've just finished a blog post that describes ` new data structure that I've invented. I believe it to be novel but may be there is some prior art that I'm not aware off. Anyway, without this data structure it would be impossible to implement SPREAD efficiently.

The Spread data structure.

A Spread is a two-way association data structure with many application areas. I believe it will revolutionise the way we create, associate, distribute, compute and (redundantly) copy data, across all our applications.

[meta] broken link

I think you meant this link.

Fixed.

Fixed.

Relations between relations?

This has some concepts in common with Cairntaker, an untested, unoptimized, and generally neglected garbage collector I've built on top of JavaScript. It supports mutable and immutable ephemeron tables, if it works. I haven't heard of the term "uniquely represented," but I wanted a table entry to be reachable if its key was an immutable table that could still be constructed, so my implemenation interned all immutable tables based on the key and value pointers they contained.

Whereas I was trying to build data out of finite functions between finite functions (maps of maps), it looks like you're trying to build data out of finite relations between finite relations (multimaps of multimaps). It's an interesting prospect, probably more interesting than the interface you use for it (at least to me). Do you have an example of what you're trying to accomplish with it?

finite binary relations

Yes, you are right! A finite binary relation between relation. I just found out that the Spread data structure is 'just' a finite binary relation.

Even more powerful are ternary relations, although they can be represented with binary relations. Triple stores are big ternary relations but they are mutable - and they are but one level deep.

exponential computing

One major use case is the memoization of computations going two ways. I have a hunch that Spreads will enable exponential computing out of the box: hash-life is the canonical example that such thing is possible.

So you have a Spread that 'multimaps' Expressions to Expressions. It essentially maps unreduced expressions (programs) to their non-reducable versions (values). So you can have 3+2 that maps to 5, but also go 5 back 3+2.

If you do that for all sub-expressions recursively, you can save a lot of computational work.

But how?

This has been on my mind for a while. You can’t “fix” spreadsheets without understanding what they’re used for, and providing a better alternative for those specific use cases. No point in coming up with something “better” from an ivory-tower point of view, only to discover that it’s not what’s needed.

I’ve casually interviewed a number of people—programmers and non-programmers in equal measure—about their typical spreadsheet habits, and especially those of their family members. According to my totally unscientific research, the commonest (ab)uses are:

  • Sorting things. In ways.
  • Storing, and performing complex queries on, tabular data.
  • Making graphs and charts.
  • Getting tab-delimited or columnar data into CSV for munging by other software.

I don’t see applicative functional programming as necessarily suited to these tasks—the nature of the system is dataflow. The most significant limitation is that everything is first-order: constructing queries is often difficult because spreadsheet functions are generally not parameterised by other functions, even though this would present no intellectual challenge to the people I interviewed—I asked! Even something as conceptually simple as a relational join is needlessly complex in spreadsheet-land—and yes, people do use Excel as a relational database.

To my mind, a solution (though maybe not the) is the concatenative paradigm, where dataflow programming and higher-order functional programming can peacefully coexist. Of course I would say that. Still, I don’t think it’s a challenge of programming language design, but one of model and user interface design instead.

"Near miss" use-cases are

"Near miss" use-cases are what I tend to target - e.g. the problems where people try spreadsheets but get frustrated with them. For those cases, the question isn't "what are they used for?" but rather "why aren't they used here?" If I can improve spreadsheets for today's successful use-cases too, that's a nice bonus. I do believe higher-order spreadsheets are worthy of exploration, i.e. as a means of expressing functions and expanding and displaying intermediate computations.

PL and UI designs tend to be tightly interwoven and affect one another in deep ways. A PL viewpoint is useful to understand and address the problems of spreadsheets.

Martin Erwig

On his publications page Martin has papers related to spreadsheet type systems and evolution. Actually he has an intimidating number of papers about a lot of things.

pot of gold

Many thanks for the reference - that's a pot of gold!