SPREAD: Authenticated reusable computations

I want to know exactly what my software is doing. Better still, I want cryptographic proof that my software executes each and every computation step correctly.

Wouldn’t you?

Currently, I don’t know what Windows 10 is doing (or it is very hard to find out) and I hate that.

That’s because most of Windows 10:

- is compiled to machine code that bears no resemblance to the original source code,
- hides intricate networks of mutable objects after abstract interfaces,
- and destroys precious machine state after machine state.

Welcome to SPREAD!

In SPREAD, all data, machines states, code and computations are cryptographically authenticated.

To put it mildly, some practical issues had to be resolved to make SPREAD a reality. First and foremost, SPREAD almost completely eradicates mutable state.

Obviously, keeping all the state for authentication purposes would require enormous amounts of storage. SPREAD solves that issue by cryptographically ‘signing’ states incrementally, while still allowing full user-control over which ones need to be signed, and at what level of granularity.

Alternatively, full machine states can also be stored incrementally by SPREAD. In turn, this allows the pervasive re-use of state that was calculated earlier.

So SPREAD kinda acts like a spreadsheet. Because spreadsheets also re-use the previous states of a workbook to optimally recalculate the next.

Unlike SPREAD however, spreadsheets are incapable to keep all their versions around. And typically, Excel plug-ins completely destroy the (otherwise) purely functional nature of spreadsheets. In contrast, SPREAD only allows referentially transparent functions as primitives.

SPREAD builds on top of a recent breakthrough in cryptography called SeqHash. Unfortunately, SeqHash has been unnoticed by most. But hopefully this post will stir some renewed interest. To honour SeqHash, my extension has been called SplitHash:

SplitHash is an immutable, uniquely represented Sequence ADT (Authenticated Data Structure):

- Like SeqHashes, SplitHashes can be concatenated in O(log(n)).
- But SplitHash extends SeqHash by allowing Hashes to also be split in O(log(n)).
- It also solves SeqHash's issue with repeating nodes by applying RLE (Run Length Encoding) compression.
- And to improve cache coherence and memory bandwidth, SplitHashes can be optionally chunked into n-ary trees.

SplitHash is the first known History-Independent(HI) ADT that holds all these properties.

Sorry about the obvious self interest, but I'm just so excited about what I've created that I need to share this.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

A new language?

As I understood (correct me if I'm wrong), SPREAD is a new programming language?

If this is true, what tasks (except obvious undo implementation) could be made easier than in other programming languages? What other uses complete execution trace of some expression could have? Is SPREAD declarative or imperative programming language?

Enchilada woes

Currently it is a Scala DSL. I don't want to make the same mistake as with my other programming language Enchilada. With Enchilada I went completely over the top with an APL'ish postfix syntax (which I still prefer to infix by the way).

What makes SPREAD easier than other 'languages' is that you get incremental computation (almost) for free. Incrementability was something Enchilada always lacked. I also believe SPREAD subsumes reactive functional programming as a whole.

One issue with SPREAD is controlling what traces to fully keep, and which ones to only cryptographically sign. Traces that should be kept are the ones that have a high probability to be re-used (via weakly referenced memoization) in future computations. So programmers only need to guess which computations will probably be re-done on a daily basis, and then SPREAD takes care of the rest.

I've heavily updated my documentation the last past days. Hopefully it will all make more sense.

Thanks for your interest!

Reproducing computation

How do you validate the hash without recomputing? If you must recompute, why hash?

I have warmed up to the idea of eidetic systems, where we can trace or recompute old states. And I also like the idea of exponential decay of information, gradually forgetting old or intermediate states to enable operation with finite memory.

I'm doing something like this with Awelon project, via recording inputs directly in the codebase and using annotations to enable GC and compression of some code (by software agents aka bots). 'State' of an app becomes a cached computation over a series of inputs from an initial value, like command pattern. There are many spreadsheet-like characteristics.

I wonder if I might also apply seqhash in some useful way. For stable caching across simple method refactoring, perhaps, if not for validation. But it isn't obvious to me how I might do so.

You don't need to *complete*

You don't need to *completely* recompute to validate a hash (see also the Versum paper). You only have to redo a small part of a big computation and see if it that part matches up with the one previously calculated.

FYI, traces are SplitHashes which can be split anywhere you like, so you can quickly chop parts of the trace that are of interest (or diverting) and compare (their hashes).

Outsourcing computations

I think another major use case would be when you want to outsource your computations (see also the Versum paper).

Let's say there are 3 independent vendors, A,B and C, that are capable to run your big query. After they are done, you just compare the hashes of their results. If they are equal, you'll have more confidence that the results have not been tampered with.

You can also use SPREAD to reach distributed consensus, i.e. by merging different parallel branches and computations into a single final value (the master), similar to how we use GIT to reach 'distributed' consensus on our code base.

However, SPREAD does lack an important incentive measure (like Bitcoin) to actually reach distributed consensus, without any central authority (say: Linus Torvald).

With trusted peers of course, there is no issue. So replacing internal bank applications with SPREAD technology should be OK I guess.

The big deal breaker?

Now what, in essence, is the big deal breaker? The issue with standard cryptographic hashes is that they are not incremental. Here is an example. Let's say that you've just calculated the SHA256 hash of a big multi-terabyte file, which took you about 5 minutes. After that you change 1 byte in the middle of the file.

The question is: how fast can you recalculate the SHA256 hash of that slightly modified file?

The first known Sequence ADT that is capable to incrementally re-hash slightly modified data in O(log(n)) is called SplitHash. What's interesting about SplitHash is that it isn't build on top of a single cryptographic hash type (like GIT's SHA1), but on top of 'infinite' hashes.

What's wrong with using a

What's wrong with using a Merkle Tree?

Authenticated Data Structures

Every tree data structure can be trivially 'Merklized’, by annotating each tree node with a cryptographic hash of its child nodes.
But there are only a few data tree structures out there that have canonical, unique representations. You want unique representations, so that the order in which you construct a tree doesn’t change the final Merkle hash.

To my knowledge there are only 5 data structures that have (incremental) unique representations: Treaps, Skip-lists, (Bagwell’s Ideal) Hash Trees, SeqHashes and SplitHashes.

Canonical Sets and Maps can be build on top of Treaps, Skip-lists and Ideal Hash Trees, while a canonical Sequence can only be constructed on top of SeqHash and SplitHash.

So what about these ‘infinite’ hashes?

Even with Merkle trees, there is a small chance that two unequal tree nodes have a hash collision. Then, in the case of a collision, why not just take the left and right children and compare their hashes (recursively)? It's easy to see that the chance that the child-nodes also collide becomes exponentially smaller at each recursive step.

What I’m suggesting is to abstract that idea: instead of a fixed size hash, every object should provide an ‘infinite hash ‘stream’. In case of a Merkle tree, we first return a stream of the top-level hash until that hash is exhausted. After that, we descend to the child node, recombine their stream hashes differently, until they are exhausted, etc. etc.

Neat

I see what you mean. Does your data structure (split hash) require new cryptographic primitives, or does it build upon accepted ones?

One comment I have is that I wonder if this technology wouldn't be better implemented as providing an efficient representation for a pure functional semantics. That is, rather than working in a language where the semantics involve a bunch of hashes, define a functional semantics first and then allow the programmer to supply annotations that don't affect the functional semantics but instruct how to extract an efficient implementation.

Anyway, it's interesting. Please keep us updated on your progress!

purely functional - it has to be

I'm not sure about your reference to purely functional semantics.

If it wasn't clear already, SPREAD is a purely functional language, by design. It has to be. Nothing can be mutated, everything is immutable. The exception may be is the use of weakly referenced memoization, because WeakReferences can be mutated in place. But weak memoization is just an orthogonal 'optimisation', just like a database index. Note that strongly referenced memoization is fine.

About using new cryptographic primitives: we can still use them and reap all their benefits.

But we may not need to use them: I believe we could allow weaker forms of hashing (SPREAD uses a 64 bit SipHash). But although non-cryptographic hashes may be allowed, the following (non-cryptographic) requirements have still to be met for SPREAD to work:

1) the hash stream always generates the same deterministic sequence of pseudo random numbers for the same (purely functional) object.
2) and that the chance that two unequal objects produce the same hash stream up to a certain index, exponentially decreases when the streams are advanced to higher indices.

I'm no cryptographer, but I think together, those two requirements are weaker than the requirements of cryptographically strong hashes.
But if I turn out to be wrong, SPREAD could still fall back to cryptographic hashes - at the expense of a pretty hefty constant cost.

Nevermind

Regarding purely functional semantics -- I think what I was complaining about is just an artifact of your having produced this as a Scala DSL.