Content Addressable Type Systems

Hello,

I have an interesting problem that this site's audience may have an opinion on.

I'm assuming that people know what content addressable storage is. Let's say it uses SHA1; we put some binary blob in it, get its hash back, which we can later use to retrieve the blob again.

Now imagine I have a primitive type system that looks something like this (pseudo-code):

    struct TypeInfo
    {
        string name;
        MemberInfo [] members;
    }

    struct MemberInfo
    {
        string name;
        Hash type;
    }

I can use the TypeInfo and MemberInfo structures to describe types, and insert them into the content addressable store. The type field in the MemberInfo structure contains the hash of the field type.

This all works fine and dandy, until a type refers to itself. The trivial case is a direct reference, for example:

    struct BinTreeNode
    {
        BinTreeNode left;
        BinTreeNode right;
    }

But one can imagine more complicated type relations where a type only refers to itself indirectly (i.e., Foo refers to Bar refers to Pep refers to Foo).

Suddenly, I can no longer build a type description using the content addressable store. After all, it means generating a SHA1 hash value for a binary blob that is supposed to contain that very hash itself.

Next thing I know, I'm looking at fixed point combinators and I'm trying to crack SHA1. Clearly that isn't going to work - not given the machine power and time available to me... :-)

Does anybody have some creative suggestions on how to solve this problem. So far, my work arounds include various combinations of losing type safety and/or introducing an extra level of indirection.

The former would involve dropping from a typed reference to an untyped one (a void* if you will). The latter would result in breaking the invariant of my content addressable store, namely that the SHA1 of each blob is the hash used to refer back to it.

Another interesting question is how to decide where in a type relation cycle it makes the most sense to insert such a work around. Theoretically, interpreting and inserting types can begin on any point on a cycle, so it seems arbitrary to pick the first time my parser detects a cycle; it would lead to non-deterministic type descriptions.

If the problem description is too vague, I can provide more detail if necessary. It's language agnostic.

I'm suspecting that there won't really be a clean solution; I'm looking for the least ugly work around.

Note that I'm aware we could generate hashes from just the type names instead of the full type description. But the whole point of the exercise is to capture a type's complete content in the system; this has numerous interesting benefits once types start evolving (in the area of database and schema upgrades).

Thanks,

Jaap Suter - http://jaapsuter.com

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

HashCaml

Look for "type normalization" in the HashCaml papers.

A note on recursive definitions

Does that part say that there is a special hash value that means "this is a recursive type"? I am not sure I understood.

A naive approach would be to generate, say 32 type hashes for the same type each at a different recursion level. But you would need to decide the maximum depth of recursion in advance. That would not fly for the tree example.

Thank You

That paper is very relevant. Skimmed the abstract, reading it properly now. Thanks for the reference!

rewrite to make the recursion explicit

Let's restate the problem another way, both to ensure that I understand it correctly, and to steer toward the answer I would like to suggest.

Let {foo} denote the SHA1 encoding of the string "foo". The notation can be nested. For example, if {} = "1234", then {{}} = {1234} = "4567", say.

We would like types to be named by the SHA1 of the structure they describe, for example the following Haskell datatype would be encoded as follows.
  data Bool = True | False
  data { {} | {} } = {} | {}

The problem occurs with recursive datatypes such as the following, because it's not clear which string we're trying to take the SHA1 of.
  data Nat = Zero | Succ Nat
  data { {} | ... } = {} | { {} | ... }
  data Foo = Foo (Bar, Bar)
  data Bar = NoBar | Foobar Foo
  data { (..., ...) } = ({ {} | ... }, { {} | ... })
  data { {} | ... } = {} | ({(..., ...)}, {(..., ...)})

What I suggest is rewriting types to use explicit recursion (called isorecursive types, if you need to google for more), as follows.
  data Nat = rec x. Zero | Succ x
  data {rec x. {} | x} =
        rec x. {} | x
  data Foo = Foo (Bar, Bar)
  data Bar = NoBar | Foobar Foo
  data {rec x. ({} | x, {} | x)} =
        rec x. ({} | x, {} | x)
  data {rec x. {} | (x, x)} =
        rec x. {} | (x, x)