Certificates/proof of work of type checking?

Type checking after inference is pretty straightforward and typically not too expensive, but suppose we have a low powered device that receives some mobile code. Is there some sort of even cheaper proof of work that type checking will pass, so we can avoid verifying the code's type safety?

I don't mean some cryptographic certificate requiring verification of the sender since that requires trusting the sender, but a shortcut to verifying some meaningful safety independent of any other parties. Proof-carrying code still requires verifying the code against the proof certificate, so I'm looking for something cheaper.

For instance, something like Tamper-Proof Annotations, by Construction, and the follow-up Making Mobile Code Both Safe and Efficient. Any other ideas out there?

Edit: the following seems like a good overview of various bytecode formats and their safety properties as of 2007, covering also the tamper-proof bytecodes above: Intermediate Representations of Mobile Code.


Probabilistically checkable proofs

A few years ago I ran across this page talking about probabilistically checkable proofs. I haven't read into it much, but I guess a system like this can verify a proof with high probability by analyzing only a bounded amount of data.
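Just to sketch the flavor of the spot-checking (this is plain property testing, not a real PCP, which requires re-encoding the proof so that any error shows up almost everywhere):

```python
# Toy spot-check: verify a claimed graph coloring by sampling a bounded
# number of random edges. If a constant fraction of edges is violated,
# each sample catches one with that probability, so confidence grows
# with the sample count while the work stays bounded.
import random

def spot_check_coloring(edges, coloring, samples=30):
    for _ in range(samples):
        u, v = random.choice(edges)
        if coloring[u] == coloring[v]:
            return False  # found a monochromatic edge: claim is bad
    return True  # no violation seen: accept with high probability

edges = [(0, 1), (1, 2), (0, 2)]
print(spot_check_coloring(edges, {0: "r", 1: "g", 2: "b"}))  # True
```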

I wonder if someone's already made a proof-carrying code system with this, or whether it could be combined with some of the other approaches you've mentioned.

Proof Checking

Proof checking is a lot easier than finding the proof. Finding the proof is a search of the problem space, to see if a sequence of axioms can be found to create a path from A to B. The path itself is simply a list of axioms to apply in order. As such, the complexity of proof checking is linear in the number of proof steps, irrespective of the complexity of finding the proof. So a system with 10 axioms may have a search space of size O(10^N) to find all proofs of length N or less, but checking any such proof is linear, O(N).
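As a sketch, with a made-up step format (not from any particular system):

```python
# Minimal proof checker sketch: a proof is a list of steps, each naming a
# rule and the earlier steps it uses. Checking just replays the list once,
# so it is O(len(proof)) regardless of how hard the proof was to find.
# The rule format here is hypothetical, purely for illustration.

def check_proof(rules, proof, goal):
    """Each step is (rule_name, premise_indices, conclusion)."""
    derived = []
    for rule_name, premise_indices, conclusion in proof:
        premises = [derived[i] for i in premise_indices]  # earlier steps only
        if not rules[rule_name](premises, conclusion):    # rule licenses step?
            return False
        derived.append(conclusion)
    return goal in derived

# Toy system: one axiom "a", one rule that appends "b".
rules = {
    "axiom_a": lambda ps, c: ps == [] and c == "a",
    "append_b": lambda ps, c: len(ps) == 1 and c == ps[0] + "b",
}
proof = [("axiom_a", [], "a"), ("append_b", [0], "ab"), ("append_b", [1], "abb")]
print(check_proof(rules, proof, "abb"))  # True, after one linear pass
```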

So all you need to do is include the proof with the code.

Sure, but tamper-proof code

Sure, but tamper-proof code lets you avoid even the O(n) cost, so that's no contest. Do you know of any other solutions like that?

An asymptotic bound?

If the code has size N, it seems impossible to do better than O(N) time -- you can't check the program without at least looking at all of it.

Given that, I wonder if, for any type system:
1. There is a proof encoding for any program that takes O(N) space.
2. There is a way to check the program and the proof in O(N) time.

Maybe someone knows a simple counter-example?

Edit: Hmm, I just realized that a system with a trivial type system (i.e. everything is well-typed) would be checkable in O(1) time. So it seems I haven't formulated my question correctly...

If the code has size N, it

If the code has size N, it seems impossible to do better than O(N) time -- you can't check the program without at least looking at all of it.

Right, but I'm not asking about checking the program, I'm actually specifically asking whether there's a way to *avoid* doing that.

Tamper-proof annotations encode the program in a form which prevents violating safety, so clients aren't paying any additional cost (I highly recommend the papers if you haven't read them!).

You could count bytecode decoding as the O(n) cost you mention, since you're touching every program instruction, but then typical VMs pay at least double that cost: decoding+verification touches each instruction at least twice, whereas tamper-proof code pays it only once.
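Roughly, this is the contrast I mean (the loader shapes here are hypothetical, my own):

```python
# Typical VM: one pass to decode, a second pass to verify, so every
# instruction is touched at least twice before it can run.
def load_then_verify(bytecode, decode, verify):
    program = [decode(b) for b in bytecode]   # pass 1: decode
    for instr in program:                     # pass 2: verify
        if not verify(instr):
            raise ValueError("unsafe instruction")
    return program

# Tamper-proof encoding: decoding alone yields a valid program, because
# unsafe instructions have no encoding at all. One pass total.
def load_tamper_proof(bytecode, decode_safe):
    return [decode_safe(b) for b in bytecode]
```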

So are there other techniques to only pay this cost once?

This seems boring

The example I understand of tamper-proof encoding is to use a minimal number of bits when referencing a variable, rather than just using an integer that might be invalid. This is a form of compression that I would expect to increase decoding complexity rather than decrease it. Sometimes compression is a net performance win, when the decreased memory pressure matters more than the extra decoding cycles, but the theory behind this seems rather uninteresting. From the subject line, I was expecting something like homomorphic encryption but for type checking. Am I missing the point?
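To make sure we're describing the same thing, here's the idea as I understand it, in a toy encoding of my own (the papers use a tighter format; the modulo fold below is just a stand-in for "every bit pattern decodes to something valid"):

```python
# A variable reference is a number in [0, n) where n is the count of
# variables in scope, packed into a minimal number of bits, so the
# format has no spare patterns for out-of-scope references.
import math

def encode_ref(index, n_in_scope):
    assert 0 <= index < n_in_scope
    bits = max(1, math.ceil(math.log2(n_in_scope)))
    return format(index, f"0{bits}b")       # minimal fixed-width encoding

def decode_ref(bitstring, n_in_scope):
    # Fold any leftover pattern back into range, so *every* bit string
    # decodes to a variable that really is in scope.
    return int(bitstring, 2) % n_in_scope

scope = ["x", "y", "z"]                     # 3 variables -> 2-bit refs
print(scope[decode_ref(encode_ref(2, 3), 3)])  # z
print(scope[decode_ref("11", 3)])  # tampered bits still name a real variable
```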

You are correct that

You are correct that compression isn't always a win if you just look at raw loading time, but if you want a safe binary format, then you have to consider loading+verification time, because the tamper-proof encoding does both simultaneously.

Making invalid variable references unrepresentable via a compression scheme is only one aspect of the encoding; you could still reference a variable of the wrong type. So the set of valid variables is partitioned by type, i.e. each variable indexes into its own independent "type plane".
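A toy sketch of the type-plane idea (the data layout is my own, not the papers'):

```python
# References are decoded relative to the list of in-scope variables *of
# the expected type*, so no bit pattern can name a variable of the wrong
# type; each type's variables form their own independent index space.
scope = {"int": ["lives", "score"], "str": ["name"]}

def decode_typed_ref(expected_type, index):
    plane = scope[expected_type]        # only variables of this type
    return plane[index % len(plane)]    # any index lands on a valid one

print(decode_typed_ref("int", 0))   # lives
print(decode_typed_ref("str", 5))   # name -- can't escape the str plane
```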

The whole "bytecode" is really a compressed AST of sorts, with what seems like a few tricks like the above, but the end result is that you get an near-optimal language-specific compressed bytecode in which invalid or unsafe programs, as defined by the language spec, cannot even be represented. Seems like an interesting result to me!

Another compelling outcome is that they can simultaneously ship tamper-proof client-side optimizations. For instance, they demonstrate escape analysis annotations shipped alongside the original program. All sorts of producer-side optimizations can be performed and shipped with tamper-proof binaries, so clients can have a relatively simple runtime that needs to perform virtually no optimizations of its own.

I was considering loading +

I was considering loading + verification. Is it really faster to maintain the data structures needed to come up with a minimal labeling of all identifiers in scope at each point than it is to verify the sparse encoding? Maybe sometimes? You can build tables to optimize the sparse encoding, too. Whether it's a win to maintain such look-up tables vs. just lazily checking things seems like it would depend on the code you're compiling, but I can't see why it would ever be a much faster approach.

Regarding the tamper-proof client-side optimizations, I haven't looked at that paper, but it seems like it would work just as well with a sparse encoding. You do need to pick an encoding that makes the scopes trivial to check, but I don't see why you need dense packing.

On the whole, the main useful property it seems to get you is that you can trivially verify that a chunk of bytecode is legal. Most of the time this doesn't seem like a very useful property, though, particularly since the method isn't addressing termination (I assume, since that's not a local check). So you can just declare that any variable reference that turns out to be illegal is interpreted as an infinite loop, and then you'll have basically the same property for the sparse encoding.
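Something like this, say (toy semantics of my own):

```python
# Sparse encoding with lazy checks: decode references without validating
# them up front, and *define* an out-of-range reference to mean
# divergence, so every bit pattern still denotes some legal program.
def lookup(scope, index):
    if 0 <= index < len(scope):
        return scope[index]
    while True:  # the decreed meaning of an illegal reference
        pass
```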

Very cool idea

of passing ready-made ASTs to a client interpreter. That makes an excellent foundation for detecting and reporting unsafe code, so users can decide what to do with it. Things should get much more complicated with bytecodes that sit closer to native code on the high-level/low-level scale.

As for speed, my JavaScript tests showed a 4x+ increase in parsing speed when comparing equivalent CFG and PEG grammars (the difference is purely internal, in the number of checked choices). My conclusion is that walking a ready-made, precompiled AST could speed things up another 3x+ (depending on the original number of choices), and it wouldn't matter whether it was originally PEG or CFG once that AST is parsed and saved on a server. For reference, a code string of about 1000 lines parsed with a PEG grammar in about 0.050 seconds on my PC, running in regular JavaScript. But I think you have to add client-side type-checking overhead to these numbers because of safety concerns (I wouldn't recommend storing precompiled safety checks on a server).

EDIT: But, of course, if you know where to look, it's easier to find things, so a little help from the server in the form of ready-made safety proofs would still help.

Not tamper proof?

So I can still change any reference to a variable of a given type into a reference to another variable of the same type. For example, I could change code that decrements the player's lives by one to instead decrement their score by one, making a big difference to the behaviour of the code.

This mechanism seems great for preventing privilege-escalation attacks on code by ruling out certain kinds of undefined behaviour, but it is not tamper proof.

What does tamper-proof mean?

So I can still change any reference to a variable of a given type into a reference to another variable of the same type. For example, I could change code that decrements the player's lives by one to instead decrement their score by one, making a big difference to the behaviour of the code.

A type check will also not detect this sort of change, and tamper-proof code only prevents tampering that your type checker can prevent. That is, tamper-proof code can only express the subset of legal programs in your language. That's what tamper-proof means here.
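For example (hypothetical player record; both functions are equally well-typed, so a type-based encoding necessarily represents both):

```python
def original(player):
    player["lives"] -= 1

def tampered(player):
    player["score"] -= 1  # same type (int), so equally representable
```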

Overstating their claims

Seeing as the common definition of "tamper proof" is "made so that it cannot be interfered with or changed", redefining it to mean that it does not actually prevent changes, only certain qualified changes, seems like a problem to me.

I think you're reading too

I think you're reading too much into my comments rather than the paper's claims. I call it tamper-proof code as a shorthand; they refer to tamper-proof annotations with a binary encoding that ensures only legal programs can be represented. A bit of a mouthful, hence the shorthand, but the annotations truly are tamper proof. I recommend the papers if you haven't read them.

Annotations

You are right; "tamper-proof annotations" sounds more reasonable.

O(N)

Edit: Hmm, I just realized that a system with a trivial type system (i.e. everything is well-typed) would be checkable in O(1) time. So it seems I haven't formulated my question correctly...

Indeed. More generally, the proof is a typing derivation (or annotated program) which in non-syntax-directed type systems might be much larger than the original source. Thus your "N" is not necessarily the size of the program as the user typed it.

It is also true that some type systems admit distinct formulations that are equivalent from the point of view of provability but differ in terms of the amount of information needed to reconstruct the derivation. I believe that bidirectional type checking gives concrete instances of such formulations when applied to rich type systems, but I'm no expert.
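For a concrete flavor, here is a minimal bidirectional checker in the standard textbook style (toy term and type syntax of my own): synthesis handles variables, applications, and annotated terms, while checking pushes a known type into lambdas, so annotations are only needed where synthesis would fail, which is far less information than a full derivation.

```python
# Bidirectional checking sketch: "infer" synthesizes a type, "check"
# verifies a term against an expected type. Terms and types are tuples.
def infer(ctx, term):
    tag = term[0]
    if tag == "var":                      # x : ctx[x]
        return ctx[term[1]]
    if tag == "ann":                      # (e : T) switches to checking mode
        _, e, ty = term
        check(ctx, e, ty)
        return ty
    if tag == "app":                      # infer the function, check the arg
        _, f, a = term
        fty = infer(ctx, f)
        assert fty[0] == "fun", "applying a non-function"
        check(ctx, a, fty[1])
        return fty[2]
    raise TypeError(f"cannot infer {tag}: annotation needed")

def check(ctx, term, ty):
    if term[0] == "lam":                  # lambdas need no annotation here
        assert ty[0] == "fun", "lambda checked against non-function type"
        _, x, body = term
        check({**ctx, x: ty[1]}, body, ty[2])
    else:
        assert infer(ctx, term) == ty, "type mismatch"

# (\x. x) : int -> int -- one annotation at the top, none inside.
ident = ("ann", ("lam", "x", ("var", "x")), ("fun", "int", "int"))
print(infer({}, ident))  # ('fun', 'int', 'int')
```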