Type evolution during construction

In a language with inheritance, the evolution of type during construction (and devolution during deconstruction, if appropriate) is a bit tricky. I was recently asked for pointers to papers or references on this, and came up dry on a Google search.

Can anyone suggest papers or references addressing this issue?

I'm not sure what exactly

I'm not sure what exactly you mean by evolution, but perhaps something like typestate is what you're after?

http://www.cs.cmu.edu/~aldrich/papers/classic/tse12-typestate.pdf

What is evolution during construction of the object

It's simple: when you create a new object with inheritance, the root class object, from which all inheritance starts, is constructed first. Then the next class in the inheritance hierarchy is constructed, and so on.

The problem is that it is sometimes inconvenient to create the object in exactly the order the language requires. Sometimes you need to initialize some fields _before_ the root is constructed. In C++ this is possible by calling your own method when supplying arguments to the root constructor.

As a result, the state of the object under construction is not formally defined, and cannot be defined at all in current languages. So perhaps some other way to specify the state of an object is needed. That would be valuable not only during construction but elsewhere too.

staging

The Grace and Dart guys were talking about this recently. In the worst case, control-dependent typing seems workable. Their solutions were more structured: they restrict the language of initializers and what can be done in a constructor (which may lead to programming an additional post-construction init method, where typestate becomes relevant). Staging also helps, as it provides better perf control to programmers.

A language I worked on over

A language I worked on over the summer at Adobe is (last I knew) going to have two-part constructors like you described. Use of 'this' is restricted in the first phase, to prevent leaking uninitialized state. The second phases happen in reverse order (following inheritance) and have access to 'this'.

Cool

Cool, does the system have a name?

I think the real question here isn't "how do I use types to describe canonical XYZ" but "what properties do I want for initialization" and from there, mechanism follows choice. Syntactic restrictions, APIs, types, etc. seem to be confounding the issue. For example, I believe Dart wanted the separation, in part, for reducing memory consumption by guaranteed-safe sharing across instances.

What system name?

The language or model for initialization?

3 stages

Virgil has three stages of construction for objects.

The first stage is the initializing expressions of fields for the class of the object being allocated. This code is outside of the constructor and thus cannot access the "this" object. It can only reference other initialized fields, in initialization order. It cannot call methods on the object (since it cannot access "this") and it cannot pass "this" to another method.

The second stage is the call to the super constructor's code. This stage can only contain expressions that similarly do not access "this".

The third stage, the code in the constructor body, has no restrictions. The "this" object is accessible, and by this point it is guaranteed that any fields declared with initializers have been initialized. Virtual dispatch, passing "this" to outside methods, etc., is totally safe now.

This three-stage approach guarantees that no code can ever access an uninitialized field, even if it is called virtually from the super class's constructor. I found that this works well in practice, and establishes the invariants that are expected when writing a field initializer, with the small cost of not being allowed to access "this" in those initializing expressions.

class X {
  def f = A;
  new(h: int) {
    B;
  }
}
class Y extends X {
  def g = C;
  new(h: int) super(D) {
    E;
  }
}
Y.new(0);

In the code above, the evaluation order would be C, D, A, B, E.

You can see a bit more here.

Staging

I ended up coming up with a staged approach for my hobby language Magpie too. I wrote a bit about it here. The basic gist was that I wanted:

  1. A window where you can do arbitrary calculation without having access to the object being constructed. That lets you calculate whatever you need to determine the state needed to actually create the object.
  2. A window where you can do arbitrary calculation and you do have access to the object. That lets you call methods on it, send it to other places, store it in caches, etc.

And, ideally, you want both of those encapsulated with the "constructor" call so that from the outside they just ask for a new object and get a fully-initialized one back.

Dart has this in two ways. First, you have C++-style initializer lists where you can evaluate expressions to initialize your fields. You don't have access to "this", and by the end of the initializer list, all final fields must be initialized. That ensures you can never see an uninitialized field. (I find it a bit ironic that Dart cares so much about this given that all variables are nullable in Dart anyway, so a default value is always available, but I digress.)

If "one expression per field" is too limiting, you also have "factory constructors". These look like constructors from the caller's perspective (new Foo()) but are basically static methods that do not implicitly create a new instance. Instead, you must do that yourself.

That lets you write a method that does whatever calculation up front, then creates an instance using the "real" constructor for the class and returns that.

Compcert C++ work by Tahina Ramananandro

You will be interested in the following work, done during the PhD thesis of Tahina Ramananandro about Mechanized Formal Semantics and Verified Compilation for C++ Objects.

A Mechanized Semantics for C++ Object Construction and Destruction, with Applications to Resource Management,
Tahina Ramananandro, Gabriel Dos Reis and Xavier Leroy, 2012.

We present a formal operational semantics and its Coq mechanization for the C++ object model, featuring object construction and destruction, shared and repeated multiple inheritance, and virtual function call dispatch. These are key C++ language features for high-level system programming, in particular for predictable and reliable resource management. This paper is the first to present a formal mechanized account of the metatheory of construction and destruction in C++, and applications to popular programming techniques such as “resource acquisition is initialization.” We also report on irregularities and apparent contradictions in the ISO C++03 and C++11 standards.

Along these lines

Yes. This is probably the reference I was trying to remember.

More specifically, I'm after something that deals with the question "What is the type of a Java/C++ object while we are in the middle of constructor execution?" It isn't like constructors in ML or Haskell; these constructors perform computation. In one sense, the type of a new instance isn't stable until its constructor has finished. But in another sense, the constructor body is free to perform operations that source or target fields of the object that is currently under construction. From a type perspective, it seems rather a bad mess.

I vaguely recall that there were a couple of papers trying to deal with issues of this nature. The CompCert work was certainly one of them. I should probably pull the paper and look at the citations.

Take a look at...

...Qi and Myers' Masked Types for Sound Object Initialization, which deals with precisely this issue.

IIRC, their approach specializes techniques from typestate-based approaches to object initialization, with the usual tradeoff (less expressiveness versus less overhead). For a full-blown typestate system for Java, see Bierhoff et al.'s work on Plural.

Virgil's model

If you are designing a new initialization model, rather than trying to verify code using an existing one, I would highly recommend you take a look at Virgil's initialization rules. They are actually quite simple, give you good guarantees, are trivially statically checkable, and practical.

The paper I cited above

The paper I cited above tackles issues that do not appear in your (interesting) description of Virgil's model in the above thread. For example, in the presence of multiple inheritance, you have several super constructors to call. The type of the object "evolves" between these calls in the sense that method dispatch tries to find the most precise implementation that has already been initialized. Finally, in the presence of virtual inheritance (two different subobjects A and B share a common subobject X, instead of having two different copies), these super constructor calls cannot be understood independently of each other, as building A first will affect B's construction in that B.X will already be constructed. Of course, there are unreasonable^Winteresting questions to be asked about the influence of initialization order, etc.

(Of course you can also question the cost/benefit ratio of multiple and virtual inheritance, and prefer simpler object systems that allow simpler initialization models. But if the question is to understand the subtleties of object interface evolution during initialization, I suspect studying a language with these rich features will provide you with stronger insights and force your design into less-comfortable, yet useful, areas.)

Overcomplexifying...maybe

Sure, Jonathan was after the papers you cited, but it was my supposition based on his prior work on BitC that he was interested in new design alternatives. Those references represent advancements in applying a lot of powerful static analyses to what I consider an unfortunate, avoidable problem. In that vein I think that the contortions one has to go through in applying advanced type systems and analyses to make some sense of the disaster that is C++ should at best serve as a warning post to others. I think venturing further down those paths isn't likely to be as useful as it may seem, and simpler solutions are already available. Virgil may not be particularly interesting in that academic sense, but I thought I'd at least mention it.

no paper refs, just some personal mental models ...

It depends on language definition since different languages can promise varying styles of contracts for semantics. For example, whether constructors exist per se depends on runtime model. Smalltalk and C++ both have inheritance, but are quite different. (My info might be stale, based mainly on a grasp of Smalltalk convention formed by reading MacSmalltalk source in the early 90's. Things change.) Ruby might work just like Smalltalk, but I haven't checked.

When interviewing C++ candidates, if they clearly know what a virtual method is, a standard question used to be, "What happens when you call a virtual method during a constructor?" An informal C++ story says the vtable pointer is updated at the start of each constructor and destructor, so during a base class's constructor virtual dispatch only reaches that class's methods, not overrides in more-derived classes, which means you usually don't want polymorphic behavior there. Practically speaking, you're only an instance of your type after construction, and are no longer one once destruction starts. Both construction and destruction amount to liminal states where class invariants are not valid, which is partly why throwing exceptions is a no-no during construction and destruction, since deciding what to clean up is not well-defined.

Because order of executing constructors and destructors is implicit, C++ must define a model explaining the order. If a language doesn't call methods implicitly, such a model need not be defined. For example, Smalltalk has no distinguished constructors, nor any implicit calls. So inherited constructors are only called if you do so explicitly in a subclass. A newly instantiated Smalltalk instance becomes the most derived subclass instantly before the first method to init is dispatched, so it doesn't evolve from superclass to subclass, unless this is simulated manually by a library writer. In effect, new instances start with no internal state, with all members defaulting to a standard value, probably nil.

For each language one might ask, "What makes you an instance of your type, and when does that happen?" In C-based languages like C++ it is easy to go through an extended gray zone where type is ambiguous, starting from random bits at the start of construction. (Actually, random would be safer than what happens in practice, where you're likely to start out aliasing pointers to things it would be disastrous to write through if you left fields uninitialized.) In higher-level languages like Smalltalk, the programming model looks like atomically starting as an instance of a type with default fields.

Lazy initialization is commonplace and considered good practice with complex types, because a smaller window of ambiguity is safer and easier to grasp. When objects are mutable, good style aims to become an instance that can handle normal calls as soon as possible, even if this delays building complex state until later. (Immutable code style would make this problematic.) That makes reasoning easier for a dev, but maybe harder for a static analysis tool, because a tool would have to infer the implied change in "type" caused by internal state change. You can think of mutable objects as finite state machines whose exact type changes depending on state inside.

I think in terms of spatial metaphors a lot, so I always have a picture in mind to explain type evolution, which roughly looks like passing over boundaries of circles in Venn diagrams. Construction enters a type circle, while destruction exits. Type ambiguity caused by non-atomic changes in type invariants is a circle with thickness you can be inside when crossing it. Nesting of such circles usually looks just like the nesting of stack-based synchronous function calls. Typically resource allocation and deallocation corresponds to circle boundary crossing, so auditing resource management involves ensuring all boundary crossing is net zero over long time spans. (Yes, when I look for leaks, I'm checking this picture for a way to violate orderly boundary crossing.)

Every time you acquire a resource, a boundary is crossed. So I actually think of each type visible at the programming-language level as a collection of types corresponding to all the ways an object can acquire and hold a resource. Proper resource management requires net zero boundary crossings in the long run. Under threads, native or green, if some resources need mutual exclusion, then acquiring a lock is the boundary delimiting a critical section. It's easy to see code that's not thread safe, because more than one thread can get into the same critical-section circle at the same time.

Some boundaries you can only cross once, like locking a mutex, and this might be reflected in class api so different methods are called depending on whether a boundary has yet been crossed. (Getting a static analysis tool to understand this would be interesting, which is why I thought it bears mentioning.) You can think of constructors as the api subset to handle crossing the boundary to acquire an object's type, while other api subsets handle crossing other boundaries. I'll stop here, since if you were going to get a new idea from this, it would have happened by now.

Typestate and Linear Types

cf. Plaid language

Typestate makes it easy to track evolving state, and linear types can ensure that construction protocols run to completion.