How OCaml type checker works -- or what polymorphism and garbage collection have in common
There is more to Hindley-Milner type inference than the Algorithm W. In 1988, Didier Rémy was looking to speed up the type inference in Caml and discovered an elegant method of type generalization. Not only it is fast, avoiding the scan of the type environment. It smoothly extends to catching of locally-declared types about to escape, to type-checking of universals and existentials, and to implementing MLF.
Alas, both the algorithm and its implementation in the OCaml type checker are little known and little documented. This page is to explain and popularize Rémy's algorithm, and to decipher a part of the OCaml type checker. The page also aims to preserve the history of Rémy's algorithm.
The attraction of the algorithm is its insight into type generalization as dependency tracking -- the same sort of tracking used in automated memory management such as regions and generational garbage collection. Generalization can be viewed as finding dominators in the type-annotated abstract syntax tree with edges for shared types. Fluet and Morrisett's type system for regions and MetaOCaml environment classifiers use the generalization of a type variable as a criterion of region containment. Uncannily, Rémy's algorithm views the region containment as a test if a type variable is generalizable.
As usual with Oleg, there's a lot going on here. Personally, I see parallels with "lambda with letrec" and "call-by-push-value," although making the connection with the latter takes some squinting through some of Levy's work other than his CBPV thesis. Study this to understand OCaml type inference and/or MLF, or for insights into region typing, or, as the title suggests, for suggestive analogies between polymorphism and garbage collection.
Doug Gregor of Apple presented a talk on "A module system for the C family" at the 2012 LLVM Developers' Meeting.
The C preprocessor has long been a source of problems for programmers and tools alike. Programmers must contend with widespread macro pollution and include-ordering problems due to ill-behaved headers. Developers habitually employ various preprocessor workarounds, such as LONG_MACRO_PREFIXES, include guards, and the occasional #undef of a library macro to mitigate these problems. Tools, on the other hand, must cope with the inherent scalability problems associated with parsing the same headers repeatedly, because each different preprocessing context could effect how a header is interpreted---even though the programmer rarely wants it. Modules seeks to solve this problem by isolating the interface of a particular library and compiling it (once) into an efficient, serialized representation that can be efficiently imported whenever that library is used, improving both the programmer's experience and the scalability of the compilation process.
Slide[PDF] and Video[MP4]
Slides and videos from other presentations from the meeting are also available.
Video: Records, sums, cases, and exceptions: Row-polymorphism at work, Matthias Blume.
I will present the design of a programming language (called MLPolyR) whose type system makes significant use of row polymorphism (Rémy, 1991). MLPolyR (Blume et al. 2006) is a dialect of ML and provides extensible records as well as their exact dual, polymorphic sums with extensible first-class cases.
Found this to be an enjoyable and thorough overview of MLPolyR, a language created for a PL course that goes all-out on various dimensions of row polymorphism, resulting in a small yet powerful language. (previously)
How to Make Ad Hoc Proof Automation Less Ad Hoc
Georges Gonthier, Beta Ziliani, Aleksandar Nanevski, and Derek Dreyer, to appear in ICFP 2011
Most interactive theorem provers provide support for some form of user-customizable proof automation. In a number of popular systems, such as Coq and Isabelle, this automation is achieved primarily through tactics, which are programmed in a separate language from that of the prover’s base logic. While tactics are clearly useful in practice, they can be difficult to maintain and compose because, unlike lemmas, their behavior cannot be specified within the expressive type system of the prover itself.
We propose a novel approach to proof automation in Coq that allows the user to specify the behavior of custom automated routines in terms of Coq’s own type system. Our approach involves a sophisticated application of Coq’s canonical structures, which generalize Haskell type classes and facilitate a flexible style of dependently-typed logic programming. Specifically, just as Haskell type classes are used to infer the canonical implementation of an overloaded term at a given type, canonical structures can be used to infer the canonical proof of an overloaded lemma for a given instantiation of its parameters. We present a series of design patterns for canonical structure programming that enable one to carefully and predictably coax Coq’s type inference engine into triggering the execution of user-supplied algorithms during unification, and we illustrate these patterns through several realistic examples drawn from Hoare Type Theory. We assume no prior knowledge of Coq and describe the relevant aspects of Coq type inference from first principles.
If you've ever toyed with Coq but run into the difficulties that many encounter in trying to construct robust, comprehensible proof scripts using tactics, which manipulate the proof state and can leave you with the "ground" of the proof rather than the "figure," if you will, in addition to being fragile in the face of change, you may wish to give this a read. It frankly never would have occurred to me to try to turn Ltac scripts into lemmas at all. This is much more appealing than most other approaches to the subject I've seen.
Validating LR(1) parsers
An LR(1) parser is a finite-state automaton, equipped with a stack, which uses a combination of its current state and one lookahead symbol in order to determine which action to perform next. We present a validator which, when applied to a context-free grammar G and an automaton A, checks that A and G agree. Validating the parser provides the correctness guarantees required by verified compilers and other high-assurance software that involves parsing. The validation process is independent of which technique was used to construct A. The validator is implemented and proved correct using the Coq proof assistant. As an application, we build a formally-verified parser for the C99 language.
I've always been somewhat frustrated, while studying verified compiler technology, that the scope of the effort has generally been limited to ensuring that the AST and the generated code mean the same thing, as important as that obviously is. Not enough attention has been paid, IMHO, to other compiler phases. Parsing: The Solved Problem That Isn't does a good job illuminating some of the conceptual issues that arise in attempting to take parsers seriously as functions that we would like to compose etc. while maintaining some set of properties that hold of the individuals. Perhaps this work can shed some light on possible solutions to some of those issues, in addition to being worthwhile in its own right. Note the pleasing presence of an actual implementation that's been used on the parser of a real-world language, C99.
Tool Demo: Scala-Virtualized
This paper describes Scala-Virtualized, which extends the Scala language and compiler with a small number of features that enable combining the beneﬁts of shallow and deep embeddings of DSLs. We demonstrate our approach by showing how to embed three different domain-speciﬁc languages in Scala. Moreover, we summarize how others have been using our extended compiler in their own research and teaching. Supporting artifacts of our tool include web-based tutorials, nightly builds, and an Eclipse update site hosting an up-to-date version of the Scala IDE for Eclipse based on the Virtualized Scala compiler and standard library.
Scala has always had a quite good EDSL story thanks to implicits, dot- and paren-inference, and methods-as-operators. Lately there are proposals to provide it with both macros-in-the-camlp4-sense and support for multi-stage programming. This paper goes into some depth on the foundations of the latter subject.
Adding Delimited and Composable Control to a Production Programming Environment (add'l material), Matthew Flatt, Gang Yu, Robert Bruce Findler, Matthias Felleisen, ICFP 2007.
Operators for delimiting control and for capturing composable continuations litter the landscape of theoretical programming language research. Numerous papers explain their advantages, how the operators explain each other (or don’t), and other aspects of the operators’ existence. Production programming languages, however, do not support these operators, partly because their relationship to existing and demonstrably useful constructs—such as exceptions and dynamic binding—remains relatively unexplored. In this paper, we report on our effort of translating the theory of delimited and composable control into a viable implementation for a production system. The report shows how this effort involved a substantial design element, including work with a formal model, as well as significant practical exploration and engineering. The resulting version of PLT Scheme incorporates the expressive combination of delimited and composable control alongside dynamic-wind, dynamic binding, and exception handling. None of the additional operators subvert the intended benefits of existing control operators, so that programmers can freely mix and match control operators.
Another tour de force by the PLT folks. Does your language have delimited control, delimited dynamic binding, and exceptions? It's the new gold standard, and so far only Racket and O'Caml qualify (and maybe Haskell and Scala?)
Racket's implementation is additionally interesting because it achieves backwards compatibility with code written using undelimited call/cc and dynamic-wind. The authors mention that a simpler solution would be possible without this compatibility - based on control filters from the Subcontinuations paper.
The Deca programming language is "a language designed to provide the advanced features of sophisticated, high-level programming languages while still programming as close as possible to the bare metal. It brings in the functional, object-oriented, and generic programming paradigms without requiring a garbage collector or a threading system, so programmers really only pay in performance for the features they use." The latter link provides a list of features that Deca does, will, and won't provide. Features provided include type inference, universally- and existentially- quantified types, and "a strong region-and-effect system that prohibits unsafe escaping pointers and double-free errors".
The Deca language and ideas behind it are documented in a thesis, The design and implementation of a modern systems programming language (PDF):
Low-level systems programming has remained one of the most consistently difficult tasks in software engineering, since systems programmers must routinely deal with details that programming-language and systems researchers have preferred to abstract away. At least partially, the difficulty arises from not applying the state of the art in programming-languages research to systems programming. I therefore describe the design and implementation of Deca, a systems language based on modern PL principles. Deca makes use of decades in programming-languages research, particularly drawing from the state of the art in functional programming, type systems, extensible data-types and subroutines, modularity, and systems programming-languages research. I describe Deca's feature-set, examine the relevant literature, explain design decisions, and give some of the implementation details for Deca language features. I have been writing a compiler for Deca to translate it into machine code, and I describe the overall architecture of this compiler and some of its details.
The source code for the Deca compiler, decac, is available here. The compiler is implemented in Scala and generates LLVM bytecode. (The author points out in the comments below that this implementation is a work in progress.)
The author of Deca is LtU member Eli Gottlieb, who back in 2008 posted in the forum asking for feedback on his language: Practical Bits of Making a Compiler for a New Language.
There's some more discussion of Deca over at Hacker News.
Software Engineering with Formal Methods: The Development of a Storm Surge Barrier Control System - Seven Myths of Formal Methods Revisited (2001), by Jan Tretmans, Klaas Wijbrans, Michel Chaudron:
Bos is the software system which controls and operates the storm surge barrier in the Nieuwe Waterweg near Rotterdam. It is a complex, safety-critical system of average size, which was developed by CMG Den Haag B.V., commissioned by Rijkswaterstaat (RWS) – the Dutch Ministry of Transport, Public Works and Water Management. It was completed in October 1998 on time and within budget.
CMG used formal methods in the development of the Bos software. This paper discusses the experiences obtained from their use. Some people claim that the use of formal methods helps in developing correct and reliable software, others claim that formal methods are useless and unworkable. Some of these claims have almost become myths. A number of these myths are described and discussed in a famous article: Seven Myths of Formal Methods [Hal90]. The experiences obtained from using formal methods for the development of Bos will be discussed on the basis of this article. We will discuss to what extent these myths are true for the Bos project.
The data for this survey were collected by means of interviews with software engineers working on the Bos project. These include the project manager, designers, implementers and testers, people who participated from the beginning in 1995 until the end in 1998 as well as engineers who only participated in the implementation phase, and engineers with and without previous, large-scale software engineering experience.
This paper concentrates on the experiences of the software engineers with formal methods. These experiences, placed in the context of the seven myths, are described in section 3. This paper does not discuss technical details about the particular formal methods used or the way they were used; see [Kar97, Kar98] for these aspects. Moreover, formal methods were only one technique used in the development of Bos. The overall engineering approach and the way diﬀerent methods and techniques were combined to assure the required safetycritical quality, are described [WBG98, WB98]. Testing in Bos is described in more detail in [GWT98], while [CTW99] will give a more systematic analysis of the results of the interviews
with the developers.
Discussion of formal methods and verification has come up a few times here on LtU. In line with the recent discussions on the need for more empirical data in our field, this was an interesting case study on the use of formal methods. The seven myths of formal methods are reviewed in light of a real project:
- Myth 1: Formal methods can guarantee that software is perfect
- Myth 2: Formal methods are all about program proving
- Myth 3: Formal methods are only useful for safety-critical system
- Myth 4: Formal methods require highly trained mathematicians
- Myth 5: Formal methods increase the cost of developmen
- Myth 6: Formal methods are unacceptable to users
- Myth 7: Formal methods are not used on real, large-scale software