## Draining the Swamp: Micro Virtual Machines as Solid Foundation for Language Development

Draining the Swamp: Micro Virtual Machines as Solid Foundation for Language Development
Kunshan Wang, Yi Lin, Stephen Blackburn, Michael Norrish, Antony Hosking
2015

Many of today's programming languages are broken. Poor performance, lack of features and hard-to-reason-about semantics can cost dearly in software maintenance and inefficient execution. The problem is only getting worse with programming languages proliferating and hardware becoming more complicated. An important reason for this brokenness is that much of language design is implementation-driven. The difficulties in implementation and insufficient understanding of concepts bake bad designs into the language itself. Concurrency, architectural details and garbage collection are three fundamental concerns that contribute much to the complexities of implementing managed languages. We propose the micro virtual machine, a thin abstraction designed specifically to relieve implementers of managed languages of the most fundamental implementation challenges that currently impede good design. The micro virtual machine targets abstractions over memory (garbage collection), architecture (compiler backend), and concurrency. We motivate the micro virtual machine and give an account of the design and initial experience of a concrete instance, which we call Mu, built over a two year period. Our goal is to remove an important barrier to performant and semantically sound managed language design and implementation.
Inside you will find the specification of an LLVM-inspired virtual instruction set with a memory model (enables proper GC support) including a specification of concurrent weak-memory operations (reusing C(++)11, a debatable choice), relatively rich control-flow primitive (complete stack capture enabling coroutines or JIT-style de-optimization), and live code update.

## Pycket: A Tracing JIT For a Functional Language

Pycket: A Tracing JIT For a Functional Language
Spenser Bauman, Carl Friedrich Bolz, Robert Hirschfeld, Vasily Krilichev, Tobias Pape, Jeremy Siek, and Sam Tobin-Hochstadt
2015

We present Pycket, a high-performance tracing JIT compiler for Racket. Pycket supports a wide variety of the sophisticated features in Racket such as contracts, continuations, classes, structures, dynamic binding, and more. On average, over a standard suite of benchmarks, Pycket outperforms existing compilers, both Racket’s JIT and other highly-optimizing Scheme compilers. Further, Pycket provides much better performance for proxies than existing systems, dramatically reducing the overhead of contracts and gradual typing. We validate this claim with performance evaluation on multiple existing benchmark suites.

The Pycket implementation is of independent interest as an application of the RPython meta-tracing framework (originally created for PyPy), which automatically generates tracing JIT compilers from interpreters. Prior work on meta-tracing focuses on bytecode interpreters, whereas Pycket is a high-level interpreter based on the CEK abstract machine and operates directly on abstract syntax trees. Pycket supports proper tail calls and first-class continuations. In the setting of a functional language, where recursion and higher-order functions are more prevalent than explicit loops, the most significant performance challenge for a tracing JIT is identifying which control flows constitute a loop -- we discuss two strategies for identifying loops and measure their impact.

## The Unison Programming Platform

Unison - a next-generation programming platform, by Paul Chiusano:

• Programs are edited in a (browser-based) semantic editor which guarantees programs are well-formed and typecheck by construction
• The codebase is a purely functional data structure
• The program is a UI, and UI interaction is programming
• Persistent data sources must be accessible via a high-level, typed API

An interesting project mentioned in a comment a few weeks ago, it now has its own website and a more descriptive abstract overview explaining it's core premises.

Previous posts on Paul's blog are also of interest, and some feature videos demonstrating some aspects of Unison.

## BER MetaOCaml -- an OCaml dialect for multi-stage programming

BER MetaOCaml -- an OCaml dialect for multi-stage programming
Oleg Kiselyov
2010-2015-

BER MetaOCaml is a conservative extension of OCaml for writing programs that generate programs''. BER MetaOCaml adds to OCaml the type of code values (denoting program code'', or future-stage computations), and two basic constructs to build them: quoting and splicing. The generated code can be printed, stored in a file -- or compiled and linked-back to the running program, thus implementing run-time code optimization. A well-typed BER MetaOCaml program generates only well-scoped and well-typed programs: The generated code shall compile without type errors. The generated code may run in the future but it is type checked now. BER MetaOCaml is a complete re-implementation of the original MetaOCaml by Walid Taha, Cristiano Calcagno and collaborators.

Introduction to staging and MetaOCaml

The standard example of meta-programming -- the running example of A.P.Ershov's 1977 paper that begat partial evaluation -- is the power function, computing x^n. In OCaml:

let square x = x * x let rec power n x = if n = 0 then 1 else if n mod 2 = 0 then square (power (n/2) x) else x * (power (n-1) x) 

[...]

In MetaOCaml, we may also specialize the power function to a particular value n, obtaining the code which will later receive x and compute x^n. We re-write power n x annotating expressions as computed now' (when n is known) or later' (when x is given).

let rec spower n x = if n = 0 then .<1>. else if n mod 2 = 0 then .<square .~(spower (n/2) x)>. else .<.~x * .~(spower (n-1) x)>.;; 
A brief history of (BER) MetaOCaml

As MetaOCaml was being developed, new versions of the mainline OCaml were released with sometimes many fixes and improvements. The MetaOCaml team tracked new OCaml releases and merged the changes into MetaOCaml. (The MetaOCaml version number has as its base OCaml's release version.) The merge was not painless. For example, any new function in the OCaml compiler that dealt with Parsetree (AST) or Typedtree has to be modified to handle MetaOCaml extensions to these data structures. The merge process became more and more painful as the two languages diverged. For instance, native code compilation that first appeared in MetaOCaml 3.07 relied on SCaml, a large set of patches to OCaml by malc at pulsesoft.com to support dynamic linking. OCaml 3.08 brought many changes that were incompatible with SCaml. Therefore, in MetaOCaml 3.08 the native compilation mode was broken. The mode was brought back in the Summer 2005, by re-engineering the SCaml patch and implementing the needed parts of dynamic linking without any modification to the OCaml code. The revived native compilation has survived through the end.

[...]

BER MetaOCaml has been re-structured to minimize the amount of changes to the OCaml type-checker and to separate the kernel' from the user-level'. The kernel is a set of patches and additions to OCaml, responsible for producing and type-checking code values. The processing of built code values -- so-called `running' -- is user-level. Currently the user-level metalib supports printing, type-checking, and byte-compiling and linking of code values. Users have added other ways of running the code, for example, compiling it to machine code, C or LLVM -- without any need to hack into (Meta)OCaml or even recompile it.

[...]

By relying on attributes, the feature of OCaml 4.02, BER N102 has become much closer integrated with OCaml. It is instructive to compare the amount of changes BER MetaOCaml makes to the OCaml distribution. The previous version (BER N101) modified 32 OCaml files. The new BER N102 modifies only 7 (that number could be further reduced to only 2; the only file with nontrivial modifications is typing/typecore.ml). It is now a distinct possibility that -- with small hooks that may be provided in the future OCaml versions -- MetaOCaml becomes just a regular library or a plug-in, rather being a fork.

## sml-family.org

In his blog, Bob Harper, in joint effort with Dave MacQueen and Lars Bergstrom, announces the launch of sml-family.org:

The Standard ML Family project provides a home for online versions of various formal definitions of Standard ML, including the "Definition of Standard ML, Revised" (Standard ML 97). The site also supports coordination between different implementations of the Standard ML (SML) programming language by maintaining common resources such as the documentation for the Standard ML Basis Library and standard test suites. The goal is to increase compatibility and resource sharing between Standard ML implementations.

The site includes a history section devoted to the history of ML, and of Standard ML in particular. This section will contain a collection of original source documents relating to the design of the language.

## Inferring algebraic effects

We present a complete polymorphic effect inference algorithm for an ML-style language with handlers of not only exceptions, but of any other algebraic effect such as input & output, mutable references and many others. Our main aim is to offer the programmer a useful insight into the effectful behaviour of programs. Handlers help here by cutting down possible effects and the resulting lengthy output that often plagues precise effect systems. Additionally, we present a set of methods that further simplify the displayed types, some even by deliberately hiding inferred information from the programmer.

Pretnar and Bauer's Eff has made previous appearances here on LtU. Apart from the new fangled polymorphic effect system, this paper also contains an Eff tutorial.

## Cost semantics for functional languages

There is an ongoing discussion in LtU (there, and there) on whether RAM and other machine models are inherently a better basis to reason about (time and) memory usage than lambda-calculus and functional languages. Guy Blelloch and his colleagues have been doing very important work on this question that seems to have escaped LtU's notice so far.

A portion of the functional programming community has long been of the opinion that we do not need to refer to machines of the Turing tradition to reason about execution of functional programs. Dynamic semantics (which are often perceived as more abstract and elegant) are adequate, self-contained descriptions of computational behavior, which we can elevate to the status of (functional) machine model -- just like "abstract machines" can be seen as just machines.

This opinion has been made scientifically precise by various brands of work, including for example implicit (computational) complexity, resource analysis and cost semantics for functional languages. Guy Blelloch developed a family of cost semantics, which correspond to annotations of operational semantics of functional languages with new information that captures more intentional behavior of the computation: not only the result, but also running time, memory usage, degree of parallelism and, more recently, interaction with a memory cache. Cost semantics are self-contained way to think of the efficiency of functional programs; they can of course be put in correspondence with existing machine models, and Blelloch and his colleagues have proved a vast amount of two-way correspondences, with the occasional extra logarithmic overhead -- or, from another point of view, provided probably cost-effective implementations of functional languages in imperative languages and conversely.

This topic has been discussed by Robert Harper in two blog posts, Language and Machines which develops the general argument, and a second post on recent joint work by Guy and him on integrating cache-efficiency into the model. Harper also presents various cost semantics (called "cost dynamics") in his book "Practical Foundations for Programming Languages".

In chronological order, three papers that are representative of the evolution of this work are the following.

Parallelism In Sequential Functional Languages
Guy E. Blelloch and John Greiner, 1995.
This paper is focused on parallelism, but is also one of the earliest work carefully relating a lambda-calculus cost semantics with several machine models.

This paper formally studies the question of how much parallelism is available in call-by-value functional languages with no parallel extensions (i.e., the functional subsets of ML or Scheme). In particular we are interested in placing bounds on how much parallelism is available for various problems. To do this we introduce a complexity model, the PAL, based on the call-by-value lambda-calculus. The model is defined in terms of a profiling semantics and measures complexity in terms of the total work and the parallel depth of a computation. We describe a simulation of the A-PAL (the PAL extended with arithmetic operations) on various parallel machine models, including the butterfly, hypercube, and PRAM models and prove simulation bounds. In particular the simulations are work-efficient (the processor-time product on the machines is within a constant factor of the work on the A-PAL), and for P processors the slowdown (time on the machines divided by depth on the A-PAL) is proportional to at most O(log P). We also prove bounds for simulating the PRAM on the A-PAL.

Space Profiling for Functional Programs
Daniel Spoonhower, Guy E. Blelloch, Robert Harper, and Phillip B. Gibbons, 2011 (conference version 2008)

This paper clearly defines a notion of ideal memory usage (the set of store locations that are referenced by a value or an ongoing computation) that is highly reminiscent of garbage collection specifications, but without making any reference to an actual garbage collection implementation.

We present a semantic space profiler for parallel functional programs. Building on previous work in sequential profiling, our tools help programmers to relate runtime resource use back to program source code. Unlike many profiling tools, our profiler is based on a cost semantics. This provides a means to reason about performance without requiring a detailed understanding of the compiler or runtime system. It also provides a specification for language implementers. This is critical in that it enables us to separate cleanly the performance of the application from that of the language implementation. Some aspects of the implementation can have significant effects on performance. Our cost semantics enables programmers to understand the impact of different scheduling policies while hiding many of the details of their implementations. We show applications where the choice of scheduling policy has asymptotic effects on space use. We explain these use patterns through a demonstration of our tools. We also validate our methodology by observing similar performance in our implementation of a parallel extension of Standard ML

Cache and I/O efficient functional algorithms
Guy E. Blelloch, Robert Harper, 2013 (see also the shorter CACM version)

The cost semantics in this last work incorporates more notions from garbage collection, to reason about cache-efficient allocation of values -- in that it relies on work on formalizing garbage collection that has been mentioned on LtU before.

The widely studied I/O and ideal-cache models were developed to account for the large difference in costs to access memory at different levels of the memory hierarchy. Both models are based on a two level memory hierarchy with a fixed size primary memory (cache) of size $$M$$, an unbounded secondary memory, and assume unit cost for transferring blocks of size $$B$$ between the two. Many algorithms have been analyzed in these models and indeed these models predict the relative performance of algorithms much more accurately than the standard RAM model. The models, however, require specifying algorithms at a very low level requiring the user to carefully lay out their data in arrays in memory and manage their own memory allocation.

In this paper we present a cost model for analyzing the memory efficiency of algorithms expressed in a simple functional language. We show how many algorithms written in standard forms using just lists and trees (no arrays) and requiring no explicit memory layout or memory management are efficient in the model. We then describe an implementation of the language and show provable bounds for mapping the cost in our model to the cost in the ideal-cache model. These bound imply that purely functional programs based on lists and trees with no special attention to any details of memory layout can be as asymptotically as efficient as the carefully designed imperative I/O efficient algorithms. For example we describe an $$O(\frac{n}{B} \log_{M/B} \frac{n}{B})$$ cost sorting algorithm, which is optimal in the ideal cache and I/O models.

## Safely Composable Type-Specific Languages

Cyrus Omar, Darya Kurilova, Ligia Nistor, Benjamin Chung, Alex Potanin, and Jonathan Aldrich, "Safely Composable Type-Specific Languages", ECOOP14.

Programming languages often include specialized syntax for common datatypes (e.g. lists) and some also build in support for specific specialized datatypes (e.g. regular expressions), but user-defined types must use general-purpose syntax. Frustration with this causes developers to use strings, rather than structured data, with alarming frequency, leading to correctness, performance, security, and usability issues. Allowing library providers to modularly extend a language with new syntax could help address these issues. Unfortunately, prior mechanisms either limit expressiveness or are not safely composable: individually unambiguous extensions can still cause ambiguities when used together. We introduce type-specific languages (TSLs): logic associated with a type that determines how the bodies of generic literals, able to contain arbitrary syntax, are parsed and elaborated, hygienically. The TSL for a type is invoked only when a literal appears where a term of that type is expected, guaranteeing non-interference. We give evidence supporting the applicability of this approach and formally specify it with a bidirectionally typed elaboration semantics for the Wyvern programming language.

## .NET Compiler Platform ("Roslyn")

The .NET Compiler Platform (Roslyn) provides open-source C# and Visual Basic compilers with rich code analysis APIs. You can build code analysis tools with the same APIs that Microsoft is using to implement Visual Studio!

In a nutshell: OPEN SOURCE C# COMPILER. Putting aside possible practical implications of this for the .NET ecosystem, I think it is good for programming language geeks to be able to peruse the source code for compilers and language tools.

For the debate about MS being evil, you can head directly to HN where you'll also find an explanation of what bootstrapping a compiler means.

## Determinism Is Not Enough: Making Parallel Programs Reliable with Stable Multithreading

Junfeng Yang, Heming Cui, Jingyue Wu, Yang Tang, and Gang Hu, "Determinism Is Not Enough: Making Parallel Programs Reliable with Stable Multithreading", Communications of the ACM, Vol. 57 No. 3, Pages 58-69.

We believe what makes multithreading hard is rather quantitative: multithreaded programs have too many schedules. The number of schedules for each input is already enormous because the parallel threads may interleave in many ways, depending on such factors as hardware timing and operating system scheduling. Aggregated over all inputs, the number is even greater. Finding a few schedules that trigger concurrency errors out of all enormously many schedules (so developers can prevent them) is like finding needles in a haystack. Although Deterministic Multi-Threading reduces schedules for each input, it may map each input to a different schedule, so the total set of schedules for all inputs remains enormous.

We attacked this root cause by asking: are all the enormously many schedules necessary? Our study reveals that many real-world programs can use a small set of schedules to efficiently process a wide range of inputs. Leveraging this insight, we envision a new approach we call stable multithreading (StableMT) that reuses each schedule on a wide range of inputs, mapping all inputs to a dramatically reduced set of schedules. By vastly shrinking the haystack, it makes the needles much easier to find. By mapping many inputs to the same schedule, it stabilizes program behaviors against small input perturbations.

The link above is to a publicly available pre-print of the article that appeared in the most recent CACM. The CACM article is a summary of work by Junfeng Yang's research group. Additional papers related to this research can be found at http://www.cs.columbia.edu/~junfeng/