SML# targets LLVM

I will always have a soft spot in my (otherwise cold, desolate, inchoate) heart for SML.

The project's main page is unreachable for me just now (the Google cached version is still available, if you like), but the last announcement was in April of this year (2014), so I hope the project is still alive.

SML# is an extension of Standard ML with practically important features, including record polymorphism, seamless interoperability with C, true separate compilation and linking, and native multithread support on multicore CPUs.

The most notable change in SML# version 2 is that the SML# compiler now works with the LLVM Compiler Infrastructure. The new SML# compiler compiles SML# code including all of the above features to LLVM IR code and produces native code through LLVM. More than half of the compilation phases and library modules have been rewritten for the LLVM support. These changes also greatly speed up the compilation processes.

The major difficulties we have overcome in SML#-LLVM codegen are the treatment of polymorphism and separate compilation with SML#'s native ("unboxed") data representations. This aspect requires both a certain amount of additional type theoretical development specific to the LLVM target and careful constructions of LLVM codegen. We hope to report this somewhere.


There actually were two presentations about SML# at ICFP last September, one at the main conference, and one at the ML Workshop. The former was about SML# in industry, the latter about their LLVM backend. Papers and videos should be linked from the conference website, e.g. you can watch the latter talk here.

My take-away was that targeting LLVM from high-level languages is possible. But it's not really designed for it: some constructs require jumping through hoops, and implementing anything GC-related involves substantial amounts of trial-and-error and is going to be slow. It's cool that they pulled it off.

safepoint placement work

Philip Reames from Azul has been working recently on major changes to the GC-interface exposed by LLVM (see discussions on his blog). I'm not sure whether that means Azul plans to invest in LLVM development, but if one large-ish player starts relying on the GC-related part of LLVM, that may change the current development dynamics (mostly C-like-languages-centric) in interesting ways.

It seems serious

Philip's contributions are being taken seriously enough that they have landed in LLVM 3.6 (fork to happen ~Jan 2015), and there is already motion underway to deprecate the previous mechanism.

I've got a question in to Philip about adjusting his work so that it can be used by JIT engines. For the moment, it looks like the stack map and the safepoint PCs aren't accessible to a JIT, in much the same way that they weren't in the gc.root approach. It also looks like the "fix" is just as easy.

I think there are a lot of reasons to want a from-scratch back-end that is designed with the requirements of managed languages in mind from the start. It is my impression that C-- never got critical mass, and there are some interesting opportunities to be realized from a type-preserving compilation.

That being said, LLVM with statepoints is now good enough that creating a new backend optimizer for managed languages can be set aside as a separable problem. At the very least, LLVM is good enough to be used as an intermediate path for language bring-up. What would be nice is a decent tutorial on how to use it properly. At this point I'm planning to target LLVM with BitC v1, so I'll try to keep notes as I go.

I'm highly amused at the whole idea in LLVM of using %alloca for everything, arranging for an alias to get lost, and then running the mem2reg pass. I wonder if people realize that this is basically the Henderson precise GC technique in disguise? I also wonder if making that connection clearer wouldn't go a fair ways toward helping people use the LLVM support more effectively. Either way, the conceptual model in statepoints is cleaner, and that can only help.
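For readers who haven't met the Henderson technique, here is a toy model of its central idea as I understand it: every activation keeps its GC roots in an explicit record linked to its caller's record, so the collector can find all roots by walking that chain instead of decoding the native stack. This is a sketch in Python, not LLVM code; the names Obj, Frame, mark and demo are made up for illustration.

```python
class Obj:
    """A heap object with outgoing references (hypothetical toy heap)."""
    def __init__(self, name, refs=()):
        self.name = name
        self.refs = list(refs)


class Frame:
    """One activation's root slots, linked to the caller's frame."""
    def __init__(self, parent, nslots):
        self.parent = parent
        self.slots = [None] * nslots


def mark(top_frame):
    """Walk the chain of frames and return the names of all live objects."""
    live = set()
    frame = top_frame
    while frame is not None:
        todo = [slot for slot in frame.slots if slot is not None]
        while todo:
            obj = todo.pop()
            if obj.name not in live:
                live.add(obj.name)
                todo.extend(obj.refs)
        frame = frame.parent
    return live


def demo():
    b = Obj("b")
    a = Obj("a", [b])
    c = Obj("c")
    d = Obj("d")
    caller = Frame(None, 1)
    caller.slots[0] = d
    callee = Frame(caller, 2)
    callee.slots[0] = a
    callee.slots[1] = c
    callee.slots[1] = None  # the only root for c is cleared, so c is garbage
    return mark(callee)
```

Calling demo() reports a, b and d as live and leaves c out. The point of the analogy is that the root slots live in memory the collector can reach by following plain pointers, which is roughly the role the allocas play in the gc.root scheme.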


Pure and Julia both use LLVM exclusively as their backends, unfortunately in incompatible ways at present (Pure can import and produce bitcode files, but Julia can't). Both are impure functional high-level dynamically typed languages with garbage collection: Pure is about general term rewriting, whereas Julia is about classes and multimethods.

both look interesting

Julia has some nice features.

My question about general term rewriting is how do you make a system that does what you want that's actually confluent (i.e. converges on an answer). I remember naive Prolog programs that never find an answer, and I suspect that naive Pure programs will have that problem worse.


Could you clarify what you mean by "confluent" here? In the context of rewriting I'm thinking of confluence as a determinism property (also called "the Church-Rosser property"), which is not related to always answering in finite time.
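To make the distinction concrete, here is a small sketch with two made-up abstract rewriting systems (not Pure code; the rule tables and function names are hypothetical). The first always stops but can reach two different answers, so it terminates without being confluent. The second has only one possible rewrite at each step, so it is trivially confluent, yet it never reaches an answer.

```python
from typing import Dict, List, Set

# Each symbol maps to the symbols it can be rewritten to in one step.
TERMINATING_NOT_CONFLUENT: Dict[str, List[str]] = {"a": ["b", "c"]}
CONFLUENT_NOT_TERMINATING: Dict[str, List[str]] = {"loop": ["loop"]}


def reachable(rules: Dict[str, List[str]], start: str) -> Set[str]:
    """All terms reachable from start in zero or more rewrite steps."""
    seen = {start}
    todo = [start]
    while todo:
        term = todo.pop()
        for nxt in rules.get(term, []):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def normal_forms(rules: Dict[str, List[str]], start: str) -> Set[str]:
    """Reachable terms that cannot be rewritten any further."""
    return {t for t in reachable(rules, start) if not rules.get(t)}
```

Here normal_forms(TERMINATING_NOT_CONFLUENT, "a") contains both b and c, while normal_forms(CONFLUENT_NOT_TERMINATING, "loop") is empty.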

Pure programs are not guaranteed either confluent or terminating

For example, the rule loop = loop; obviously is not terminating, but is a valid Pure program. As for confluence, Pure has a fixed order of evaluation, namely left-to-right and inward out (subexpressions are rewritten before the expressions in which they are contained), and rules are applied in the order given. So for example fact 0 = 1; fact n = n * fact (n - 1); terminates on all non-negative integer values of n, but if you put the rules in the opposite order it will never terminate on any value of n, because the base case will never fire.
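The effect of rule order can be imitated outside Pure. The following is only a rough Python model of first-match rule selection, not how the Pure interpreter works; base_rule, step_rule, fact and the fuel limit are all invented for the illustration. The fuel limit stands in for "never terminates" so that the example can actually be run.

```python
def base_rule(n):
    """Models: fact 0 = 1"""
    return ("value", 1) if n == 0 else None


def step_rule(n):
    """Models: fact n = n * fact (n - 1); matches any n."""
    return ("recurse", n - 1)


def fact(rules, n, fuel=50):
    """Apply the first matching rule repeatedly.

    Returns the result, or None if fuel rewrite steps are used up.
    """
    pending = []  # multipliers waiting for the inner call to finish
    for _ in range(fuel):
        for rule in rules:
            result = rule(n)
            if result is not None:
                break
        else:
            raise ValueError("no rule matches")
        kind, payload = result
        if kind == "value":
            acc = payload
            for m in reversed(pending):
                acc *= m
            return acc
        pending.append(n)
        n = payload
    return None
```

With the rules in the original order, fact([base_rule, step_rule], 5) gives 120. With the order swapped, fact([step_rule, base_rule], 5) runs out of fuel and returns None, because step_rule matches every n and the base case is never tried.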