Scientific Programming

Google Brain's Jax and Flax

Google's AI division, Google Brain, has two main products for deep learning: TensorFlow and Jax. While TensorFlow is best known, Jax can be thought of as a higher-level language for specifying deep learning algorithms while automatically eliding code that doesn't need to run as part of the model.

Jax evolved from Autograd, and is a combination of Autograd and XLA. Autograd "can automatically differentiate native Python and Numpy code. It can handle a large subset of Python's features, including loops, ifs, recursion and closures, and it can even take derivatives of derivatives of derivatives. It supports reverse-mode differentiation (a.k.a. backpropagation), which means it can efficiently take gradients of scalar-valued functions with respect to array-valued arguments, as well as forward-mode differentiation, and the two can be composed arbitrarily. The main intended application of Autograd is gradient-based optimization."

Flax is then built on top of Jax, and allows for easier customization of existing models.

What do you see as the future of domain specific languages for AI?

kdb+ 3.5 released last month

kdb+ is a real-time time series database, known in the financial services universe as the fastest tick database on the market. It was first conceived by Arthur Whitney at Morgan Stanley as a prototype, and over the last 35+ years has grown to add many features. The database makes such aggressive usage of mmap() POSIX function for mapping file chunks into main memory, to the point where it has exposed issues with the implementation of mmap itself.

Recently, the company now behind kdb+ has also built Kx for DAAS (Data-as-a-Service), which is basically a cloud-based, massively clustered version of kdb+ that deals with the curious oddity that kdb+ is effectively entirely singly threaded. For those interested in reading more about kdb+'s unique cloud architecture (as compared to "big data" solutions like Hadoop), you can read the following whitepapers as suggestive guidelines for how the q community thinks about truly "big data" several orders of magnitude faster and larger than most Hadoop data sets:

While I don't suggest these papers are the blueprint for copying/mimicking the DAAS product, it does help the LtU reader imagine a "different world" of data processing than the often cited Map/Reduce paper and other more mainstream approaches. What is particularly striking is how tiny q.exe (the program that runs kdb+ and provides a CLI for q scripting) is. Language researchers are looking at provably correct C compilers, and it is not a huge leap to think about the world soon seeing provably correct real-time time series databases using kdb+ as an inspiration.

Another curiosity, relevant to us here at LtU, is that kdb+ has its own programming language, q. q is a variant of APL with a special library for statistics. Most "big data" solutions don't have native implementations for weighted average, which is a fairly important and frequently used function in quantitative finance, useful for computing volume weighted average price (VWAP) as well as tilt and weighted spread. q is itself implemented in another language, k. The whole language of each is just a couple lines of (terse) code.

Conservation laws for free!

In this year's POPL, Bob Atkey made a splash by showing how to get from parametricity to conservation laws, via Noether's theorem:

Invariance is of paramount importance in programming languages and in physics. In programming languages, John Reynolds’ theory of relational parametricity demonstrates that parametric polymorphic programs are invariant under change of data representation, a property that yields “free” theorems about programs just from their types. In physics, Emmy Noether showed that if the action of a physical system is invariant under change of coordinates, then the physical system has a conserved quantity: a quantity that remains constant for all time. Knowledge of conserved quantities can reveal deep properties of physical systems. For example, the conservation of energy, which by Noether’s theorem is a consequence of a system’s invariance under time-shifting.

In this paper, we link Reynolds’ relational parametricity with Noether’s theorem for deriving conserved quantities. We propose an extension of System Fω with new kinds, types and term constants for writing programs that describe classical mechanical systems in terms of their Lagrangians. We show, by constructing a relationally parametric model of our extension of Fω, that relational parametricity is enough to satisfy the hypotheses of Noether’s theorem, and so to derive conserved quantities for free, directly from the polymorphic types of Lagrangians expressed in our system.

Interactive scientific computing; of pythonic parts and goldilocks languages

Graydon Hoare has an excellent series of (two) blog posts about programming languages for interactive scientific computing.
technicalities: interactive scientific computing #1 of 2, pythonic parts
technicalities: interactive scientific computing #2 of 2, goldilocks languages

The scenario of these posts is to explain and constrast the difference between two scientific computing languages, Python and "SciPy/SymPy/NumPy, IPython, and Sage" on one side, and Julia on the other, as the result of two different design traditions, one (Python) following Ousterhout's Dichotomy of having a convenient scripting language on top of a fast system language, and the other rejecting it (in the tradition of Lisp/Dylan and ML), promoting a single general-purpose language.

I don't necessarily buy the whole argument, but the posts are a good read, and have some rather insightful comments about programming language use and design.

Quotes from the first post:

There is a further split in scientific computing worth noting, though I won't delve too deep into it here; I'll return to it in the second post on Julia. There is a division between "numerical" and "symbolic" scientific systems. The difference has to do with whether the tool is specialized to working with definite (numerical) data, or indefinite (symbolic) expressions, and it turns out that this split has given rise to quite radically different programming languages at the interaction layer of such tools, over the course of computing history. The symbolic systems typically (though not always) have much better-engineered languages. For reasons we'll get to in the next post.

[..]

I think these systems are a big deal because, at least in the category of tools that accept Ousterhout's Dichotomy, they seem to be about as good a set of hybrid systems as we've managed to get so far. The Python language is very human-friendly, the systems-level languages and libraries that it binds to are well enough supported to provide adequate speed for many tasks, the environments seem as rich as any interactive scientific computing systems to date, and (crucially) they're free, open source, universally available, easily shared and publication-friendly. So I'm enjoying them, and somewhat hopeful that they take over this space.

Quotes from the second:

the interesting history here is that in the process of implementing formal reasoning tools that manipulate symbolic expressions, researchers working on logical frameworks -- i.e. with background in mathematical logic -- have had a tendency to produce implementation languages along the way that are very ... "tidy". Tidy in a way that befits a mathematical logician: orthogonal, minimal, utterly clear and unambiguous, defined more in terms of mathematical logic than machine concepts. Much clearer than other languages at the time, and much more amenable to reasoning about. The original manual for the Logic Theory Machine and IPL (1956) makes it clear that the authors were deeply concerned that nothing sneak in to their implementation language that was some silly artifact of a machine; they wanted a language that they could hand-simulate the reasoning steps of, that they could be certain of the meaning of their programs. They were, after all, translating Russel and Whitehead into mechanical form!

[..]

In fact, the first couple generations of "web languages" were abysmally inefficient. Indirect-threaded bytecode interpreters were the fast case: many were just AST-walking interpreters. PHP initially implemented its loops by fseek() in the source code. It's a testament to the amount of effort that had to go into building the other parts of the web -- the protocols, security, naming, linking and information-organization aspects -- that the programming languages underlying it all could be pretty much anything, technology-wise, so long as they were sufficiently web-friendly.

Of course, performance always eventually returns to consideration -- computers are about speed, fundamentally -- and the more-dynamic nature of many of the web languages eventually meant (re)deployment of many of the old performance-enhancing tricks of the Lisp and Smalltalk family, in order to speed up later generations of the web languages: generational GC, JITs, runtime type analysis and specialization, and polymorphic inline caching in particular. None of these were "new" techniques; but it was new for industry to be willing to rely completely on languages that benefit, or even require, such techniques in the first place.

[..]

Julia, like Dylan and Lisp before it, is a Goldilocks language. Done by a bunch of Lisp hackers who seriously know what they're doing.

It is trying to span the entire spectrum of its target users' needs, from numerical inner loops to glue-language scripting to dynamic code generation and reflection. And it's doing a very credible job at it. Its designers have produced a language that seems to be a strict improvement on Dylan, which was itself excellent. Julia's multimethods are type-parametric. It ships with really good multi-language FFIs, green coroutines and integrated package management. Its codegen is LLVM-MCJIT, which is as good as it gets these days.

Python and Scientific Computing

This interesting blog post argues that in recent years Python has gained libraries making it the choice language for scientific computing (over MATLAB and R primarily).

I find the details discussed in the post interesting. Two small points that occur to me are that in several domains Mathematica is still the tool of choice. From what I could see nothing free, let alone open source, is even in the same ballpark in these cases. Second, I find it interesting that several of the people commenting mentioned IPython. It seems to be gaining ground as the primary environment many people use.

XML feed