Back to the Future: Lisp as a Base for a Statistical Computing System

Back to the Future: Lisp as a Base for a Statistical Computing System by Ross Ihaka and Duncan Temple Lang, and the accompanying slides.

This paper was previously discussed on comp.lang.lisp, but apparently not covered on LtU before.

The application of cutting-edge statistical methodology is limited by the capabilities of the systems in which it is implemented. In particular, the limitations of R mean that applications developed there do not scale to the larger problems of interest in practice. We identify some of the limitations of the computational model of the R language that reduces its effectiveness for dealing with large data efficiently in the modern era.

We propose developing an R-like language on top of a Lisp-based engine for statistical computing that provides a paradigm for modern challenges and which leverages the work of a wider community. At its simplest, this provides a convenient, high-level language with support for compiling code to machine instructions for very significant improvements in computational performance. But we also propose to provide a framework which supports more computationally intensive approaches for dealing with large datasets and position ourselves for dealing with future directions in high-performance computing.

We discuss some of the trade-offs and describe our efforts to realize this approach. More abstractly, we feel that it is important that our community pursue more ambitious, experimental and risky research to explore computational innovation for modern data analyses.

Footnote:
Ross Ihaka co-developed the R statistical programming language with Robert Gentleman. For those unaware, R is effectively an open-source implementation of the S language, of which S-PLUS is the commercial implementation. R is something of a lingua franca of statistics, and you can usually find R code in the back of many Springer-Verlag monographs.

Duncan Temple Lang is a core developer of R and has worked on the core engine for TIBCO's S-PLUS.

Thanks to LtU user bashyal for providing the links.


R was discussed here several

R was discussed here several times in the past: #1, #2, #3.

Lisp-stat

Also, lisp-stat has been discussed here.

cf Incanter (Clojure)

This paper is cited on the home page of the Incanter project, "a Clojure-based, R-like statistical computing and graphics platform for the JVM".

That's how this thread got

That's how this thread got started... ;-)

ETL

Maybe it is just me, but when I worked in R I really hated how everybody had their own ad-hoc Perl scripts to do Extract, Transform & Load of statistical data, because R has no facilities for transforming streams of input data from, say, a file. It was never clear why they were dropping data from the result set, etc.

everybody had their own

everybody had their own ad-hoc Perl scripts to do Extract, Transform & Load of statistical data, because R has no facilities for transforming streams of input data from, say, a file.

I'm not exactly sure what you're referring to by this, but R does have facilities for doing so.
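
For instance (a minimal sketch; the file name and sizes are made up), base R's connections already let you read incrementally rather than slurping the whole file:

    # open a connection and read the next N lines only
    con <- file("big_input.csv", open = "r")     # also works with gzfile(), url(), pipe(), ...
    first_lines <- readLines(con, n = 1000)
    close(con)

    # or read a specific slice of rows, skipping what you don't need
    slice <- read.csv("big_input.csv", skip = 1000, nrows = 1000, header = FALSE)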

However, the problem, which is alluded to in the paper, is that R doesn't handle large quantities of data well because of memory-handling issues, which then necessitates another language for preprocessing. Perl is well suited for this. It's a memory-use issue, not a facility issue per se.

However, I might have misunderstood what you meant.

Not stream-based. For ETL,

Not stream-based.

For ETL, if it is not stream-based, you are running into the wait-for-push bottleneck. If you wait for a complete push of data into an intermediate structure before entering the next processing phase, then you are screwed.
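
To make the bottleneck concrete: instead of a single read.csv() of the whole file followed by a second processing pass, a stream-based approach folds each chunk into running summaries as it arrives, so no phase waits for the complete dataset. A minimal sketch (the file name and the "value" column are made up):

    # accumulate a running mean without ever materializing the full dataset
    con <- file("measurements.csv", open = "r")
    header <- readLines(con, n = 1)              # keep the column names
    total <- 0
    count <- 0
    repeat {
      lines <- readLines(con, n = 10000)         # next chunk of raw lines
      if (length(lines) == 0) break              # end of file
      tc <- textConnection(c(header, lines))
      chunk <- read.csv(tc)
      close(tc)
      total <- total + sum(chunk$value, na.rm = TRUE)
      count <- count + sum(!is.na(chunk$value))
    }
    close(con)
    running_mean <- total / count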

There was something else about how R did things that bothered me with regard to ETL, but I can't put my finger on it. It may have been the lack of support for packaging the ETL step as a reusable "package" that you could parameterize with external values (in short, dependency injection), but I can't remember whether that was simply me not taking the time to learn how to do it.

ETL->MySQL->R

Since there are various ETL processes for MySQL, and R can access MySQL pretty easily, wouldn't it be easier to ETL the data into MySQL first?

The R Data Import/Export manual on CRAN does talk about various ways to perform import/export.
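
As a sketch of that route (the database, credentials, table and column names here are all hypothetical), once the data has been loaded into MySQL by whatever ETL tool you prefer, R can pull in just the pre-aggregated slice it needs via DBI and RMySQL:

    library(DBI)
    library(RMySQL)
    con <- dbConnect(MySQL(), dbname = "stats", host = "localhost",
                     user = "analyst", password = "secret")
    # let the database do the heavy filtering/aggregation before R sees the data
    monthly <- dbGetQuery(con, "SELECT month, AVG(value) AS mean_value
                                  FROM measurements
                                 GROUP BY month")
    dbDisconnect(con)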

Good Enough?

I guess my point was that if you use a VM like the JVM, then you can use an open-source package to standardize on ETL, and put ETL directly into the language. MySQL also may not make the most sense here; an in-memory SQL db with ETL would potentially be better... The problem is that you are then edging into basically building a BI product :)
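
For the in-memory SQL flavour without leaving R at all (not JVM-based, just to illustrate the idea; the table and data are invented), RSQLite can hold an SQLite database entirely in memory:

    library(DBI)
    library(RSQLite)
    con <- dbConnect(SQLite(), ":memory:")       # database lives only in RAM
    dbWriteTable(con, "measurements",
                 data.frame(month = c("jan", "jan", "feb"),
                            value = c(1.2, 3.4, 2.8)))
    # transform inside the in-memory database, then pull the result into R
    dbGetQuery(con, "SELECT month, AVG(value) AS mean_value
                       FROM measurements GROUP BY month")
    dbDisconnect(con)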