Lisp-Stat does not seem to be in good health lately.

The Journal of Statistical Software http://www.jstatsoft.org/ has a Special Volume devoted to the topic: "Lisp-Stat, Past, Present and Future".

In the world of statistics, it appears that XLISP-STAT http://www.stat.uiowa.edu/~luke/xls/xlsinfo/xlsinfo.html has lost out to the S family of languages: S / R / S-plus:

In fact, the S languages are not statistical per se; instead they provide an environment within which many classical and modern statistical techniques have been implemented.

An article giving an excellent overview of the special volume is: "The Health of Lisp-Stat" http://www.jstatsoft.org/v13/i10/v13i10.pdf

Some of the articles describe the declining user base of the language due to defections, whilst other articles describe active projects using XLisp-Stat, often leveraging the power of the language, in particular for producing dynamic graphics.

The S family of languages, originally developed at Bell Labs, has much to recommend it. S is an expression language with functional and class features. However, as the original creator and main developer of XLisp-Stat (and now an R developer), Luke Tierney, explains in "Some Notes on the Past and Future of Lisp-Stat" http://www.jstatsoft.org/v13/i09/v13i09.pdf :

"While R and Lisp are internally very similar, in places where they differ the design choices of Lisp are in many cases superior."


Social reasons

The main cause of the downfall of XLisp-Stat appears to be that the primary maintainer stopped supporting it. At least, that is the impression I get from the papers above.

R is an interesting language. It's like Scheme with a different syntax and more warts. Check out section 10.7 of the Introduction to R:

The discussion in this section is somewhat more technical than in other parts of this document. However, it details one of the major differences between S-Plus and R. [10.7 Scope]

Well, I probably wouldn't wait until section 10 before discussing scope in a programming language tutorial. This nicely illustrates the difference between the programmer and statistician communities. There is no fundamental reason for R, S-Plus, or XLisp-Stat to exist: all their functionality could be easily implemented in any other dynamic language, and then someone else would maintain the infrastructure. However, the focus would be different, and that focus appears to be something the statistical community values (along with its existing investment in statistical code).
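(For what it's worth, the scoping difference the manual is getting at is the lexical one. A small sketch, with made-up names; under R's lexical scoping the inner function sees and updates the local variable, whereas under S-Plus's rules the free variable would be looked up in the global workspace instead:)

make_counter <- function() {
  count <- 0
  function() {
    # lexical scoping: `count` is found in the defining environment
    count <<- count + 1
    count
  }
}

counter <- make_counter()
counter()  # 1
counter()  # 2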

At least R isn't fundamentally a crock, which, in my experience, the other big numerical language, Matlab, most definitely is (and, judging from the discussion of scoping rules, S-Plus is as well).

The R core development team works principally in Scheme

There are some R-related articles on IBM Developerworks that introduce R to developers.

Laird gives a short introduction to S and R:

A Bell Labs team began developing a research project called "S" back in the mid-'70s. Eventually, the project became a full-blown, general-purpose computing language, with rich statistical capabilities. ... Project leader Dr. John Chambers received the ACM Software System Award in 1999, in recognition that, among other achievements, "S has forever altered the way people analyze, visualize, and manipulate data." Among S's many strengths, it "plays nicely" with modules written in such other languages as Fortran and C.

Insightful Corporation sells a commercially successful, widely respected, descendant of S it calls S-PLUS. In the early 1990s, Robert Gentleman and Ross Ihaka of the University of Auckland began work on R, which they released as free software, and which evolved (according to Ihaka) to resemble S quite closely. R's implementation, though, along with a few of its interfaces are entirely different from S and S-PLUS. The R core development team, which Chambers joined in 1997, works principally in Scheme. ...

Language choice in statistics should become less of an issue due to the OmegaHat project. Chambers describes it as a component-based statistical computing environment in which many languages (including Java, R, S, Lisp-Stat, Fortran, and C) can be used together.

A talk covering both of the above is "The R and Omegahat Projects in Statistical Computing" (Ripley, 2001).

A Couple of Notes...

Noel, you're right. We should all just go back to using FORTRAN (actually suggested by some Swedish guy at the Joint Statistical Meetings a few years back). Seriously though, the reason these tools exist, rather than statistical packages simply being tacked onto an existing language (actually a suggestion made in Jan de Leeuw's paper in the above), is that most statisticians are not programmers, nor do they desire to become programmers. The interesting bit is the analysis of data, not writing elegant software or pumping out cool hacks. The bulk of R users will probably never write a package, and some may never move beyond interactive use (though I would suggest to those users that they explore R's "literate analysis" tools). Putting a "better" language (from a language designer's point of view) in front of these users is not likely to impress them. That's the same reason scope isn't introduced until section 10: most users aren't even going to read that document at all (they'll read Modern Applied Statistics in S-PLUS or some similar book).

Hell, having a full programming language may actually be a hindrance. SAS is basically the de facto standard in several fields and, I assure you, that has nothing to do with the quality of its programming language (it's a jumped-up mainframe batch language, IIRC related to PL/I). Among other things (it's really, really good at regression models on large amounts of data, for example), it's trusted by entities like the FDA. Anything the statistician codes is, in some sense, suspect, since there's no assurance that the results you get aren't due to a bug in the code. (Incidentally, this is a point in favor of the open and peer-reviewed development of, at least, the core statistical features of ANY statistical package.)

Anyways, I ramble. Moving on to the Omegahat thing, most of the effort there (due mostly to Duncan Temple Lang) is concerned with bindings from R to other things (Java, Lisp-Stat, Matlab, Perl, Python, etc.). I don't know that the common statistical framework part has really caught on. Personally, I like Luke's common statistical virtual machine idea: there's really no reason why R, Lisp-Stat, and more domain-specific languages like BUGS can't all live in the same VM and share data structures and core routines (or in somebody else's VM, though I think you'd really like to have vector math primitives; there's no such thing as a scalar in R for a reason).

Agreed, but...

I agree with many of your points. I'm not advocating Fortran, but a more modern language like (modern) Scheme or O'Caml, perhaps with some domain-specific tweaks. These languages have several advantages: they are easier to compile (and speed is always a problem in my experience), and high-level operations (such as pattern matching and array comprehensions) make algorithms clearer and hence raise the bar for what is 'obviously' correct.

Right, so, like I said: 99% o

Right, so, like I said: 99% of R users don't care about ease of compilation or " 'obviously' correct" algorithms. For the most part they care about getting their job done. At best the things that make language designers and researchers happy are orthogonal to this task and a hindrance at worst. Nobody ever says "wow, language X's ease of compilation made fitting my non-linear model a breeze!" The only compelling reason for switching away from a special-purpose statistical language to a more general purpose language is to take advantage of the libraries available for the language, not because it makes data analysis any easier.

I Disagree

I disagree with practically every statement you make:

R users don't care about ease of compilation or " 'obviously' correct" algorithms

In the statistical work I've done (e.g. clustering HMMs) speed is really important. If something takes a week or more to run and I can make it run faster with very little effort I'm really happy. Other people I know in the area (statistical machine learning) have the same problem. A high-level language that can be easily compiled would be a real boon. Furthermore, as you say "Anything the statistician codes is, in some sense, suspect since there's no assurance that the results you get aren't due to a bug in the code" so anything that raises the bar for 'obviously correct' algorithms is a big win as well. This is "getting their job done."

At best the things that make language designers and researchers happy are orthogonal to this task and a hindrance at worst.

I've used R and I've used Matlab, and if they had pattern matching and array comprehensions it would be a lot easier to write complex algorithms in them. If they had clean semantics they could be easily compiled; see above.

So, in conclusion, I argue that a better language makes development faster (less code must be written; code has fewer bugs) and gets results faster (code runs faster). Besides, the so-called "special-purpose statistical languages" aren't really that special (with the exception of Mathematica) from a PL point of view. Sure, they tend to have some nice notation for array slices, and overloaded operators for arrays, but that could easily be accommodated in a modern general-purpose language, and modern languages have features far in advance of those in, say, R.
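(To be concrete about the "nice notation" point, this is the sort of thing base R gives you; the matrix is just made-up data:)

m <- matrix(1:12, nrow = 3)   # a 3x4 matrix of made-up data
m[, 2]                        # slice: the second column
m[m > 6]                      # logical indexing: all elements greater than 6
m * 2 + 1                     # overloaded operators apply elementwise to the whole matrix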

social aspects

It's reasonable to disagree with the reasons why R has the primary market share among academic statisticians (among whom I'd include myself), and I wish the situation were different. But if you want to understand the social reasons for why the situation is as it is, I would back Byron up by seconding the opinion that R has so many users because (1) those users could easily "port" legacy code and coding practices (e.g. syntax) from S into R, and (2) for the most part they don't implement CS-style algorithms from scratch; much of the day-to-day work is ad hoc data analysis, so an interactive language with access to a wide array of statistical libraries is paramount.

There are, as you explain, many reasons why it would be good for the statistical community to shift to a syntactically sugared version of OCaml or Haskell or Lisp/Scheme/Qi. But the way to get this to happen is to make it easy for non-specialists to run their code in the new system and only gradually enforce type safety / side-effect control.

For these reasons, transitioning to something like Qi (possibly with even more syntactic sugar, so it looks more like Javascript/R) would be a good first step, in my opinion. And the way to get that to happen is to form a community of motivated statistical programmers to build the language and libraries that would make it possible. This is also sociologically hard, especially in the face of good alternative half-measures like SciPy, which is doing a good job of drawing its own (more Matlab-esque) community.

I'd be happy to discuss things more with anyone interested in a serious effort to develop a serious alternative to R, or lend some of my time to the effort.

Statisticians Should Care

Some statisticians do care a great deal about the quality of programming languages. Also consider, however, that many statisticians work in field-specific environments, such as epidemiology, econometrics, and so on. I worked in psychiatry research, and the code from my colleagues was not programmatic at all (they did not view what they did as "programming" but as "coding"); their code took forever to run, and that was okay with them. I, on the other hand, having been introduced to R, and then to Scheme through the R documentation, tried to write more programmatically. My colleagues basically wrote as if they were at the command line. Consequently their programs were repetitive and slow, as they cut and pasted more than they used loops or other basic programming constructs. They also never used conditionals.

Now I find myself in a biology department where R is very popular, but so is Matlab. Thanks to both these experiences, I agree that most users of this software just "want the job done" and don't care about expressiveness or efficiency. They'll complain, but they won't do anything about it. They don't (unfortunately) see the benefit of doing things more efficiently or expressively, and if they hear about it from anybody like me, it is dismissed as being unimportant, or just too time-consuming to learn.

As to Scheme being the basis of a new statistical computing language: I'd really like to find a Scheme implementation that is good for user-level stuff, the way Common Lisp is. I like Scheme, it's a much cleaner language, but I find it very hard to do even simple stuff like printing in most implementations. Scheme seems mainly aimed at producing low-level code, in compilers and other CS applications. Any suggestions?

No seasoned schemer myself

But, by printing, do you mean formatted output as in SRFI 48?

I was thinking of posting

I was thinking of posting about this in a new topic, but most of what I was thinking is related to Lisp-Stat.

In any event, Ross Ihaka (one of the original developers of R) presented a paper in which he essentially advocates moving away from R and toward a new Lisp-based language. This struck me as a bizarre, ironic twist on the history of statistical computing, especially with respect to R and Lisp-Stat, and I wanted to know more about it.

In searching for this new language that Ihaka refers to, I came across Incanter, a statistical programming system implemented in Clojure. This struck me as even stranger, as Ihaka and Lang's recent paper is mentioned, but not Lisp-Stat, even though Incanter looks essentially like Lisp-Stat ported to Clojure (and might well be).

I'm sort of wondering if there's any reason to suspect that Incanter would be successful where Lisp-Stat wasn't, given that it looks to be essentially the same as the latter. It's been a while since I used Lisp-Stat, and I don't know enough about the underlying Lisp implementations to know whether Incanter would have some fundamental advantage over Lisp-Stat from the Lisp side of things.

Why not focus on better implementations?

I know that R needs a better implementation, preferably a high-quality incremental native-code compiler. An R to Lisp or R to Scheme compiler could fit that bill, assuming the Lisp or Scheme compiler is suitable.

The paper harps on the pass-by-value semantics as a reason to move to an entirely new model, but I'm not convinced that this is an insurmountable obstacle to a high-performance implementation of R.

Does R currently copy its values lazily? That way it's just pass-by-reference if you never mutate the structure, and values (or possibly parts of them) only get copied if they get mutated. I haven't thought this through carefully enough, and I think parts of this scheme would get rather hairy, but I would guess that a good lazy-copy technique would largely ameliorate the overhead of copying.
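(As far as I understand, R already behaves roughly this way: objects are duplicated on modification, not on assignment. A quick way to see it, assuming an R build with memory profiling enabled so that tracemem works, which the standard binaries have:)

x <- c(1, 2, 3)
tracemem(x)    # start reporting when this object gets duplicated
y <- x         # no copy yet: y and x share the same underlying vector
y[1] <- 10     # tracemem reports a duplication here, at the first mutation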

There have been some efforts

There have been some efforts to create a native-code compiler or bytecode compiler, but as far as I know, they've never really gotten off the ground. They've seemed promising, but haven't made it past the experimental stage.

I know of at least two of them:

1. RCC, something that seems to be based on John Garvin's master's thesis, and

2. Some experimental work by Luke Tierney (again, of Lisp-Stat). He has some stuff on his website (http://www.cs.uiowa.edu/~luke/R/bytecode.html and http://www.stat.uiowa.edu/~luke/R/compiler/) and an interesting conference paper, but as far as I know it never got any further. In newsgroup posts, he seems to indicate that usable compiled R isn't likely to appear from him anytime in the foreseeable future.

In general, I get the impression that the R community is focusing on parallelizing things as a way of improving performance. I would go so far as to say that this approaches the level of "we will parallelize things, and that will address the speed issue," but that's overgeneralizing. In their recent paper, Ihaka and Lang discuss why parallelization is not going to solve all of R's performance issues.

Scalar Performance

Ehh... there are many reasons to parallelize something, but I seriously question the wisdom of focusing on parallelization for performance when there are significant scalar gains left on the table.

I've seen both of those experiments. A less conventional approach would use Lisp or Scheme as the target language. ETOS, an experimental Erlang-to-Gambit-Scheme compiler, showed some promising initial results.

Obviously you'd want a Lisp implementation that has particularly efficient floating point support, and a significant portion (a majority?) of the effort would go into providing high performance, high quality numerical algorithms.

translating R into Scheme

Formally speaking, I believe R evaluates arguments lazily and passes promises/thunks as arguments. On this basis, perhaps a translation into monadic Haskell with suitably defined universal types to account for the dynamically typed coding style would be more direct. Can anyone comment on this from a technical perspective?
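(A small illustration of the promise behaviour: an argument that is never used is never evaluated, so the error below never fires. The function name is made up.)

f <- function(x, y) {
  # y's promise is never forced, so the expression passed for it never runs
  x + 1
}
f(10, stop("this never runs"))   # returns 11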

Nevertheless, I think an R-to-Scheme compiler (extending Tierney's work) could fill a useful niche, especially because I think the great majority of *user* R code is compatible with strict, call-by-value evaluation semantics, which would translate well into Scheme. As you mention, if a function "modifies" an argument, it actually modifies a locally scoped copy of the argument (except for environments, which are passed by reference only), but this seems like something that could be translated accordingly.
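(For instance, this is the copy semantics the user sees for ordinary vectors; the names are made up, and environments and other reference objects are the exception:)

f <- function(v) {
  v[1] <- 99   # modifies only the function's local copy of the argument
  v
}
x <- c(1, 2, 3)
f(x)   # returns c(99, 2, 3)
x      # still c(1, 2, 3): the caller's binding is untouched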

A somewhat messier aspect, from the Scheme perspective, is that many of the common library functions operate like special forms, because they textually inspect their arguments, delay their evaluation, or modify their evaluation environment. Additionally, in R terminology, many functions are "generic," which means that they dispatch based on the class of their argument (and there are S3 and S4 object orientation systems in R). I presume that this could be translated, but I don't know how messy it would be.
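(A minimal sketch of what S3 dispatch looks like; the class and function names here are made up:)

area <- function(shape, ...) UseMethod("area")    # generic: dispatches on class(shape)
area.circle <- function(shape, ...) pi * shape$r^2
area.default <- function(shape, ...) stop("no area method for this class")

c1 <- structure(list(r = 2), class = "circle")
area(c1)   # calls area.circle, giving pi * 4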

So, for example, the plot function will look at the class of its arguments to dispatch to the right plotting code, and at the text of those arguments to label the plot. Similarly, statistical modeling functions, like lm for linear models, use a "formula" language, so one can say lm(y~x, data=mydata) to mean regressing y on x using the data in mydata (which could be a data frame (i.e. a table) or an environment, both of which are themselves first-class).
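(Two quick illustrations of those behaviours; label_for is a made-up stand-in for the kind of thing plot does internally:)

label_for <- function(x) deparse(substitute(x))   # grab the *text* of the argument
label_for(height * 2.54)   # "height * 2.54", even though height need not exist

form <- y ~ x   # a formula is an unevaluated object that lm() interprets later
class(form)     # "formula"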

R functions that need some of these special-form type features might benefit from a little manual annotation (regarding which arguments aren't strict) and a customized translation of common idioms. This might be a fairly large undertaking when you consider all the contributed library code.

As far as formal translation goes, though, R uses lazy evaluation ("promises") and allows for a lot of introspection and modification of evaluation environments. In this sense, properly speaking, R functions really are functions and not special forms; it's just that the evaluation semantics are different. In principle, I suppose, one could turn everything into a thunk and emulate the environment structure with Scheme data structures, and on and on.... I could be wrong, but if one really made literal and completely general translations of these language features, my suspicion is that there would be a substantial amount of overhead involved, so the benefits of translation might be disappointing.

Accordingly, a complete translation might be possible, but actually I think the most powerful/practical approach would be to let the programmer select different translation protocols for different blocks of code.

So a good short-term step might be to embed a Scheme compiler into R (Guile?) and allow an R user to annotate blocks of code for translation into Scheme, compilation, and subsequent execution within the R run-time environment. Blocks of code that the programmer certifies will only operate on objects of certain types or obey certain semantics (e.g. strict evaluation, no side effects on external environments) could be given especially efficient translations.

What do you think?

Stack and environment introspection

Guile is an interpreter, and one that doesn't perform very well at that. So that seems like a rather pointless endeavour to me. I'd stick to an incremental compiler such as Larceny or Ikarus.

I don't believe that Haskell's lazy evaluation could be leveraged very effectively here, due to R's ability to combine promises with side-effects, even if the side effects are local.

Honestly, I've only played with R in a rather cursory fashion, though I have promised a friend to learn more R over the next few months for a possible collaboration we've been kicking around.

Objects and promises shouldn't be too hard. I'm a little less clear on what exactly your third point entails; it's something like macros, but not exactly. It sounds like they involve run-time computations, which I would not know how to deal with very well.

After a quick stroll through R's somewhat sketchy language "definition", what worries me is the environment and stack introspection facilities. I'm not clear how R uses the environment introspection to do something akin to macros, or if you are referring to something else entirely.

In my estimation, this would be quite an obstacle to an efficient R to Scheme translation.

Lua?

I also don't know R at all well (I've played with it a little, but that's it). But the notion of environment introspection makes me wonder if a language like Lua would be a better target than a language like Scheme. The "everything is a dictionary" mentality which is so prevalent in Lua (and Python, Ruby, etc.) is very much absent in Scheme.

dictionaries in R

Hmmm. I'm not sure, although I see your point. R environments do seem analogous to Python/Lua dictionaries. Surely there's a Scheme way to express this too, right? (mutable hash tables) Oh, but your point is that then you wouldn't be able to inspect the global environment and such without additional overhead (to maintain the imitation global environment as a Scheme-level object after every assignment), unless the Scheme implementation exposed its own environment-based implementation directly. Do any schemes do that?

I should clarify, though, that most user-level programmers don't use environments. And, except when it's really necessary, environments would be used, not for introspection into the system environment, but just as a data-structure. For this reason I still think most user-level code would translate nicely into Scheme. It might not be *decidable* that a naive translation would correctly preserve the semantics, but the author of the code could stipulate that they didn't do anything "evil," and then it would.

(R environments also have parents that they "inherit" from, so I think technically they'd map onto Python classes better than dictionaries as far as that goes; I'm not sure about Lua.)
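(A quick sketch of that parent lookup, using only base R; the variable names are made up:)

parent_env <- new.env()
assign("shared", "from parent", envir = parent_env)
child_env <- new.env(parent = parent_env)
get("shared", envir = child_env)                        # found by searching the parent
exists("shared", envir = child_env, inherits = FALSE)   # FALSE: not defined locally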

Idiomatic R code mainly uses "list" data structures anyway which operate like a dictionary (they can map keys into values), but they are immutable (I guess copy on mutate would be a more precise way to explain it). It's confusing because the syntax is so similar, so I'll give an example.

Python:
>>> d1={'a':1, 'b':2}; d2=d1; d1['a']=3;
result: d1=d2={'a': 3, 'b': 2}
R using lists: (different semantics, despite similar appearance)
> l1=list(a=1,b=2);l2=l1;l1$a=3;
result: l1=list(a=3,b=2), l2=list(a=1,b=2)
R using environments: (like Python)
e1=new.env();assign('a',1,env=e1);assign('b',2,env=e1);
e2=e1;assign('a',3,env=e1);
result: e1=e2 and e1$a=3 and e1$b=2

Introspection

R environments do seem analogous to Python/Lua dictionaries.

Well, I don't know enough R or Lua to really say... but an efficient translation from R to Lua would be little more than an academic exercise given the current Lua implementation. Although somebody's working on an LLVM-based implementation of Lua, so this might someday be practical. It sounds like they are currently up against similar kinds of challenges to those we are discussing.

Do any schemes do that?

I suspect we are on the same page when we use the word "environment", but just to be sure, environments map variable names to values. A naive environment-passing interpreter might construct a new environment on entrance to a function, and restore the previous environment when a function returns, say by using a stack of hash tables, or a persistent association list or tree.

Yes, some Schemes do actually work this way, but they are toys that aren't really worth using. Instead of representing environments via explicit data structures, good compilers represent environments implicitly: some values are represented on the stack and others are stored in optimized, heap-allocated closures.

You could write a naive R to Scheme compiler that explicitly uses hashtables to maintain R environments, and run a fancy compiler on the result... but then you've destroyed one of the biggest reasons to use the fancy compiler.

R6RS does have some support for environments as first-class values, but they are limited. Their intended purpose seems to be for creating custom REPLs and extending applications in Scheme via "eval". Chez has something similar.

But, as I understand R, (and please do elaborate or correct me!) it sounds like you can basically get ahold of the environment you are currently in, as well as the environment of the calling function, and so on up the call stack.
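(From my limited poking around, something like the following seems to work, though I'd welcome correction; the function names are made up:)

show_caller <- function() {
  list(
    call = sys.call(-1),        # the call one frame up the stack, i.e. outer_fn()
    vars = ls(parent.frame())   # names bound in the calling function's environment
  )
}
outer_fn <- function() {
  secret <- 42
  show_caller()
}
outer_fn()   # $call is `outer_fn()` and $vars contains "secret"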

There is one deliciously evil hack that occurs to me: abusing a built-in debugger that you can get programmatic control over. With a little luck, you could support introspection without paying the price until you actually use it. This might enable one to implement environment introspection while still making good use of the implicit environments a fancy compiler generates.

Chez includes a good debugger, and it might have enough features to support this kind of misadventure, but Chez is proprietary and very expensive. I don't know about Larceny, and Ikarus is new and doesn't have a debugger yet.

I should clarify, though, that most user-level programmers don't use environments. And, except when it's really necessary, environments would be used, not for introspection into the system environment, but just as a data-structure.

I guessed as much, and one could legitimately ask whether or not we could get away with dropping support for introspection... after all, we are translating to Scheme, so it's likely that we'd support TCO and thus change the semantics of stack introspection (although not doing TCO when using a compiler that does TCO is easy enough).

But, from a sociological point of view, the more drop-in the replacement is, the better. Prospective users need their extension packages... and while it may or may not be realistic to support the same foreign API, it would be nice that "pure" R packages would just work.

Another possibility would be a conservative, static analysis for when code won't use introspection, and choose a translation strategy accordingly, but due to R's object-oriented method dispatch, I don't think you could make the analysis accurate enough to be useful without also making it unsound.

R allows people to write imperatively, while prohibiting some of the worst transgressions. The current implementation does have the curious sociological benefit of encouraging users to learn to write more functionally, by severely penalizing imperative code. In all fairness, it penalizes some functional idioms too.

Got here by accident

Looking for a sort of map for R->Lisp correspondences.
I'm presently using R to solve some problems. I've mostly used it for data interaction in the past, as it is very crufty and unpleasant, to say nothing of slow, for building large applications. It's worth noting that most R packages have Fortran or C intestines.

I looked into XLispStat for another project. Too dead for me to risk. Instead I wrote it in Lush, which actually has rather a lot of statistical functionality built in, as it's a DSL for machine learning problems. Chicken would have been a good choice also, if it had any numerics written for it. It's bad enough reinventing exponential smoothing without also having to invent matrix math, as I would have had to in Chicken (someone please correct me if I'm wrong: too late now, but it would be nice to know if there is a Chicken environment that knows about, like, the floating-point part of the CPU).

FWIW, the only reason I didn't use Lush for my present project is the lack of SQL intestines. While other Lisps do have SQL, they don't have enough floating-point stuff to be useful.

Lush

Any opinions on Lush (especially regarding string processing)?

Papers inaccessible

All the jstatsoft.org links are 404. Do I need a subscription?

New links

The location for the PDFs seems to have changed slightly, but they are still freely available. Just go to http://jstatsoft.org/v13 for the issue's table of contents, and use its links to individual papers.