Why and How People Use R

Compelling lecture by John Cook

Abstract:

R is a strange, deeply flawed language that nevertheless has an enthusiastic and rapidly growing user base. What about R accounts for its popularity in its niche? What can language designers learn from R's success?

Slightly disappointed

I did not watch the video but read the slides that are easily readable on the web. Thanks for the efforts to make it easily accessible.

I was hoping for some ideas that could be transferred out of the specific statistics domain into general domain-specific language design considerations. I was a bit disappointed because I don't think I got that out, but it may very well be that one must watch the recorded talk, instead of just reading the slides, to get the whole picture.

The slides say that R is popular despite being ugly, slow and not well tooled, because it is convenient, free, and has an efficient global code sharing place.

Two points of interest for me were the mention that R comes from the research community (Bell Labs, then University of Auckland according to Wikipedia), and that it was designed by statisticians. It seems obvious in retrospect, but still, I think it could be fun to go see domain experts¹, ask them to basically invent a programming language for their domain, probably by giving concrete examples, and from that work out a way to develop domain-specific libraries in principled ways.

¹: with as little programming experience as possible, to avoid interference from previous knowledge restricting innovation. My wild, unsubstantiated guess would be that if you teach {Java,C,Python} to domain experts and then ask them to "invent a language for their field", you are likely to get {Java,C,Python} with small variations.

Faster Horses

> My wild, unsubstantiated guess would be that if you teach {Java,C,Python} to domain experts and then ask them to "invent a language for their field", you are likely to get {Java,C,Python} with small variations.

According to H. Ford, when doing such an interview you get "faster horses".

This claim is not entirely unsubstantiated. I'm working in a domain where we occasionally purchase test tools from vendors because of the test suites. The tests were developed by tool vendors in proprietary languages which are rip-offs of Pascal, VB or JavaScript, depending on the time and age the designer looked at the programming language pool. The languages are astoundingly bad and so are the test suites. They are so bad actually that it makes sense to develop a complete test suite on one's own, if only the language was appropriate and one had a working implementation. The time spent for this is not needed to debug those 3rd party tools, of which some are ridiculously buggy but are nevertheless mandated in the certification process. It's so far beyond quality control that it borders on fraud. All of this happens in a niche market, so it might not be too interesting as a business model and it also can't go open source, because the specs are proprietary.

If you watch the video

If you watch the video you'll hear more comments about what's good about R for its specific domain. The thing I found most impressive was its incredibly terse syntax for fitting a regression model. Obviously this is irrelevant for any other domain; it's just fantastic for statisticians. I guess the take-home message from that is that a DSL will be successful if it matches its domain really well, no matter how bad it is in more general terms.

The video also has some excellent comments about how even programs which are so buggy that they crash all the time (by which the speaker doesn't mean R, but some other statistical programs) will do well if they enable people to get their work done.

Jan Vitek and his students

Jan Vitek and his students are looking at R a lot recently. Check out their ECOOP 2012 paper: Evaluating the Design of the R Language. Abstract:

R is a dynamic language for statistical computing that combines lazy functional features and object-oriented programming. This rather unlikely linguistic cocktail would probably never have been prepared by computer scientists, yet the language has become surprisingly popular. With millions of lines of R code available in repositories, we have an opportunity to evaluate the fundamental choices underlying the R language design. Using a combination of static and dynamic program analysis we can assess the impact and success of different language features.

According to the above

According to the above presentation, this misses the point. The language design aspects they consider are largely irrelevant. What matters is how easy it is to get started and do common tasks like linear regression, handling data sets, etc. As he puts it succinctly "DSL: D, not L".

Their own conclusions support this. On the choice of old vs. new object system in R:

The simplest object system provided by R is mostly used to provide printing method for different data types. The more powerful object system is struggling to gain acceptance.

They also conclude that performance is abysmal, but R is still wildly popular. This is of course not such good news for language designers.

Jan talked to me about this

Jan talked to me about this when he visited in prep for PLDI 2012 (everyone please come to Beijing...). He totally gets that and is pragmatic about what R is, why it is successful, and believes there is a lot of fruit to be had in having PL expertise focus on R rather than thumb our noses at it. There is especially low-hanging fruit to be had in the area of performance.

I don't think the language design choices are irrelevant, just that many other factors are at play here. "Batteries Included" (D) drives more adoption than language features do, but L tends to be important in the long run.

Right, I actually agree that

Right, I actually agree that language design is important (and the presenter likely does too). But for adoption it is probably not that important, especially if your audience does not consist of people who are primarily programmers. As you say, in the long run it leads to problems. That said, the presenter does hit the nail on the head with some of the problems I had with Python/Numpy/Scipy/Matplotlib not being domain specific enough (in the context of physics simulation, not stats). It's just very nice to be able to install 1 standard thing, open a REPL and type plot(...) and get a plot instead of having to install a separate package which has intricate ways of plotting that requires documentation hunting, an import statement and a couple of lines to imperatively assemble a plot from its pieces. It is also nice to be able to say A \ b instead of "from scipy.sparse.linalg import spsolve; spsolve(A,b)" for sparse matrices, "from numpy.linalg import solve; solve(A,b)" for square dense matrices, something else for non-square matrices, etc.

Language designers can help to design the libraries in a more consistent way. For example a mini language for specifying optimization problems, statistical models, etc. Even on the very small scale they can help. One thing that bothered me in numpy for example is that its vectorization is inconsistent. Many primitive operations are extended to work on arrays. For example you can apply sin(x) to an array and it will apply it to each element. But this principle is not applied consistently. If you transpose a 2 dimensional array, then it will do what you want. But if you transpose a 3 dimensional array it will transpose the outer dimension instead of treating it as an array of 2-dimensional arrays and transposing each of inner ones. I think a language designer with an eye for these things paired together with a domain expert can achieve the best things.

One thing that shader

One thing that shader languages (HLSL and GLSL) do right is universal vectorization of built-in functions, but this is more of a performance concern than an elegance issue (still, it helps a lot). Bling was extremely consistent here, and most functions were designed to work over vectors and matrices of arbitrary lengths and ranks (the hacks I had to do in C#'s type system to get that to work were quite crazy though). Designing a math library is quite fun.

We could push much more aggressively on library support in languages. I think that there should be a language and just ONE library, installed alongside the language and its dev runtime/environment. There is really no such thing as just a language anymore; people want the whole package programmer experience in one click. If you must have a package management system because your lib ecosystem is too large, make that a part of the library and mostly transparent to the programmer. If people need to share code, do that through the library (using ACLs to protect the code from people you don't want to share with), ditto for versioning. Ok, I've said this many times before.

And of course, you'll need the right language abstractions to make that really large library viable. This is where language design can play a huge role: you want adoption, your users want libraries, so bootstrap that network effect in your language.

You can do this in Ruby like

You can do this in Ruby like this ;)

def import lib
  require lib
rescue LoadError
  `gem install #{lib}`
  require lib
end

import 'sinatra'
import 'json'

The `foo` runs shell command "foo" (this is called system("foo") in C).

It's an interesting idea and it could really lower the barrier to entry for domain-specific programmers. Getting stuff up and running is often more work and definitely more frustrating than actually writing the small amount of domain-specific code that uses the libraries.

Not what Sean was asking for

Sean is looking for a system that either provides a uniform, very broad standard library, or else provides strong language support for packaging such that a standard corpus of third-party libraries looks essentially like it always exists on the system.

I'm sure plenty of people have noticed that current packaging systems require manual or semi-automated extraction of dependency information that's present in source code, object code headers, and build files, and then redundant storage of this information in some packaging DSL. I think part of Sean's question is "have we seen language features that would better allow languages like R to leverage some large trusted corpus of libraries by reducing the information that must be repeated between source code and the packaging system?"

Jules's proposed solution may work to sweep under the rug some mistakes made by package maintainers, but doesn't allow the packaging system to statically resolve and install dependencies. For one, the proposal breaks if the program is installed while connected to the Internet, but is then run in a remote area or restricted environment. For another, I'm not sure if gems is usually setuid root or will automatically install packages in some alternate location if the user running the program has sane (restricted) permissions. More importantly, there appears to be no information available about versioning of libraries or versioning of interfaces/interface semantics. This sort of information is very important in packaging systems, but Jules's proposal doesn't make it explicit and doesn't provide a mechanism for automatically inferring this information.

Right, it was just a joke,

Right, it was just a joke, though it is not hard to imagine an import function that logs its use to a configuration file that is later used for bundling the libraries with the final program. Of course, this doesn't do anything for the problem that Sean is trying to tackle, namely programming with those huge libraries.

It's not that bad of a

It's not that bad of a solution actually, and it gets the point across quite clearly. And Javascript allows programmers to reference libraries by URL, right?

I'm assuming the physical bit problem is easy enough to solve; we have the internet at any rate. But we liked the internet a whole lot more when we could find useful things in it (Google).

What long run?

Which long run are you referring to? Do you mean that adoption will increase later? Or that language features are more important for some type of domain?

R and Matlab

Vector/matrix math intrinsics, built-in plotting tools, and read/eval/print interactivity were important features for the areas where S-Plus/R and Matlab became popular. Here is a link to a pdf comparing them. My suggested approach to the question would be to divide and conquer: Q1) Why are those features listed above paramount in some application domains? and Q2) What were the important differentiating features in the competition between languages with those features - include Mathematica and other competitors listed here under the heading "language oriented". In the case of R, being a free and open source implementation of S-Plus and getting to be the preferred choice of an active community of academic developers during a time when data mining and machine learning were really taking off as application areas were probably important contributing factors.

Why people stopped using XLISP-STAT

To understand the appeal of R, it would also be helpful to look at why there was the rapid abandonment of XLISP-STAT in the 1990s. There was much hand-wringing over why people would dump a stable and well-documented system in favor of R. Several articles were written and I think there was even a whole journal issue devoted to it. My own take on the situation was that people had been using XLISP-STAT because it was the only powerful and free system available. People switched as soon as there was an alternative that wasn't Lisp.

Expression scope

If I had to pick one defining feature of R, it would be its call-by-value-but-not-really. For example, if you define an R structure foo with fields bar and baz (technically, it's a "data frame", but bear with me), you can then say

qplot(bar, baz, data=foo)

Notice the lack of quotes. In R, every expression in the position of a function parameter is, by default, unevaluated. In addition, R lets you *change* the expression scope of an unevaluated expression. So qplot can say, in effect, "get the first two parameters, and evaluate them in this new scope I just constructed". You can use R structures as dictionaries for scope lookup as well; think Javascript-with on steroids.

Here's what the R language manual has to say about it (this will make a few of you cringe):

[...] R is a functional programming language and allows dynamic creation and manipulation of functions and language objects, and has additional features reflecting this fact.

So in practice, this is an extremely convenient way to quickly build little ASTs to be passed into functions. This is used very widely in libraries (so you can fit or plot sqrt(x) instead of x in a model, but where "sqrt" is evaluated in a different scope and it might mean something entirely new)
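For readers who don't know R, here is a rough Python analogue of the idea, not R's actual mechanism. Python evaluates arguments eagerly, so the sketch passes the expression as a string and evaluates it against the "data frame" later. The names `eval_in_frame` and `helpers` are made up for illustration, and a plain dict of columns stands in for a data frame.

```python
import math


def eval_in_frame(expr, frame):
    """Evaluate the expression string `expr` with the columns of `frame` as variables.

    Hypothetical helper: the callee, not the caller, decides which scope the
    expression sees, which is loosely what qplot does with its arguments in R.
    """
    # The callee is free to give "sqrt" its own meaning; here it maps over a column.
    helpers = {"sqrt": lambda column: [math.sqrt(v) for v in column]}
    # Globals hold the helpers (and no builtins); locals hold the columns.
    return eval(expr, {"__builtins__": {}, **helpers}, dict(frame))


# A dict of columns standing in for the data frame `foo` with fields bar and baz.
foo = {"bar": [1.0, 4.0, 9.0], "baz": [10, 20, 30]}

xs = eval_in_frame("sqrt(bar)", foo)  # "bar" is looked up in foo, not in the caller
ys = eval_in_frame("baz", foo)
```

The string-plus-eval approach is only a stand-in: in R the caller writes an ordinary expression and the callee captures it unevaluated, so no quoting is needed at the call site.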

Of course, it also means that debugging is as bad as you're thinking. But it's a fun, crazy little model to think about.

Lazy evaluation

This sounds a bit like a DSL I've used, and in this DSL there was some handy use of this "pass by AST" calling convention. For one, it allows ordinary functions to act a bit like macros. The common boilerplate

if key not in cache:
    cache[ key ] = ExpensiveFunction( key )
value = cache[ key ]

could be replaced by the much easier to read

value = ComponentEnsure( cache, key, ExpensiveFunction( key ) )

where inside ComponentEnsure, the third argument is lazily evaluated in the scope one frame up the call stack. There are things not to like about the DSL, but the calling convention's ability to allow writing of ComponentEnsure as just another library function is nice. Of course, the power of the callee to determine argument strictness and evaluation scope should be used very judiciously.
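In a language with strict arguments the closest equivalent is to pass the third argument as an explicit thunk. A minimal Python sketch, assuming a zero-argument lambda in place of the DSL's lazy argument (`component_ensure` and `expensive_function` are illustrative names, not the DSL's):

```python
def component_ensure(cache, key, compute):
    """Return cache[key], calling the thunk `compute` only on a cache miss."""
    if key not in cache:
        cache[key] = compute()
    return cache[key]


calls = []  # records every real invocation of the expensive function


def expensive_function(key):
    calls.append(key)
    return key * 2


cache = {}
# The lambda delays the call; it closes over the caller's scope, which plays the
# role of "evaluated in the scope one frame up the call stack".
first = component_ensure(cache, 21, lambda: expensive_function(21))
second = component_ensure(cache, 21, lambda: expensive_function(21))
```

The cost compared with the DSL is the visible `lambda:` at every call site; the benefit is that the caller can see which arguments may never be evaluated.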

Python's decorators are simple

Python's decorator feature is simple yet amazingly useful syntactic sugar.

def memo(f):
  table = dict()
  def fmemo(x):
    if x not in table:
      table[x] = f(x)
    return table[x]
  return fmemo

@memo
def fib(n):
  if n <= 1: return n
  else: return fib(n-1) + fib(n-2)

fib(100) # this is quick

Not only sugar

There must be something else "behind the curtain"; otherwise there is no obvious link between 'fmemo' and fib: fib's code doesn't call fmemo.

fmemo is returned from memo.

fmemo is returned from memo. It is the memoized version of f. The syntactic sugar works like this:

@f
def g(X): Y

===>

g = f(lambda X: Y)

This ensures that recursive calls to g also call the wrapped version. Note that f does not even have to return a function, and that f can be an expression itself, so it's quite flexible. For example, a common use is to route a request to the function that produces the HTML response in a web server:

@route('/foo/bar')
def foobar(req): ...

Python decorators

A Python decorator is used to re-bind the function it decorates. Consider something like

@foo
def bar():
    return baz()

This is de-sugared into

def bar():
    return baz()
bar = foo(bar)

Since the memo-decorator returns the fmemo function, fib will be re-bound to that function (which in turn calls the original fib function).

Side-effecting sugar



@foo
def bar(args):
  code

can be understood as


def bar(args):
  code
bar = foo(bar)

The latter assignment really is a mutation, not a variable shadowing. All further calls to `bar`, including those coming from 'code', will use the updated rather than the original definition.
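The difference between rebinding the name and merely wrapping the function under another name can be observed by counting calls. This is a small sketch reusing the `memo` function from earlier in the thread; the `calls` counter and the names `slow_fib` and `fast_alias` are made up for the illustration.

```python
def memo(f):
    table = {}

    def fmemo(x):
        if x not in table:
            table[x] = f(x)
        return table[x]

    return fmemo


calls = {"rebound": 0, "aliased": 0}


def fib(n):
    calls["rebound"] += 1
    return n if n <= 1 else fib(n - 1) + fib(n - 2)


# What "@memo" does: the name fib now refers to the wrapper, so the recursive
# calls inside the original body also go through the cache.
fib = memo(fib)


def slow_fib(n):
    calls["aliased"] += 1
    return n if n <= 1 else slow_fib(n - 1) + slow_fib(n - 2)


# Wrapping under a different name: only the outermost call is cached, because
# the body still calls the unwrapped slow_fib.
fast_alias = memo(slow_fib)

rebound_result = fib(20)
aliased_result = fast_alias(20)
```

Both calls return the same value, but the rebound version runs the original body once per distinct argument, while the aliased version still makes the full exponential number of recursive calls.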

If you wanted to be fully explicit about mutation, in ML you would write:

let memo f =
  let cache = Hashtbl.create 19 in
  fun x ->
    try Hashtbl.find cache x
    with Not_found ->
      let result = f x in
      Hashtbl.add cache x result;
      result

(* tie the knot *)
let fib = ref (fun _ -> assert false);;
fib := function
  | 0 | 1 -> 1
  | n -> !fib (n - 1) + !fib (n - 2);;

(* this is slow *)
!fib 35;;

(* decorate *)
fib := memo !fib;;

(* this is fast *)
!fib 100;;

Indeed, if you want to do that without mutation, you should formulate
this as a fixpoint operation:
`fib_derec : (int -> int) -> (int -> int)`,
`fib = fixpoint(memo(fib_derec))`.
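A rough Python rendering of that fixpoint formulation, under the assumption that memoization and the fixpoint are folded into a single helper (the name `memo_fixpoint` is made up here). The cache is still a mutable table; what goes away is the rebinding of the name `fib`.

```python
def fib_derec(self, n):
    """Open-recursive Fibonacci: recursion goes through `self`, not a global name.

    Uses the same base cases as the ML version above (fib 0 = fib 1 = 1).
    """
    return 1 if n <= 1 else self(n - 1) + self(n - 2)


def memo_fixpoint(f):
    """Tie the recursive knot of `f` through a memoizing wrapper."""
    cache = {}

    def fixed(n):
        if n not in cache:
            # Pass the memoized function itself as the recursive reference.
            cache[n] = f(fixed, n)
        return cache[n]

    return fixed


# fib is bound exactly once; nothing is reassigned afterwards.
fib = memo_fixpoint(fib_derec)
result = fib(35)
```

Because `fib_derec` never names `fib`, the same open function can be closed with a plain (non-memoizing) fixpoint or a tracing one without being edited.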

You can also do this: let

You can also do this:

let rec fib = memo (fun n -> ...)

I'm not sure about ML but in F# I think you can (can't test now), and in Python you definitely can. This hides the mutation, and will result in a run time error if you try to call fib before memo returns (which memo does not do).

You can't do that in OCaml

You can't do that in OCaml at least, because the type-checker has a restrictive analysis of recursive definitions: it accepts only those that are statically known not to force yet-uninitialized recursive values, which of course forbids arbitrary function application (one would need a finer type system for this, which has been discussed previously in this thread).

Funnily enough, OCaml recursive modules do not enforce this static safety, so you can write:

module rec M : sig val fib : int -> int end = struct
  let fib = memo (function
    | 0 | 1 -> 1
    | n -> M.fib (n - 1) + M.fib (n - 2))
end

A bit like macros, exactly

Yes, that is precisely what it feels like. And I suspect this is at least partly why the language is described as being inspired by Scheme.

Reminds me of fexprs

Promises remind me of Kernel's fexprs. The differences are that

  • fexprs don't receive operands bundled together with their environment, but rather receive the environment in which the fexpr is called as a separate parameter
  • there is no implicit forcing - a fexpr has to decide on its own whether and when to evaluate an operand.

Paper on R

Our paper on R is here.

It's a strange world.

Thanks. This is

Thanks. This is interesting.

By the way, I always found it ironic that in many cases we find languages with huge libraries, and hence wide adoption, that are sub-optimal for writing libraries (think C); while languages presumably better suited for writing libraries, often have very limited library support (think Ada). I hardly think this is a coincidence.

Difficult to say

> I hardly think this is a coincidence.

Maybe, but I don't think so; for me it's the cost of the Ada tools which hindered Ada adoption: if the DoD had had the foresight to start GNAT in the 80s and not in the 90s, then it's possible that we would use Ada instead of C++ in many cases.

I so didn't want to focus on

I so didn't want to focus on the Ada case... In general I agree that the adoption of Ada was to a large extent hindered by extra-linguistic factors.

To put more flesh on my provocative statement: I think that languages that privilege library writing at the expense of day-to-day coding (of the kind relevant to the particular language community), will not be adopted, and libraries don't get written for languages that are not used. I think this tension comes up nicely in Jan's paper.

Unsure

I don't know: Java seems to me as a 'not very good' language which has been used to write lots of libraries and is quite successful.

So which languages in your opinion focus on library writing instead of day-to-day coding? What are the criteria?

Java is grist for my mill.

Java is grist for my mill. Your first sentence mirrors my argument.

Scala vs Java

Scala vs Java might be a good contrast...

What contrast are you

What contrast are you referring to?

Scala's support for powerful abstractions makes writing grand libraries (DSLs even) possible. But then the complexity that these abstractions seem to attract discourages mass adoption (those libraries aren't so easy to use), so in a way it is both better and worse than Java.

Basically, yes.

Ehud: "To put more flesh on my provocative statement: I think that languages that privilege library writing at the expense of day-to-day coding (of the kind relevant to the particular language community), will not be adopted..."

Scala seems to "privilege library writing" while Java is optimized for "day-to-day coding." I won't make any claims regarding whether the one comes "at the expense of" the other, but that's certainly part of the blogospheric argument over Scala adoption.

Yes, I suppose that's a good

Yes, I suppose that's a good example (though not an egregious case). To put more flesh on the "at the expense of" part: The point is not that there is a necessary trade off; merely that when day-to-day coding (of the relevant kind) is considered difficult, down the road advantages are seldom enough to get adoption.

Divorcing the adoption of

Divorcing the adoption of Scala from the adoption of Java ignores much of Scala's design and success, and especially when considering libraries.

I wonder if we are talking about the same thing?

Scala definitely benefits from an existing Java ecosystem of Java libraries and such. But my point was that Scala's value add, its more powerful/typeful abstractions, has attracted a more high-brow ecosystem that many mainstream Java users aren't so excited about. Whereas less expressive languages seem to attract more down-to-earth ecosystems (at least with respect to their targeted users).

If I understand correctly,

If I understand correctly, you're saying that one problem with Scala is that it attracts Haskell developers that scare the other people with blogs about union types, continuation-passing style, or higher-kinded types.

Could you be more specific about what you call an "ecosystem" here? Is it the to-the-outside communication (are the Haskell-Scala people so noisy because they're so good at communicating, is their visibility proportional to their size in the wider Scala community?), or the internal discussion lists, or maybe also the released libraries (besides Scalaz, do you have examples of visible "high-brow" Scala software)?

Do we have any recommendation about how to manage a disparate community of language users from different backgrounds? Do some language features facilitate finding common ground, or allow one to protect oneself from different and unknown exterior practices? Is this related to the "only one way to do it" vs. "more than one way to do it" debate?

Ecosystem = libraries +

Ecosystem = libraries + users (the plants and the animals that eat them).

My exposure to Scala in the last couple of years has been through PL-interest blog posts, so I could be way off here, but things like collections 2.8 really seemed like way too much to me. Some libraries are down to earth, like Play, so what do I know?

I think you are asking the wrong questions. The community (and ecosystem) will grow organically around the language and you can't really predict the end result. But you can definitely shape the ecosystem, by including (or excluding) features.

Scala 2.8 collections

That's an ... interesting point of view. I believe almost anyone who has actually worked with Scala 2.8 collections believes they are the ecosystem's biggest asset.

wrong audience

The question was not whether programmers using Scala like the collections. Rather, it was whether there is a reason more programmers are not adopting Scala, and whether the collections are a factor. Those are significantly different questions.

The point is not if Scala is

The point is not if Scala is appreciated by its base, but how it appears to not-yet potential users, if you are aiming for mass adoption. Higher-kinded types are too much for me personally to take in, they could be completely reasonable for other people.

Scala definitely has a very rich ecosystem, this wasn't the argument. The question is...does the nature of the ecosystem prevent mass adoption. For example, imagine if all your users were Haskell refugees, then Scala libraries would gravitate toward more Haskell-like interfaces, third-party documentation would push lazy functional programming, monads, and so on, non-Haskell programmers would begin to stay away. There is also platform bias, being only on the JVM, which could attract more back-end web developers than say, UI programmers (who wants to build Swing UIs anymore?), hence a focus on build tools vs. nice UX.

[replying to both Sean and

[replying to both Sean and Patrick]. My main frustration with the collections discussion is that collections in Scala after 2.8 are really simple. They are much more intuitive and uniform than collections in any other library I know. Almost everyone who has even a little bit of practical experience with them confirms this.

So it's much more a matter of perception than anything else. Btw: Higher kinded types are not a part of client facing collection APIs. Their only usage is if you want to pull in a helper class to construct a rich collection factory. That's completely optional and reserved for people implementing collections, people using collections do not see these types at all.

Perception and Culture

Yes, for better or worse, perception and culture matter at least as much as, probably more than, a rational technical evaluation when it comes to programming language adoption.

extra-linguistic factors

> In general I agree that the adoption of Ada was to a large extent hindered by extra-linguistic factors.

It seems to me that in almost all cases, language adoption (or lack thereof) is largely driven by extra-linguistic factors... Java/C# being prime examples.

And, presumably, the

And, presumably, the argument will be that the "good" languages did not succeed because they lacked such support, or were hindered by extra-linguistic factors. So why wasn't CLU a great success? Why didn't Modula-2 (or, heck -3) conquer the world? I could go on, but I think the point is clear.

CLU is an easy case; as far

CLU is an easy case; as far as I know, it was largely confined to the (relatively uncommon) Honeywell architecture that the CLU team was using. Certainly, being available for the right architectures at the right times is hugely important, though certainly not sufficient to guarantee success either.

I really can't speak to Modula-2, other than my impression is that it was largely overshadowed by its predecessor, Pascal, which itself was soon overshadowed by C. (Though, I suppose why this occurred is an interesting question...)

Modula-3, I'm a little unsure. The SRC implementation was open source and gratis and available for a wide variety of architectures, though Digital never supported and promoted it the way Sun did Java, which largely overshadowed it. And it was open source at the right time, though it never developed the community to sustain the language the way say, the GNU Compiler Collection or Python did.

I think the SRC implementation really did miss an opportunity with the open source route, by not embracing the concept fully enough. I think if you could go back in time to 1994-95 and convince Digital to release it under a GPL-compatible license so that it might become an official part of the GCC, pay attention to distributing it on Linux and FreeBSD, and invest resources into developing a community and making it easier to contribute patches to the mainline compiler, we might be living in a different world today. (And, if that doesn't work, implement a Modula-3 compiler free of Digital's license.) And heck, because you are from the future, you should promptly set up something like CTAN/CPAN/CRAN while you are at it.

Of course I'd want to add support for explicit tail calls and functional closures, but the point is this almost certainly wouldn't make a difference with respect to adoption. Also, if you had a trip back to 1994-ish to influence PL, would you spend it on Modula-3? I seriously doubt I'd bother.

Worse is better

Languages like R, Javascript, PHP, have a razor focus on their domains and don't worry about getting "it right." It's good enough just to be good enough, release at the right time, and be useful. Notice that many of these languages weren't even designed professionally on big budgets (counterexamples being C# and Java); they were designed by people with some knowledge of existing work and a very good understanding of the problems in their domain. Also note that marketing didn't have as much to do with it as we often claim; these languages were just good for their domains.

PL research is more useful as a feeder into production languages. Hopefully a research language will capture the right person's attention at the right time (e.g., Scheme/Self's influence on Javascript). That is how we should define success.

Yep.

Yep.

Active research

Hopefully there are more ways for people from the research community to have impact on those "right persons"? You basically describe lying around and making noise in the hope of just being lucky. Could research results also have impact during an open design process (or does your notion of "good enough" imply that no one hears of the language until the design is settled?), or help the later evolution of the language?

(Javascript current evolution is, since a few years already, informed by a lot of people that know about language design; the R thread mentioned ongoing work that could have impact. I would like to say the same for PHP, but I really don't know.)

Guy Steele calls for care in "Growing a language". Should we study more the question of "Fixing a language"?

We put out knowledge, the

We put out knowledge, the knowledge gets disseminated (maybe slowly) but eventually, if it's good enough, will have some influence. Our goals as PL researchers should be to experiment with language features, build new languages that push the envelope in extreme ways, and then reflect on what we've done.

I think that "Growing a Language" has not come to pass as the "right way;" even "fixing a language" is very limited. The so-called committees you talk about have limited influence: Javascript will evolve but not in any drastic ways; its initial design stands and all we can do is move it a little one way or the other. There is no heavy duty language design going on there; a lot of bike shedding perhaps.

PL research has the most impact when the language is initially conceived and designed, at which point the language is not on the radar of language researchers, who would join committees for already popular languages as a matter of prestige. There is no such thing as a viable open design process; design-by-committee fails often, and almost all successful languages have been designed by a small number of people (often = 1) in a closed process.

"getting it right"

Languages like R, Javascript, PHP, have a razor focus on their domains and don't worry about getting "it right."

Sure, that's one path of language adoption, but what about Python, Lua, or Ruby? None had a particularly specific domain in mind, at least not in the way that R, PHP, or JavaScript did. Yet Python is comparable in popularity to PHP and JavaScript, Ruby isn't too far behind, and Lua is more popular than R.

Also, Python, Ruby, and particularly Lua are concerned with "getting it right". JavaScript and R too, to a lesser extent. PHP seems to be the one language that has an outright hostile attitude towards "the right thing".

I should have said "don't

I should have said "don't obsess over elegance" rather than "don't worry about getting it right."

I had the pleasure of hearing Roberto talk about Lua in March; the slides are here. Some of the trade-offs made would make us academic/research PL designers cringe (can't we be more elegant than assoc arrays everywhere?), but it turns out to be a very good combination for their targeted domain; there is a lot of thought put into it.

Huh? Lua was explicitly made

Huh? Lua was explicitly made for embedded systems (and, similar to Scala's success in finance, was bolstered by government policies). I thought Python originated as a new OS scripting language. I don't know about the original design of Ruby, but it effectively didn't matter until Rails (which was the real domain-specific design).

Lua did well on its own

Perhaps Lua was helped by public money in the beginning (like many other languages), but since the game developers found out about it, it seems to have done all right on its own merits.

Lua also has the advantage of being able to rid itself of bad design decisions by moulting.

It wasn't just public money

It wasn't just public money for building the language, but a restriction on languages that local developers could use. Likewise, I suspect that the embedding concerns it was designed for carried over to the game space.

I'd go further with the molting: adoption has important natural social benefits in terms of technology evolution such as social learning and adaptation.

Why Ruby

This is an old comment but...

At the time Ruby was invented there was a scripting war between Perl and Python. While both had limited object-oriented support, neither was fully object-oriented. Ruby was created to offer all the power of the scripting languages, including functional manipulations, but with an object-oriented structure.

javascript is razor sharp focussed on a domain?

What domain is javascript focussed on like a razor? Of the reasons I can list for its popularity, that would not make the top 10, if it would make the list at all.

I'm not sure what your point

I'm not sure what your point is. JS definitely has a "domain," web programming, and contains abstractions that support that (at least, you can define libraries like jQuery). It's turned out pretty well for them also.

Luke Tierney, the main

Luke Tierney, the main author of XLispStat who switched to R at some point, has slides on his website about some work he is doing to improve the efficiency of R. He'd likely be interested in a copy of your paper.

Sorry for the mis-threading

Sorry for the mis-threading - I meant that as a reply to Jan Vitek

Luke is indeed...

Luke is indeed working on a bytecode implementation of R. We talk from time to time. The fact that he, as a statistician, has to learn how to implement a bytecode compiler points to our failure as a community. He should be spending his energy on problems that are relevant to his domain rather than re-inventing Java bytecode. And we should be providing him with an efficient implementation as well as advice on how to evolve the language.

R, like many scripting languages, has the advantage of a low barrier to entry for users. The statistics department at Purdue teaches R in one week. (For comparison, we take four months to teach Java.) But as pointed out above, it is a lousy language for writing libraries. It has no concurrency. No encapsulation. Etc.

It is a shame that we can't come up with a way to nudge the large R user base towards a better programming model.

Just for *very* simple tasks

Despite being a statistician, I feel it's good only if you need to perform some task which is already implemented *exactly* the way you want in some library, and to do it in a simple interactive session or script.
For anything else it's a mess:
- incredibly slow: most interpreted languages are not very fast (with notable exceptions), but non-vectorised R typically runs at 1/1000 the speed of a C program. So in the end 99% of the libraries are coded in C for efficiency reasons!
- very inconsistent: some functions follow the a.function convention, some a_function, some aFunction, some ... (and this in core R as well). Other inconsistencies are present at the syntax and semantics levels
- while it's easy to like being able to write "qplot(bar, baz, data=foo)", this will haunt you in non-trivial programs: the scoping and environment rules are very complicated. You will *not* be happy when debugging any program
- the implementation is not exactly "elegant", and neither is the design of the types/structures involved in the C implementation (it suffices to read the code)
- no way to compile R outside gcc/mingw, especially in Visual Studio (it's C!)
- the data containers available make it impossible to implement some algorithms with optimal computational complexity
- operations you may think are easy to do are not, or may leave data structures in an inconsistent state (adding a row or col to a data.frame?)

The advantages are that the huge majority of stats libraries / implementations of research papers are in R (:") and the ggplot2 + rply + reshape combination allows terse syntax for a number of problems, but that's it...

ggplot2 + rply + reshape?

I don't know anything about R. Could you say a bit more about what ggplot2 + rply + reshape are, or provide pointers? By googling around I found this on ggplot and a related reshape page (but it seems to be talking about melt/cast commands that supplant an existing "reshape" command; are you talking about the reshape command or the reshape package?); nothing about rply.

The ggplot library definitely seems LtU-relevant -- and would interest people resurrecting the STEPS thread.

The Grammar of Graphics [6] presents a comprehensive grammar with which to describe graphics. The key concept is that a graphic is produced by mapping data values to aesthetic attributes of graphical objects (grobs). For example, a scatterplot maps the values of the x and y variables to the position of the points. A bubble plot additionally maps the area of the points to a third variable. Mappings range from simple, e.g. a linear mapping from data value to position on the screen, to complex, e.g. mapping a transformed variable to a colour scale. These mappings are supplemented by guides, such as legends and axes, to complete the graphic. The grammar of graphics concentrates on static graphics. A detailed grammar for interactive and dynamic graphics has yet to be developed, although some efforts have been made in that direction [5].
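To make the "mapping data values to aesthetic attributes" idea in that abstract concrete, here is a small sketch in Python. It is not ggplot2's API: the names `linear_scale` and `scatter_layer`, the pixel ranges, and the sample data are all made up for illustration. It treats a scatterplot as a linear mapping from the x and y variables to pixel positions, and a bubble plot as the same thing plus a third variable mapped to point area.

```python
def linear_scale(values, out_min, out_max):
    """Return a function mapping the range of `values` linearly onto [out_min, out_max]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0          # avoid division by zero for constant data
    return lambda v: out_min + (v - lo) / span * (out_max - out_min)


def scatter_layer(data, x, y, size=None, width=400, height=300):
    """Map data columns to aesthetics: position always, point area optionally."""
    sx = linear_scale([d[x] for d in data], 0, width)
    sy = linear_scale([d[y] for d in data], height, 0)   # screen y grows downward
    sa = linear_scale([d[size] for d in data], 10, 100) if size else None
    points = []
    for d in data:
        point = {"px": sx(d[x]), "py": sy(d[y])}
        if sa:
            point["area"] = sa(d[size])
        points.append(point)
    return points


# Hypothetical data, invented for illustration only.
data = [
    {"gdp": 1.0, "life": 50.0, "pop": 1.0},
    {"gdp": 3.0, "life": 70.0, "pop": 5.0},
]
scatter = scatter_layer(data, x="gdp", y="life")              # scatterplot
bubbles = scatter_layer(data, x="gdp", y="life", size="pop")  # bubble plot
```

The only difference between the two plots is one extra mapping, which is roughly the point the grammar makes: a graphic is described by which variables map to which aesthetics, not by a named chart type.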

info

Sorry, I meant plyr (not rply) :P

Everything has been developed by this guy: http://had.co.nz/

I suggest reading the papers (linked on the respective pages of his website) published in JSS.

These three packages together allow for easy data reorganisation (reshape), divide and conquer, or mapreduce if you like, functionality (plyr) and plotting (ggplot).
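As a rough illustration of the divide-and-conquer (split-apply-combine) idea behind plyr, here is a minimal sketch in plain Python rather than R. The function name `split_apply_combine` and the sample records are hypothetical and are not part of plyr's API; the point is only the shape of the computation: split the rows by a key, apply a function to each group, and combine the results.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(rows, key, apply):
    """Group rows by key(row), run apply on each group, collect the results."""
    groups = defaultdict(list)
    for row in rows:                      # split
        groups[key(row)].append(row)
    # apply + combine: one result per group, keyed by the group label
    return {label: apply(group) for label, group in groups.items()}


# Hypothetical sample data, invented for illustration only.
rows = [
    {"species": "a", "length": 1.0},
    {"species": "a", "length": 3.0},
    {"species": "b", "length": 10.0},
]

mean_length = split_apply_combine(
    rows,
    key=lambda r: r["species"],
    apply=lambda g: mean(r["length"] for r in g),
)
print(mean_length)
```

Because each group is processed independently, the apply step is also the part that could be distributed, which is where the mapreduce comparison comes from.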

Good summary of R's drawbacks

That's exactly what I thought about R, after some limited experience with this language a few years ago. From what I know, it seems that the people who like this language most are those from a biology background. They don't care about the language; they just need a few lines of code (which they can happily copy from tutorials, documentation and examples) to get their job done.

Even though R seems very popular now in the relevant fields, I believe it may fade away eventually, when the fields evolve or there is a better alternative. For example, in bioinformatics, the first language that became popular in the field was Perl, then came R, now Python. Clearly the choice of language is highly correlated with the kind of problems that need to be solved in the field: first a lot of text processing, then a lot of statistics; now that many bioinformatics problems have gone beyond text processing and statistics, Python is starting to get popular. So the fate of R in bioinformatics might be the same as the fate of Perl in this field: relevant, but unlikely to be the first choice (for newcomers, of course).

Everything fades. In the

Everything fades.

In the meantime, what matters is libraries. Stats, Bio, Finance, all have massive amounts of code written in R or easily callable from R. There is a very large R community of users. There are many books teaching the language. There are many classes using R.

I am afraid that it is here for a few more years (it's been around for 15 by now).

Also, I don't think that R and Python are really competing. It's more like R is the academic alternative to Excel, or Matlab.

Indeed

That's what I said as well in my initial comment, and the reason I use it when some library is readily available for some specific task :-)
I'm just disappointed about the "language itself", meaning its core functions + semantics + syntax + implementation.
Excel is awful, and I am not sure whether Matlab is better or worse than R; I'm not proficient enough in Matlab to comment on this...