The Right Tool

David MacIver is doing a bit of a sociological study on how programmers pick The Right Tool for the job. Programmers select all the languages they know from a fairly mainstream and popular list and then rank those languages according to statements like "I find it easy to write efficient code in this language" and "When I write code in this language I can be very sure it is correct". At the end of the process the survey taker can see how languages ranked overall under each statement and what statements have been most strongly associated with each language.

Obviously this isn't a formal study and, as with all online surveys, there are going to be challenges with selection bias and with people trying to game the system. Nonetheless, it is pretty interesting and fun as is. Perhaps something similar would be worth doing under more controlled circumstances (although it beats me how to feasibly get a large sample size of programmers without introducing selection bias).

selection bias, indeed. If

selection bias, indeed.

If you only know one language, it still makes you fill out the whole survey...

Yeah, there are a large

Yeah, there are a large number of usability issues with the site at the moment.

This one is fairly benign: It's an evidently silly thing to do, and if you do it the results don't actually affect the rankings at all.

i'd like a version of this with only LtUers

if you could run it where it required a valid current LtU account, then i'd be more interested in the results! :-)

Not a bad idea

Ha. Apparently not unjustified. A lot of data oddities have corrected themselves since LtU has joined in. :-)

It's unfortunate that it considers "Fortran" one language.

The differences between Fortran 77 and Fortran 2003 are nearly at the scale of the differences between C and C++, if not greater. The latter is (in its niche) often remarkably concise and powerful, whereas the former tends to be like a weird and pointerless variant of C in both measures with its only power coming from raw simplicity.

Nonetheless, it's an interesting poll.

This is what comes of adding

This is what comes of adding languages I don't know personally. I hadn't realised the difference was so great, sorry!

It's hard for me to split a language once it's already down (I mean, it's easy to hide the existing language and add two new ones, but that loses interesting information as they have to start from scratch. I can't just rename an existing one because that would mean that I'd be changing what people voted for). Do you have a sense of whether most Fortran users would mean one vs the other when just saying "Fortran"? What would be a good set of divisions here?

I don't really have a good

I don't really have a good sense what most Fortran users would mean; partly, I've been out of the community for a few years, but it's also rather different userbases. There's a very large userbase of people working with legacy scientific code from decades ago written in Fortran 77 (often with idioms left over from earlier versions) and things built on that legacy code, and there is also a userbase of people who are using the new stuff for things that are a lot less legacy. I don't know how much the newer versions are creeping into the old legacy codebases these days, nor what the breakdowns are amongst people who are likely to fill out your poll.

In any case, I would think that a good division is "Fortran 77 (and earlier)", and "Fortran 95/2003/2008". It's not ideal, in that there are some notable additions in the 2003 and 2008 versions of the language -- including objects and concurrency support -- but I'd place most of the critical changes between Fortran 77 and Fortran 95. This also leaves out Fortran 90, which is by intention -- it's best thought of as (or forgotten as) a pre-1.0 version of Fortran 95, and nobody uses it now.

I'm also not sure that separating the versions would completely remove the bimodal distribution of opinions. There is a certain sort of programmer who can write horrid code in any language including the best variants of Fortran, and they haven't gone away with the language improvements -- and there's an awful lot of Fortran code written by people in that camp, because a lot of these people are good scientists and engineers and don't have good programming training (or learned from other people who didn't, or learned from people who did but in the 1960s). So you have respondents who've been exposed to Fortran by having to deal with the horrors produced thereby, and you have respondents who like the language and use it for well-written code.

It'll be interesting to see how that falls out, really.

Edit to add: I looked again, and the current responses seem to be a pretty accurate cover of the common features to all the versions of Fortran, so probably some of my concerns were misplaced.

In particular, the juxtaposition of "The thought that I may still be using this language in twenty years time fills me with dread" and "This language is likely to be around for a very long time" is rather entertainingly apt (even if I hope that the first statement primarily refers to older versions, the latter seems true for both).

A few interesting early results

Oh, this is fun. MacIver's blog has a link to some preliminary results, relating common matches between languages and propositions, some of which are interesting. If you haven't provided your answers yet, you may wish to do so before reading this to avoid biases. That said:

  • Apparently Coq is more highly-rated for being good at text processing than for having a strong static type system. Who knew?
  • Python is top-three for both "very large projects" and "very small projects". Note to self: avoid medium-sized Python projects.
  • The most readable languages are apparently Python, Smalltalk, and Ruby. Translation: nobody likes curly-braces.
  • F# was inferred to be more similar to Haskell than to OCaml, which may come as some surprise to users of those languages.
  • Visual Basic was inferred to be most similar to PHP, confirming many suspicions. It was second-most similar to Objective-C, which is somewhat less expected.
  • The languages considered bad for beginners are C++, APL, and Ada, a juxtaposition I find inexplicably amusing.
  • It is apparently unusual for J programmers to discover unfamiliar features, perhaps for the same reason that a fish is unlikely to discover water.
  • Scala is judged likely to have a strong influence on future languages, and also likely to be a passing fad, which seems kind of depressing.
  • C++ is top-three for having a strong static type system. I'm not sure what I could say that would make that funnier than it already is.
  • Languages considered "minimal" include C and Scheme; languages judged "large" include C++ and Common Lisp. There's probably some sort of analogy that can be drawn here that would offend fans of all four.
  • Languages that many people enjoy playing with but wouldn't use for "real code" include OCaml and Haskell. Of those, the former is judged good for terse, efficient, correct, and easily debugged code; the latter is judged good for terse, elegant, correct, concurrent, and reliable code. From this we can infer that "real code" is defined as verbose, sluggish, single-threaded code that is unreliable due to hard-to-find bugs.

Of course, rankings may change dramatically as more answers arrive; I very much look forward to seeing where it goes.

Thanks for the feedback. A

Thanks for the feedback. A couple responses:

  • The Coq thing is a bug. I'll fix it tomorrow. Essentially there's not enough data there to actually budge the position of Coq in any of the rankings, so its entries are basically random.
  • The similarity feature is very rudimentary at the moment. It should be considered to be more amusing than actually accurate. It's really just a silly hack at the moment which turned out to give results that were sufficiently better than I expected that I included it anyway. The rankings are significantly better justified.
  • C++ and J both appear to get weirdly placed all over the shop, I think in both cases because there is a relatively small overlap between serious programmers in these languages and serious programmers in other languages (people who tend to know them well live and breathe them). I hope to investigate further at some point.
  • I feel a bit bad about the "real code" one. I may remove it. I suspect it's too divisive to ever produce useful results.

No worries!

Oh dear, I hope you aren't taking my observations as a serious criticism of your project! Your responses are pretty much in line with what I'd assumed to be the case given the nature of the data you're collecting, especially regarding lack of data for niche languages and strong inter-correlations within the set of languages a given person knows well.

My earlier remarks should be considered, like the "similarity" lists, "more amusing than actually accurate".

I feel a bit bad about the "real code" one. I may remove it. I suspect it's too divisive to ever produce useful results.

I think the intent is probably valid, but the phrasing is unclear, particularly with the scare quotes around "real code". As it stands, I could see justifiably reading "real code" as anywhere from "has strict requirements for behavior/performance" to "acceptably mainstream" to "suitable for being posted on the Daily WTF".

Anyway, quibbling aside, I do think the whole thing is a neat idea--thanks for doing it.

Not worried

Oh, don't worry. I did take it in that light, but some of them were legitimate issues (amusingly the bug with Coq has "gone away" without my fixing it by virtue of a mob of LtU readers descending on the site and providing enough data about Coq to get its results sensible).

A few other ideas for looking at the data

When reading through the data, I found some other interesting things that I might suggest adding to your presentation of the results:

For programming languages, it can be interesting to see what statements it's the worst match for, as well as the list of top matches.

Also, you have a "languages similar to this language" match, and I think it might be quite interesting to see the same sort of list for the statements. In particular, I was struck by the fact that "this is a low-level language" and "this language allows me to write code where I can easily tell what's going on under the hood" have the same list of leaders. Likewise, you could do "this statement and this statement are most opposite".

For that matter, "this language is most unlike these languages" might be fun, though I'm not sure how enlightening it would be.

Your wish is my command, at

Your wish is my command, at least on the ones it's easy for me to add immediately. :-) Languages now have the 10 least applicable statements and 5 most dissimilar languages.

The similar languages feature continues to surprise me with how well it works (it really shouldn't work well). I didn't really expect the most dissimilar ones to make a lot of sense, but they actually look pretty reasonable.

The page for Visual Basic makes me smile a lot.

I'll see what I can do about statement similarities, but right now there's a lot to do on mundane issues like UI, deployment and data dumps, so probably won't be tinkering with it just yet.

Yeah, the one for Visual Basic is pretty amusing!

And that list of "most dissimilar languages" looks like a good list of interesting languages to learn, too.

I also notice that Fortran and Python seem to be at opposite ends of some spectrum, which probably says a lot about why Python is becoming popular as a top-level language for connecting scientific-simulation pieces written in Fortran or Fortran-like C and C++. This evidence would suggest that there's a clear split between them, and Python's good at exactly the things that are missing in Fortran.

Scheme is listed as the language most unlike C++ -- and, yet, not very long ago I was doing some C++ template metaprogramming and found that the natural way of expressing ideas there felt very like an exercise out of "The Little Schemer". As the poll says, "This language is large". (It seems rather more strange that C++ is considered most unlike Smalltalk, which I guess reinforces the point that the poll isn't about language idioms and "orientation" -- which makes it interesting how it nonetheless seems to do a bit of orientation-clustering.)

My original motivation

My original motivation for doing this was actually to try and find what the "natural" axes for describing programming languages were. The Python-to-Fortran thing definitely seems like an instance of that.

So far I've not had a great deal of success (I've tried a few things without massively convincing results), but it definitely looks like there should be something there. Fortunately there seems to be enough other stuff of interest emerging out of this survey that I'm not too worried if I don't find it. :-)

Quite so.

When you posted this, I was just editing my earlier comment to include the things I was noticing about C++'s "opposites", which seem directly relevant to this.

(Edit: I had also put something here about thinking that the opposites-as-a-guide-to-things-to-learn idea scaled beyond just VB, but as I look more, I'm less convinced it does. Ah, well.)

It might be a bit better

It might be a bit better now. I've improved the ranking algorithm, and modified the similarity scores to use ranks rather than scores (if I didn't mention before: they're basically Pearson correlations across the different statements. This is significantly dodgy because the scores for different statements are highly unlikely to be independent, but doesn't seem to be too bad in practice). The ranks are definitely better now, and my impression is that the similarity scores are a bit improved too.
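For anyone curious what "Pearson correlations across the different statements, using ranks" might look like, here is a minimal sketch. It is not the site's actual code: the function names, the statement labels and the rank numbers are all made up for illustration, and it assumes each language is described by its rank position under each statement.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no information
    return cov / sqrt(var_x * var_y)


def similarity(ranks_a, ranks_b):
    """Correlate two languages over the statements both were ranked under.

    Each argument maps a statement to that language's rank position
    (1 = best match). This layout is an assumption, not the site's schema.
    """
    shared = sorted(set(ranks_a) & set(ranks_b))
    return pearson([ranks_a[s] for s in shared],
                   [ranks_b[s] for s in shared])


# Hypothetical rank positions under four invented statements.
python_ranks = {"readable": 1, "low-level": 9, "fast": 7, "terse": 2}
ruby_ranks = {"readable": 2, "low-level": 10, "fast": 8, "terse": 3}
c_ranks = {"readable": 8, "low-level": 1, "fast": 1, "terse": 6}

print(similarity(python_ranks, ruby_ranks))
print(similarity(python_ranks, c_ranks))
```

In this toy data the two scripting languages come out strongly positively correlated and the Python/C pair comes out negative. Sorting the similarities for one language in ascending order would give a "most dissimilar" list.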

For what it's worth (and I'm not sure it's much), the following is a Markov cluster of languages by similarity (with an expansion factor of 5):

Common Lisp Coq D Smalltalk O'Caml Haskell Clojure F# Scheme Standard ML Mozart-Oz Erlang Go Scala
AWK Cobol Haxe Javascript Lua TCL Visual Basic Io Perl R ELisp PHP
Fortran Eiffel C++ C# Ada Delphi C Objective C Java
J Factor APL Forth
Python Ruby Groovy
Assembler
Prolog

There's something interesting going on there, but not a great deal. Back to the drawing board. :-)
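For readers who haven't met Markov clustering, here is a rough pure-Python sketch of the general technique (alternate "expansion" by matrix power with "inflation" by elementwise power, renormalising columns each time). It is only an illustration under my own assumptions: the function names, the default parameters and the toy graph are invented, and it is not the tool or the settings used to produce the clusters above.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(adjacency, expansion=2, inflation=2.0, iterations=50):
    """Cluster a weighted graph given as a square adjacency matrix."""
    n = len(adjacency)
    # Self-loops are a common convention that helps the iteration settle.
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: raise the matrix to the given power (random-walk steps).
        p = m
        for _ in range(expansion - 1):
            p = matmul(p, m)
        # Inflation: elementwise power, which favours strong connections.
        m = normalize_columns([[x ** inflation for x in row] for row in p])
    # Each row that still holds mass is an attractor; the columns it
    # covers are the members of its cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy graph: two triangles with no edges between them.
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_cluster(graph))  # [[0, 1, 2], [3, 4, 5]]
```

With real similarity scores as edge weights the result depends heavily on the expansion and inflation settings, which may be part of why the clusters above are only mildly informative.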

The most widely-used pure functional language

not very long ago I was doing some C++ template metaprogramming and found that the natural way of expressing ideas there felt very like an exercise out of "The Little Schemer".

It's well-known that C++ template metaprogramming is Turing-complete, but it's really a completely different language from C++ itself; much closer to a functional language than anything else. For instance, this article demonstrates simple correspondences between snippets of C++ template metaprograms and some Haskell code.

C++ template metaprogramming doesn't really have anything like Haskell's type system, though, and lacks most built-in features of a functional language. So, maybe something like the pure untyped lambda calculus is a better analogy. And Scheme, of course, is probably the (non-obscure) language that's closest to untyped lambda calculus.

Addendum: Speaking of Haskell... with certain GHC extensions enabled, its type system is also Turing-complete, but is most easily described as a logic programming language. So we just need a metaprogramming system for Prolog that contains a Turing-complete imperative/OOP language to complete the cycle.

Snobol?

There's a category for "This language excels at text processing", but Snobol isn't on the list of languages. That's surprising.

Conveniently

Conveniently, you can now suggest languages and statements that you'd like added to the site. :-)

Interesting Exercise

Flawed or not it will be interesting to see the results and it should provide fuel for many an LtU debate.

To reduce selection bias you also need to advertise this widely; if users of all languages don't have an equal opportunity to respond, the results will be less useful. You've already noted the LtU effect on Coq data :)

Can you provide a summary table of how many and/or % of respondents that knew each language?

Surprisingly widespread

It's been pretty widespread so far - there are near to 100k (98800 and counting) responses from about 6k (6375 at the moment) people. It's got hits from reddit, hacker news, various places on twitter, and then seems to have intermittent outbreaks in different language communities as someone goes "Hey guys, our language isn't very well represented. Let's do something about that". I'm pretty astonished by the uptake.

It's worth noting that the ranking system doesn't know anything about the number of people answering. 1000 or 100 people ranking a language doesn't make a difference if they all rank it in about the same place. There's some smoothing for very small data (Bayesian average with N=5, so we don't start believing info about a language very strongly until there are about 10 or so respondents), but once you're past that the only difference it makes is the quality of the result.
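To make the smoothing idea concrete, here is a small sketch of one common form of Bayesian average: the observed scores are mixed with N "virtual" votes at a prior value. The exact formula the site uses isn't stated here, so the function name, the 0-to-1 score scale and the prior of 0.5 are all assumptions for illustration.

```python
def bayesian_average(scores, prior_mean, prior_weight=5):
    """Shrink the mean of `scores` towards `prior_mean`.

    `prior_weight` plays the role of N: it acts like that many extra
    votes cast at the prior value, so small samples barely move the result.
    """
    total = prior_weight * prior_mean + sum(scores)
    return total / (prior_weight + len(scores))


# Hypothetical case: every respondent scores the language 1.0 on a
# 0-to-1 scale, while the prior sits at 0.5.
for respondents in (0, 1, 5, 10, 100):
    estimate = bayesian_average([1.0] * respondents, prior_mean=0.5)
    print(respondents, round(estimate, 3))
```

With no respondents the estimate is just the prior, with five it sits halfway between the prior and the observed mean, and it only gets close to the observed mean once the respondents clearly outnumber N.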

Here's a breakdown of the number of people reporting known languages (for some of the earlier results - the first thousand respondents or so - these are estimated based on answers, as the UI changed - previously I didn't get people to answer the "what languages do you know" bit up front, but the old UI confused everyone): http://pastebin.com/qW8LzHpm

Edit: Forgot to mention, there are data dumps available here: http://data.hammerprinciple.com/the-right-tool.tar.bz2

They're updated nightly. They don't currently contain the known languages responses though. I should fix that.