Social Influences on Language Adoption

This paper quantitatively analyzes why some programming languages succeed and others fail. We analyze several large datasets, including over 200,000 SourceForge projects and multiple surveys of 1,000-13,000 programmers. We observe trends in language popularity and adoption. Popularity follows an exponential curve: the tail accounts for only insignificant development effort and the top few languages succeed across a wide range of domains. Examining adoption, we find that social factors usually outweigh technical ones. In fact, the larger the organization, the more important social factors become. Likewise, developers are willing to adopt new languages, but are heavily shaped by their education. Developers prioritize expressivity over correctness, and perceive static types to be more helpful for the latter than the former. Taken together, our results help explain the process by which languages become adopted or not.

Paper by Leo Meyerovich and Ariel Rabkin, ostensibly a part of their Socio-PLT effort.

Should we be expecting Typed

Should we be expecting Typed Racket to take over the world then?

Amateur bias

This paper relies on data sources that are easy to get, not ones that represent what's being used in production code. The inputs were SourceForge, some online intro to programming course, a general popularity site, and Slashdot. They didn't look at commercial code from Microsoft, IBM, Oracle, DoD, or even Google or Facebook.

Slashdot surveys are part of the humor section. Yet this paper takes them seriously.

John, we were indeed worried

John, we were indeed worried about "garbage in, garbage out" -- this paper is one of the first of its kind (only?) to treat the issue seriously for a quantitative analysis. We looked at a variety of sources and tracked demographics, and based on these, got a representative chunk of the developer population.

* The 200,000 SourceForge projects represent most of the open source from 2000-2010 and come from 200,000+ individuals. Not discussed, the migration to other open source repos only began in bulk in ~2009 (Google Code, GitHub, Bitbucket, etc.). A good question is how open source mirrors closed source: obviously, sometimes it's better and sometimes it isn't. We found evidence that work done on hobby vs. commercial projects is performed differently, but that wasn't our focus here.

* All in all, for the actual developers surveys, we looked at responses from ~20,000 developers. Most of the responses were about recent commercial projects, including companies with > 1000 programmers.

* The online intro course was Berkeley's MOOC, which is one of the first of its kind and thus attracted a lot of attention from practitioners. As a result, the demographics skewed toward programmers with at least one degree and those who are already employed. A fascinating bias occurred in it, but not the one you're focusing on: it had a high international uptake, and the gender ratio is better than in media like tech sites and open source communities.

* The 'slashdot' survey wasn't a nonsensical Slashdot poll but a full survey running alongside our visualizations that enthusiasts were spreading around via Slashdot, Wired, Facebook, Twitter, etc.

I agree that getting the methodology right is tricky. We submitted a position paper to PLATEAU discussing the challenges we encountered -- there wasn't enough space in the methodology section of our paper. Feel free to email me for a draft if you'd be interested in a more serious discussion about it.


Yes, I'd love to read your draft paper on methodology please. Thanks!

The fact that academic papers don't allow enough space for methodology is a real problem, and one of my main professional interests. (I'm a philosopher of science.)

Great work! Have the authors

Great work! Have the authors been in touch with a statistician? If statistical rigor is a goal that's definitely a good idea (it's even easier to make mistakes than in programming, but all errors are silent: there is no code that blows up when you compile/run it).

Tricky business

Stats/machine learning helped more with the early analysis of the SourceForge and Hammer Principle data, where we have a lot of sparse and conflicting data. Stuff like p-values is fun, but having one for a biased sample gives you a lot of confidence about something no one cares about, and if you don't pay attention to the bias, everyone is misled by the seeming statistical rigor. Because of the unreliability of source data, precise notions like average, std dev, std err, etc. are (IMO) useful for broad hypothesis testing, not for pinpointing precise values. Some papers treat these values more seriously, but we tried to be conservative about it.

We've been speaking more with social scientists, e.g., quantitative ones in information science and political science. In practice, what matters more ends up being the wording of a question and its answers, demographics, source of data, etc. It can be subtle! One MSR researcher told me he has a particular day of the week when he gets the most responses from MS employees. Depending on what he's asking, I bet that sways the numbers.

Amen to the programming challenge. We found programming visualizations to be crucial for sanity checking, and if you look closely at the ones I put online, even a bug there can throw everything off :) [I hope to track down at least one bug there while on vacation next week :) ] Peter Norvig cast debugging machine learning algorithms (e.g., a search engine) as one of the next big software challenges that we don't really have a grip on, and I can see why. Here's a tricky question: if the survey analysis throws out malformed responses, does that bias the results to meticulous respondents/programmers? ;-)

Still, be careful how fast

Still, be careful how fast and loose you are with statistics. It's too easy to draw wrong conclusions, especially if you're looking for conclusions in a big data set. I have yet to read it in detail, but for example figure 1 raises some red flags. The caption reads: "Popularity of different languages Language popularity fits an exponential distribution with R2 = 0.95 better than a power law. (SourceForge)". If I see it correctly, the way the figure and conclusion were obtained is by taking the top 100 languages, sorting them by popularity, and then plotting the log number of users versus the array index. Then an exponential was fitted against that figure. That is not the same as fitting an exponential distribution. Even if the data were truly exponentially distributed, you would not see a straight line. To fit an exponential distribution to a data set you need to do something else (the basic method is called "maximum likelihood estimation").

Here is your procedure applied to a truly exponentially distributed data set.

The R command to get this figure is plot(1:1000,sort(rexp(1000,0.001)),log="y"). We first generate 1000 samples from an exponential distribution, sort them, and then log plot them against their index. As you can see that does not produce a straight line. It does however produce something that looks S-shaped like your figure, especially if you consider that you only took the top N, which removes the low tail on the left.
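The same demonstration can be sketched in Python with numpy (an illustrative translation of the R one-liner, not code from the paper): sort samples drawn from a genuine exponential distribution and compare the slope of the log-values in the middle of the rank plot with the slope in the top tail.

```python
import numpy as np

# Draw 1000 samples from a true exponential distribution (rate 0.001,
# i.e. scale 1000), sort them, and take logs -- the same procedure as
# the R command plot(1:1000, sort(rexp(1000, 0.001)), log="y").
rng = np.random.default_rng(0)
samples = np.sort(rng.exponential(scale=1000.0, size=1000))
log_vals = np.log(samples)

# If the rank plot were a straight line on the log scale, the slope
# would be roughly constant everywhere. Instead the top tail is much
# steeper than the middle, which is what produces the S-shape.
mid_slope = (log_vals[550] - log_vals[450]) / 100.0
tail_slope = (log_vals[999] - log_vals[950]) / 49.0
```

Running this shows the tail slope is several times the mid-range slope, so a straight-line fit to the sorted log-values is the wrong model even for perfectly exponential data.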

Besides this, the exponential and power law functions that you fitted do not account for horizontal translation, and you chose an arbitrary zero point (namely at the 100th most popular language). In other words, why did you consider x^1.3 and not (x-2)^1.3 when x=0 has an arbitrarily chosen meaning?

Anyway, my point is that correct statistics is hard. In other experimental fields either a statistician is involved, or the statistics is much more trivial and follows a standard recipe, often limited to a Student's t-test or ANOVA (and even then the scientists have had at least one course in statistics).

Jules, I fit it using a

Jules, I fit it using a library function. If I were going to nitpick on this one, it'd be:

1) You're seeing a library call to the (transformed) regression model. I try not to hand-compute these things to avoid human error, though that's not an ideal regression calc for exponentials.

2) I didn't try many fits beyond these (stretched exponential, etc.) and, to understand the spike near lisp, only tried a few split models (e.g., switching from power to exponential).

3) many projects used multiple languages; what does popularity mean?

For your last point, I've done enough statistics and statistical algorithms to know to limit how much I read into them ;-) As another example, the base of our Hammer rankings uses the Glicko-2 algorithm to rank languages across different dimensions. It computes confidence, which sounds good. However, if you talk to researchers who design ranking algorithms, they're pretty upset about how easy it is to tweak rankings across algorithms. That led to us picking the most 'standard' algorithm we saw for the context, and discussing not "lang X is better than Y by some small Z" but trends across many rankings.

For the '100' issue, that's all we had in the data set. The ones at the tail, e.g., Algol 68, only had 1 project.

We'll be releasing the data once we have time to anonymize it. Sounds like you might find something :)

Ah, but the problem I tried

Ah, but the problem I tried to describe is not that the fit was done incorrectly (I have no doubt that it wasn't). The problem is that in some places the text reads as if you fitted an exponential distribution, while you actually fitted an exponential function to the data sorted in a particular order with a particular choice of x=0. Although they are easy to confuse, and both involve exponentials, they are very different things. Your data looks like it *does* correspond closely to an exponential distribution, however the way to determine this is to do a distribution fit with your favorite method (maximum likelihood being the simplest and most common, Bayesian inference if you want to be fancy).

Maximum likelihood works as follows: you have a distribution that you think your data conforms to, but that distribution is parameterized. For example the exponential distribution is parameterized by its "rate" which is just a real number. Maximum likelihood says that we choose the rate that maximizes the probability of generating your data set. When you have the data of that graph available, I can do such a fit if you like.
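For the exponential distribution, that maximum-likelihood fit even has a closed form. A minimal sketch in Python with numpy (synthetic data; all names here are illustrative):

```python
import numpy as np

# Synthetic data from a known exponential distribution (true rate 0.01,
# i.e. scale 100). With real data, this would be the per-language counts.
rng = np.random.default_rng(42)
data = rng.exponential(scale=100.0, size=1000)

# The exponential log-likelihood n*log(rate) - rate*sum(x) is maximized
# analytically at rate_hat = 1 / mean(x) -- no iterative fitting needed.
rate_hat = 1.0 / data.mean()
```

With 1000 samples, `rate_hat` lands close to the true rate of 0.01; the single free parameter is the "rate" mentioned above.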

Ah, that's what you meant --

Ah, that's what you meant -- thanks for catching that. Yes, I fit an exponential function with 2 degrees of freedom, rather than an exponential distribution, which has 1. The caption was mislabeled, but the formulas and the rest of the text were right, saying 'curve' rather than 'distribution'.

I'll take a look to see if any common distribution in the exponential family actually fits. I'm actually rather skeptical because of the reduced freedom. R and scipy packages can do goodness-of-fit tests (including MLEs) out-of-the-box for most distributions and, if you have more confidence than me, for custom functions. Will see.
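One such out-of-the-box route, sketched in Python assuming scipy is available (synthetic data standing in for the real counts): fit the distribution by MLE with `scipy.stats.expon.fit`, then run a Kolmogorov-Smirnov test against the fitted parameters.

```python
import numpy as np
from scipy import stats

# Synthetic exponential data; with real data this would be the
# per-language project counts from SourceForge.
rng = np.random.default_rng(0)
data = rng.exponential(scale=50.0, size=500)

# MLE fit with the location pinned at zero, so only the scale (1/rate)
# is free -- the exponential distribution's single degree of freedom.
loc, scale = stats.expon.fit(data, floc=0)

# Kolmogorov-Smirnov goodness-of-fit test against the fitted
# distribution. (Caveat: testing against parameters estimated from the
# same data makes the resulting p-value optimistic.)
ks_stat, p_value = stats.kstest(data, 'expon', args=(loc, scale))
```

Since the synthetic data really is exponential, the fitted scale comes out near 50 and the KS test does not reject; on real, biased count data, the caveat in the comment matters.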

(And, thanks again for picking up on that. That made it through several iterations of review!)