The origin of zero-based array indexing

An amusing historical analysis of the origin of zero-based array indexing (hint: C wasn't the first). There's a twist to the story which I won't reveal, so as not to spoil it for you. All in all, it's a nice anecdote, but it seems to me that many of the objections raised in the comments are valid.

Interesting but dubious

This is a well-written, engaging story. But I can't help thinking that he misinterpreted the explanation he was given as "shaving off computation time", when it was in fact "in this context, indexing at 0 gives me a natural, more elegant design than indexing at 1". The quote says: "I can see no sensible reason why the first element of a BCPL array should have subscript one." This is a comment about design and taste, not efficiency.

I find it awkward that we've got a post that eloquently warns us about mythology and neglect of history, and insists on the importance of going back to historical sources, yet exemplifies that process with an extremely debatable interpretation.

The twist in the story is

The twist in the story is the claim that the time at stake was the time it took the code to compile, not run-time efficiency.

Hm

I'm still not convinced that compilation efficiency is actually the major justification for this design choice. I've seen no convincing argument that it may have mattered at the time (would adding 1 to a compile-time constant actually take noticeable time, given the bounded number of array access operations appearing in source programs?), and Martin Richards' text doesn't mention compilation time, but uses a taste/design vocabulary: "Naturally ...", "behaves like", "I can see no sensible reason why...".

I am also not convinced. But

I am also not convinced. But it's an unusual idea, for sure!

Oh, sorry for mostly

Oh, sorry for mostly repeating myself then. I found the link interesting, and the yacht story sure is funny.

That's ok. I didn't want to

That's ok. I didn't want to put my view in the original post, since it would give the story away. I guess the compilation time consideration is not an impossibility (and compilation time efficiency is often something language designers think about!). But the evidence in this case is not convincing enough, I suppose.

Why does it matter which convention is used?

While I like his writing style, I don't see where Mike Hoye gets the idea that anyone spends time justifying zero- vs. one-based indexing styles. I never heard C's zero-based indexing discussed when I was in college, nor in years of industry afterward; I never had a conversation about it in school or at work. But Hoye makes it sound like we're South Pacific islanders, cargo-culting our way along without any logical rationale:

We tell and retell this collection of unsourced, inaccurate stories about the nature of the world without ever doing the research ourselves, and there’s no other word for that but "mythology".

Where are all these folks telling and retelling stories? I get an urge to write an Uncle Remus story about how Brer Rabbit fools Brer Fox by using better indexing. We can change the ending of the Tar Baby episode this way.

Is it productive to discuss? Will it ever change for a language already in use? Analyzing it smacks of a style statement, so folks who do it another way can be dinged for bad taste. (When you approach the pearly gates, Saint Peter asks whether you prefer zero-based or one-based indexes, then studies you carefully keeping one hand on the lever for a trap-door under your feet.)

Mike Hoye's motive seems to be a concern for off-by-one errors, given this part:

How many off-by-one disasters could we have avoided if the “foreach” construct that existed in BCPL had made it into C?

Such off-by-one errors are rare in practice. I never catch anyone making one in review. People forget to free allocations on error paths when they goto a cleanup label, sure, but they don't make off-by-one errors. I'm sure I must have made such an error at least once in a loop, but it was so long ago I don't recall. The bugs we close at work never have the resolution, "Oh, it was an off-by-one error." As a source of problems, indexing dwindles to insignificance next to simple things like wondering: what was this function ever expected to do?

Sometimes I write an assertion that fails, because it turns out it really is possible to target all of a collection, and not just a strict subset. It's the reasoning that's wrong, not the indexing, like the time you find a partial match can actually correspond to all the bytes in a referenced source, not just some of them. (It's a strict subset in the source where you find it, but the whole thing in the destination where you decode it.)

When I work in C I'm usually managing space, and I imagine a picture. I don't talk to myself about the "first" element of an array. There's a block of memory where content is located, and I know its address. I do arithmetic in offsets, rather than counting elements. When it gets really complex, and it becomes necessary to analyze what I'm doing formally, I think in terms of coordinate systems on the number line, and map between them. Sometimes there are five or more coordinate systems in play when doing things like pattern-matching one scatter-gather array against another, where both are subsets in the middle of bigger streams.
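
To make the coordinate-system idea concrete, here's a minimal C sketch (my own illustration, not code from any system described here) that translates an offset in one coordinate system, the logical stream, into another, a (fragment, offset-within-fragment) pair:

```c
#include <stddef.h>
#include <sys/uio.h>  /* struct iovec */

/* Map a zero-based offset within the logical stream covered by an
 * iovec array to the fragment holding that byte, plus the offset
 * within that fragment. Returns 0 on success, -1 if off is past the end. */
static int stream_to_frag(const struct iovec *iov, int iovcnt,
                          size_t off, int *frag, size_t *frag_off)
{
    for (int i = 0; i < iovcnt; i++) {
        if (off < iov[i].iov_len) {
            *frag = i;
            *frag_off = off;   /* origin 0 in both systems: no +1/-1 */
            return 0;
        }
        off -= iov[i].iov_len; /* translate origin to the next fragment */
    }
    return -1;
}
```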

For example, the following picture comes up in TCP deduplication, when packets show up in ethernet frames presented as scatter-gather iovec arrays and you try to figure out which parts correspond to content already stored earlier in a cache — also accessed via scatter-gather for direct comparison, but in pages much bigger than ethernet frames. So this isn't to scale, but the irregularity of iovec sizing shown is accurate. You have offsets in streams, offsets in remaining content not yet encoded, offsets in iovec fragments, offsets in stores, offsets in pages, etc.

                 |<---------------- unmatched --------------->|
     |<-- hit- ->|<-- before -->|<----------- after --------->|
     +--+---+-+--v-+--+---+---+-:+----+---+--+----+---+--+----+
 ... |..|...|.|..ab|cd|efg|hij|kl|mnop|qrs|tu|vwxy|zAB|CD|EFGH| ... (iovec array)
     +--+---+-+----+--+---+---+-:+----+---+--+----+---+--+----+
                                ^
                                fingerprint cutpoint
                            
                        |<-before ->|<------------ after ---$---->|
           +---------+--v------+----:---+---------+---------+
 ......... |ZYXWVUTSR|QPONMdefg|hijklmno|pqrstuvwx|yz#ZYXWVU| ... (store pages)
           +---------+---------+----:---+---------+---------+
                                    ^
                                    fingerprint indexed
match:
left="defghijk" right="lmnopqrstuvwxyz"

Code is hard to understand without such a diagram. It's the one in my mind's eye when I invent the algorithm, and it's what I see when writing code. The reasoning is spatial rather than verbal. The only talking-to-myself that occurs is code comments I write: explanations of edge conditions, consequences in code elsewhere, and micro-proofs of local sub-problems so you can verify them yourself.

A colon (':') above annotates byte 'l' in two places, corresponding to the point in the TCP stream that would be indexed if the content were new, and to the point in the store where it actually was indexed earlier. I won't tell you how this location is picked, but speed and uniqueness are both important, as well as a tendency to resynchronize after content perturbations. To get a maximum-sized hit, matching must occur both backwards and forwards. I think of it as vectors on the number line, where each subspace corresponds to either pointer plus length or offset plus length. Putting the origin at zero is less complex than putting it at one when I think of things as coordinate-system translations.

Here's an interesting tidbit: the C++ version of the scatter-gather comparison is faster than the C version, despite the fact that I wrote the C version to be identical, as a reference for the assembler guy to hand-tune. Comparing them side by side gives you roughly the same variables and memory references, but the C++ optimizes better, possibly because inline methods give better clues about the scope and lifetime of expressions. (When I told folks this, they said, "You're lying." So I said: compare them yourself.) Given two iovec arrays with no alignment whatsoever, write an iterator that normalizes iovec breaks; the next pair of iovecs returned ends wherever a break occurs next in either. Now it's easy.
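
To give a flavor of that iterator (a hedged sketch of the idea, written under my own assumptions about the types involved, not the code being described), something like this walks two unaligned iovec arrays in lockstep:

```c
#include <stddef.h>
#include <sys/uio.h>

/* Walk two unaligned iovec arrays in lockstep, yielding the largest
 * chunk that is contiguous in both; a break in either ends the chunk. */
struct dual_iter {
    const struct iovec *a, *b;
    int an, bn;          /* fragments remaining in each array */
    size_t ao, bo;       /* offset into the current fragment of each */
};

/* Returns the chunk length, with *pa/*pb pointing at the two runs;
 * returns 0 when either side is exhausted. */
static size_t dual_next(struct dual_iter *it, const char **pa, const char **pb)
{
    /* Skip fragments that are fully consumed. */
    while (it->an && it->ao == it->a->iov_len) { it->a++; it->an--; it->ao = 0; }
    while (it->bn && it->bo == it->b->iov_len) { it->b++; it->bn--; it->bo = 0; }
    if (!it->an || !it->bn)
        return 0;

    size_t ra = it->a->iov_len - it->ao;   /* bytes left before a's break */
    size_t rb = it->b->iov_len - it->bo;   /* bytes left before b's break */
    size_t n  = ra < rb ? ra : rb;         /* next break in either array */

    *pa = (const char *)it->a->iov_base + it->ao;
    *pb = (const char *)it->b->iov_base + it->bo;
    it->ao += n;
    it->bo += n;
    return n;
}
```

Comparison then reduces to a memcmp over each yielded pair, which is presumably the shape that gives the optimizer such clean scope and lifetime information.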

Let me know if it's not entertaining when I include more detail like this than necessary just to comment on zero-based indexing. I'll use this example in any further comments I write about zero vs one-based indexing.

Interesting but dubious +1

[Speaking as somebody who was programming in BCPL in the mid-1970's, as an undergraduate ...]
The article quotes Martin Richards with a perfectly sensible explanation that has nothing to do with efficiency; and then ignores him.
When the article says "BCPL ... doesn’t support pointer arithmetic, much less data structures", it is just plain WRONG. The Richards quote about `!` as both monadic and dyadic is exactly pointer arithmetic. If p is a pointer, then `!p` is its referent; and if p points to a vector of machine words (which might be a data structure, which might be an array, or might be a list or a tree implemented with pointers), then `p+5` is a pointer to offset 5, and `p!5` and/or `5!p` is shorthand for `!(p+5)`. So it is absolutely vital that `p!0` is equivalent to `!(p+0)` is equivalent to `!p`; otherwise we'll get illegal memory access.
There's no need for those offsets to be literals. This is perfectly cromulent: `FOR i = 0 TO 5 { p!i := i }`, equivalent to `FOR i = p TO (p+5) { !i := (i - p) }`. So the argument for efficient compilation is also nonsense. (That said, BCPL does try to evaluate expressions at compile time, so using literals can improve efficiency.)
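
For readers who don't speak BCPL: the same identities survive, syntactically disguised, in C, where indexing is defined as pointer arithmetic. A minimal sketch of the correspondence:

```c
#include <assert.h>

int main(void)
{
    int v[6] = {0};
    int *p = v;

    /* a[i] is defined as *(a + i), so a[0] must be *a, just as
     * BCPL's p!0 must be !(p+0), i.e. !p. */
    assert(&p[0] == p);
    assert(p[5] == *(p + 5));

    /* The offset need not be a literal, mirroring p!i := i. */
    for (int i = 0; i <= 5; i++)
        p[i] = i;
    assert(v[3] == 3);
    return 0;
}
```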
Correct that BCPL didn't have malloc(), but it did (does) have library routines that achieve the same, and all rely on this pointer arithmetic.
I very much doubt that BCPL was where this started. True that Algol 60's arrays are 1-based, and that BCPL is one of the diaspora of Algol-like languages; but BCPL is much more aimed at being a machine-independent machine-level language than at being a high-level language. And to be machine-independent, it tries to make minimal assumptions about the architecture of the machine. This is expressed in BCPL's intermediate abstract machine language INTCODE. (I believe BCPL pioneered this approach for bootstrapping/porting the compiler to other machines: compile the compiler to INTCODE on the source machine; hand-craft an interpreter for INTCODE on the target machine; lo! you have a compiler on the target.)
There are typically machine instructions for chasing pointers (monadic `!`); but there might be machines with no dyadic indexing. So INTCODE implements dyadic indexing as: pushdown arg1; pushdown arg2; add; fetch-at-index.
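
A rough reconstruction of what that lowering amounts to (my own sketch in C of the idea, not actual INTCODE):

```c
#include <assert.h>

/* Dyadic indexing p!i lowered to a stack machine, roughly:
 *   push p; push i; add; load-indirect
 * Sketched here as a C evaluator over a word-addressed memory. */
typedef unsigned long word;

static word stack[64];
static int  sp;                 /* next free stack slot */
static word mem[1 << 16];       /* the abstract machine's memory */

static void push(word w)  { stack[sp++] = w; }
static word pop(void)     { return stack[--sp]; }

static void op_add(void)  { word b = pop(), a = pop(); push(a + b); }
static void op_load(void) { push(mem[pop()]); }  /* monadic ! */

/* p!i == !(p + i): push both, add, fetch. No dyadic-indexing
 * instruction is required of the target machine. */
static word index_dyadic(word p, word i)
{
    push(p); push(i); op_add(); op_load();
    return pop();
}

int main(void)
{
    mem[100] = 42;                      /* a 1-word vector at address 100 */
    assert(index_dyadic(100, 0) == 42); /* p!0 == !p */
    return 0;
}
```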

Thanks. These are excellent

Thanks. These are excellent points.

Algol 60 allowed arbitrary lower bounds

What do you mean by "True that Algol 60's arrays are 1-based ..."? Here's the report: http://www.softwarepreservation.org/projects/ALGOL/report/Algol60_report_CACM_1960_June.pdf. Section 5.2 (Array Declarations) gives these examples:

array a, b, c[7:n,2:m], s[-2:10]

own integer array A[if c<0 then 2 else 1:20]

real array q[-7:-1]

dubious +1 +1

(Thanks Ehud, little did I think I'd ever need to drag my BCPL coding skills out of the archives ;-)

Reading Hoye's piece more closely, I can see no evidence whatever for his main claim: "So: the technical reason we started counting arrays at zero is that in the mid-1960′s, you could shave a few cycles off of a program’s compilation time on an IBM 7094."

Yes, the anecdotes support that shaving cycles in general was vital.

But there's no explanation (that I can see) of why compilation time was more important than execution time. Neither is there an explanation of why 0-base improves compilation time. (Well, OK, compilers have markedly different characteristics from data-processing programs. They have a lot of small data structures linked by pointers to represent the AST. But Hoye doesn't discuss that. If your AST nodes need 5 words, you still have to add an offset to the pointer.)

Something else Hoye doesn't mention: strings in BCPL (of which you'd have plenty in a compiler, for the name table) are held differently from C's null-terminated strings. A 5-char string is held as a 6-word vector, with offset 0 holding the length (`p!0` === `!(p+0)` === `!p`). So for name lookup you can easily detect that two strings are unequal whenever they have unequal lengths. (If your words are wide enough, you can pack multiple chars into a word, but you still have offset 0 holding the length. You might restrict the length of names to 256 (8 bits, same as a char), and pack the length and the first few chars into the 0 offset. Still you can detect inequality by comparing at offset 0. Given that in the early 1960's everything was upper case, you could restrict your charset to 64 symbols, restrict the length to 64, and pack SIXBIT chars 6-to-a-word on a 36-bit 7090, or on the 36-bit DEC System 10 on which I first programmed BCPL. You could pack tighter with SQUOZE format.)
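
For a flavor of what the length-at-offset-0 representation buys, here is a minimal C sketch (mine, not BCPL): the inequality fast path reads only slot 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A BCPL-style string: slot 0 holds the length, characters follow.
 * (Real BCPL packs several characters per machine word; one char per
 * slot keeps the sketch simple.) */
typedef struct {
    size_t len;      /* offset 0: the length, as in p!0 */
    char   chars[];  /* offsets 1..len */
} LenString;

static LenString *lenstr_from(const char *s)
{
    size_t n = strlen(s);
    LenString *p = malloc(sizeof *p + n);  /* null-check omitted in sketch */
    p->len = n;
    memcpy(p->chars, s, n);
    return p;
}

/* Unequal lengths prove inequality after reading slot 0 alone. */
static bool lenstr_equal(const LenString *a, const LenString *b)
{
    if (a->len != b->len)  /* the one-word fast path */
        return false;
    return memcmp(a->chars, b->chars, a->len) == 0;
}

int main(void)
{
    LenString *a = lenstr_from("vec"), *b = lenstr_from("vector");
    assert(!lenstr_equal(a, b));  /* decided by lengths, no char compare */
    free(a); free(b);
    return 0;
}
```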

Any of this would have been interesting historical detail to (possibly) substantiate the claim. Richards could have confirmed it. And contra Hoye's peeving, it's all still in the public domain. (Strewth! There's even a scanned copy of my varsity's 1978 BCPL reference manual on the web.)

But actually: BCPL is only relevant to the story if it was the main production language on the IBM 7090 at MIT. I would guess that in the early 1960's, BCPL was a research project, and the production languages would be FORTRAN and assembler. 'Citation Needed' indeed!

I thought BCPL was relevant

I thought BCPL was relevant as a predecessor of C.

And I used to think that the

And I used to think that the convention of zero-based indexing had something to do with mathematical induction: first you prove a theorem for the case n=0; then, given n, you prove it for n+1.

But well, when a person counts on their fingers / 'enumerates' stuff, the convention is to start from 1.
In my toy language the convention is to index arrays starting with 1.

Maybe there's some experience bearing on this question

Two distinct questions here. One is what really happened historically. I agree with AntC that Hoye doesn't make a case for his claim - it's a straw man, and contradicted by what Richards says. And Hoye clearly cares more about ax-grinding and ridicule than the truth, so his claims aren't that interesting.

The other question is what to advise a programmer or language designer on what to do now.

I'd say that if you're optimizing for familiarity and for avoiding the need to learn new ways to do things, and your users are familiar with 1-based, then there may be a case for 1-based. Some people I respect advocate for it, so I can't dismiss it. If on the other hand you're optimizing for cleaner, clearer, simpler programs, and trying to help people to learn how to think, then in principle it's an empirical question - the experiment has been performed.

I've done a lot of programming both ways, and I know that with zero-based indexing there's a lot less fussing with adding and subtracting ones when calculating subscript ranges, lengths, counts, and so on. This may be hard to convey to inexperienced programmers or to people who only have experience with one of the two regimes. The clutter issue has to do with the 'user interface' - how easy it is to reason about programs and get them right. Efficiency is obviously a red herring.
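
A small illustration of the kind of fussing I mean (my own example, not drawn from any particular codebase): the same bookkeeping written against zero-based half-open ranges and one-based closed ones.

```c
#include <assert.h>
#include <stddef.h>

int main(void)
{
    size_t ncols = 4;

    /* Zero-based, half-open [lo, hi): lengths and splits need no fixups. */
    size_t lo = 2, hi = 7;
    size_t len = hi - lo;                /* element count, no +1 */
    size_t mid = lo + len / 2;           /* [lo, mid) and [mid, hi) tile exactly */
    size_t flat = 3 * ncols + 2;         /* row 3, column 2, flattened */

    /* One-based, closed [lo1, hi1] over the same elements: the same
     * quantities sprout +1/-1 correction terms. */
    size_t lo1 = 3, hi1 = 7;
    size_t len1 = hi1 - lo1 + 1;
    size_t flat1 = (3 - 1) * ncols + 2;  /* row 3, column 2, one-based */

    assert(len == 5 && len1 == 5);
    assert(flat == 14 && flat1 == 10);
    (void)mid;
    return 0;
}
```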

I think Graham, Knuth, and Patashnik make the point well in their book Concrete Mathematics. If you compare their 0-based formulas to the conventional 1-based ones you'll see there's an awful lot less clutter in discrete math when you write it with 0-based indexing. You're able to think about them more easily. I don't think I'm giving an opinion when I say this.

(Clearly the advantage was not big enough to convince anyone else to adopt it for discrete math, but history is full of situations like this, e.g. the Dvorak keyboard.)

Someone who cares about the right answer, and not just baiting, might want to poll programmers who have significant experience doing things both ways. Or one could read a matched sampling of programs written under the two regimes. One rough measure would be to count the number of times you find + 1 or - 1 applied to indexes (other than to step an index in a loop). Each additional bit of fuss is an opportunity to make a mistake, or (equivalently) an obstacle to comprehension - I know that isn't universally true, but I think it is true in this situation.

I don't think it's a coincidence that 0-based indexing was both simpler in the BCPL implementation and has turned out to be better for programming, but I'm ok with calling that an opinion.

Natural numbers

All this seems closely related to, though on a more pragmatic note than, the question of whether zero is a natural number. From what I understand, traditionally zero was perceived as one of those strange, unnatural creatures, like negatives, fractions, irrationals; probably part of the "all else" in "God made the natural numbers; all else is the work of man". Zero-as-natural seems to be spreading amongst mathematicians based on greater emphasis on nice algebraic properties, but afaik there's still a strong subcommunity in mathematics who follow the traditional, probably-philosophy-based exclusion of zero. I suspect the computer-science preference for natural zero comes from our emphasis on finding clean ways to do things, hence practical properties of algebras. (Myself, I've long since adopted the practice of referring to "ℕ" as "nonnegative", and avoiding the term "natural" altogether.)

Hu?

afaik there's still a strong subcommunity in mathematics who follow the traditional, probably-philosophy-based exclusion of zero

Do you have a pointer? I've never heard of anyone doing that in modern times.

Until well into grad school

Until well into grad school I'd only encountered the unnatural-zero position in second-hand descriptions. Then, in the late 1990s I think — about the time I'd decided unnatural-zero could be relegated to an historical footnote — I met a mathematician who took it for granted that only ignorance of mathematics would cause anyone to call zero a natural number; evidently they'd been ensconced in a population who all considered zero unnatural.

Doing a google for "natural number", the first match I get is the Wikipedia article, which notes there isn't general agreement about zero; the second is Wolfram, which notes there isn't general agreement about zero; and the third is mathwords, which states unequivocally the natural numbers start with 1.

Number theory

It's common in number theory to take N = {1, 2, 3, ... }.

Linguistics and social structure of natural numbers

I looked around on the web and you certainly seem right: that 0 is a natural number is not a widely accepted convention. The mathwords link you give opposes "natural number" (starting at 1) to "whole number" (starting at 0), but even "whole number" is used inconsistently to mean "starting at 1", "starting at 0", or Z.

I was taught mathematics in France, and in French there is absolutely no possible ambiguity. The symbol N denotes the "entiers naturels", which start at 0 (interestingly, we idiomatically use "entier", which literally means "whole", rather than "nombre" here; "entier relatif" is Z, and "entier" alone may be either depending on the context, but is opposed to anything with a non-empty decimal part). French also consistently defaults to non-strict inequalities, in opposition to English usage: "positif" means non-negative, "croissant" means non-decreasing, etc.

France is a centralized state, and French mathematics is also a centralized knowledge domain. Most teachers will look to Nicolas Bourbaki's texts for terminology advice (while accepting some slight variations/modernizations; interestingly, I was told Bourbaki's influence could be distinctly felt as far away as Cambridge (UK)), and there is a national, official, centralized programme that all schools teach. The French state dictates that natural numbers start at 0, and nobody would ever say otherwise -- what would be the point?

I'm not necessarily a partisan of strong centralization in all situations, but it is amusing to contemplate the confusions that may occur among English-speaking mathematicians. That said, I think you should fix your metric system(s) first.

(I think there is a consensus to attribute the axiomatization of natural numbers to Giuseppe Peano. Both the English and the French Wikipedia articles on the Peano axioms include 0 in the set of natural numbers. Quite interestingly, I just learned that, as can be checked in the original Latin edition of Peano's text, Peano in fact gave his axioms starting at 1.)

I perceive serious pursuit

I perceive serious pursuit of mathematics to have a lot in common with linguistics, in that every mathematician has their own terminology and, to read the works of many varied mathematicians, one has to make a careful study of what they actually mean by the words they use. I remember reading several pages into a graph theory paper and still not being sure whether or not they intended "graph" to allow multiple edges, or loops; eventually I was able to deduce what they meant by painstakingly working out what they had to be assuming in order for their proofs to work out correctly. Iirc, Morris Kline mentioned the linguistic difficulty in his Mathematical Thought From Ancient to Modern Times: modern mathematics involves lots of definitions, whereas algebra made great leaps forward once a standard compact notation was adopted (much of it around Euler's generation); Kline, as I recall, described a treatise from some centuries earlier in which a system of (third? or was it higher than that?) order equations is solved... entirely in Latin prose.

Re units of measure, my father noted of my college physics textbooks that several of the metric units were different from those he'd learned in school. I figure the eccentricities of the "traditional English units" of measure are just something that accumulates in any system over time; give the new system a few centuries and it'll be a mess too. (I used to boggle at defining 1 horsepower as 550 foot-pounds per second, till I realized that's just what happens when you put it in foot-pounds per second; 1 horsepower is 1 rod-ton per minute.)

In addition to noting that

In addition to noting that Wikipedia is wrong, would you please fix it?!

I don't think wikipedia is

I don't think Wikipedia is wrong, only that scientific conventions evolved and what today's mathematicians mean by "Peano Axioms" starts at 0 (just as the original sentences that described what is now known as "Euclidean Geometry" were nowhere close to our current description of it). An added note about starting-at-1 may be of historical interest, though.

Ah. Right. A historical

Ah. Right.

A historical note WOULD be interesting, yes.

Personal recollection

I recall usually having been taught that N = {1,2,3,...}, although it was apparent to me by the time I finished high school that both conventions were common. To this day I automatically look for evidence of the author's intent whenever I encounter something like "take x in N..."

It is surprisingly common

in the philosophy of mathematics and in other fields like cognitive linguistics. For example, the book "Where Mathematics Comes From" by Lakoff and Núñez was published in 2000. Mathematical reviewers had some issues with the book, one being that it defined the natural numbers as the positive integers.

Are arrays even a good idea?

It seems to me the best answer to "should this collection be indexed from 0 or 1?" is usually "no". Requiring a correspondence of indices with natural numbers should IMO be rare. (Though, when it is required, I agree that zero should be included as a natural number.)

This thread wasn't really

This thread wasn't really about which is better, but about the history of the choice. Not the same thing. But I admit that reading the discussion I also thought about loop-less programming and such, where arrays are treated as wholes. Maybe the same kind of conclusion should be reached from the debates about typing regimes.

Of course, Dijkstra already knew this

See: EWD831

Actually, 1982 is not that early, but I thought I'd mention it.

All compile time?

I find this hard to swallow.

Imagine indexing into an array using a user-supplied number.


array[⎕] ⍝ I thought this discussion of indexing
⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝⍝ could use a little APL :p

0 and 1

The history of choosing 0 over 1 doesn't answer the question (it does lead to more questions, though, as this and the original thread show), but that's the implicit point: things are as they are even if we don't *really* know how they came to be. Compile-time shaving or binary preference, it is still 0 and 1.

Kinds of cause

There are a lot of good comments here -- however, I'd like to propose some distinctions.

The article searches for a cause of using 1-based or 0-based indexing. When asking about a cause, we can follow (at least) three distinct ways of reasoning, and these often get mixed up, leading to a lot of confusion:

  • practical -- both types of indexing are part of a coherent pattern of dealing with cardinality, indexing, and loop conditions. A pattern is a practical tool for handling some problematic situation, and we choose them (both the problems and the solutions) anew on each application.
  • logical -- we try to derive a method from principles, and conclude it is the right method because it can be derived most convincingly. This is the realm of the "natural" solution and the "obviously wrong" approach.
  • historical -- how did we come to do things a given way? There is a specific twist here, since we can't judge matters in hindsight. Proper historical reasoning needs to be rooted in documents from the time in question and before. It is pointless to imagine the ancient Greeks -- we are bound to let the ancient Greeks speak for themselves.

Now the surprising observation is that these different kinds of causation are only loosely related to each other. The "wrong" solution can be more practical, and a document might reveal a vision and intention which was completely lost while its offspring survives.

And there is a strange prevalence of the logical mode of causation. Logical reasoning tends to usurp and twist relations after-the-fact. It is very common to find a retrofitted logical justification for practical actions, or to find history overshadowed by how it needs to be.

Interesting

Thanks for your interesting comment. My personal observation is that when you don't know how to assess the "practical" (for example because you want practicality for use cases that do not exist yet, but will appear in the future), "logical" is a pretty good proxy for it. I've been basing a lot of my API-design decisions on the "logical", on the explicit conjecture that it implies "practical" in the long term.