The Genuine Sieve of Eratosthenes

Melissa E. O’Neill, The Genuine Sieve of Eratosthenes.

A much beloved and widely used example showing the elegance and simplicity of lazy functional programming represents itself as "The Sieve of Eratosthenes." This paper shows that this example is not the sieve and presents an implementation that actually is.

Starting with the classic one-liner sieve (p : xs) = p : sieve [x | x <- xs, x ‘mod‘ p > 0] O'Neill proceeds to show why this standard rendition of the Sieve of Eratosthenes does not in fact "cross-off" the multiples of each prime in the same way the "real" Sieve does.

She notes that "Some readers may feel that despite all of these concerns, the earlier algorithm is somehow “morally” the Sieve of Eratosthenes. I would argue, however, that they are confusing a mathematical abstraction drawn from the Sieve of Eratosthenes with the actual algorithm. The algorithmic details, such as how you remove all the multiples of 17, matter."

A fun read.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

In one word...

Breathtaking. If they don't accept this paper I'll personally get on a plane and hit someone on his head ;-).

I also liked she asked Bird, Meertens and Peyton Jones for comments. It somehow exposes that naive 70's FP feeling (apart from being a quality paper, I really liked the performance analysis).

[that might have been 80's, I am not sure]

[As a side note: what I also liked is that she discusses that FP programmers prefer lists for a lot of operations, whereas a lot of algorithms are better implemented using other data structures. Personally, I always felt that 'the' data-structure for FP should not be the list, but the 'map'.]

Very interesting. Thanks!

Very interesting. Thanks!

Off topic?

How is this not off topic, Ehud? ;-)

Not that I'm complaining about the choice of topic. I read this paper a few months ago, and it is definitely worth reading. The fact that the classic example cited by the paper is not actually the Sieve of Eratosthenes is so blindingly obvious in retrospect that it is really quite embarrassing that functional programmers have let it slide for so long. Kudos to Melissa O'Neill!

Myself, I never really picked up on the purported "sieve" as particularly interesting, and never thought about it deeply. What really stood out (in my mind) from my early days of FP was fold-left, fold-right, and the corecursive definition of the Fibonacci sequence.

I know I did implement the genuine sieve in Haskell years ago, but I used mutable monadic arrays. I might as well have been writing in an imperative language. I know I had seen the purported "sieve", but I don't recall why I didn't use that. Maybe it didn't come to mind, maybe I had tried it and found it too slow, or maybe I thought it was a silly implementation from the outset.

I don't want to start a big meta-discussion about LtU in this particular thread, but I've been troubled by the militancy of "off topic" accusations lately. However, as this seems to me to be OT by your own standards, it seems worth pointing out.

Don't get me wrong, some of the accusations were justified. I appreciate having reasonably intelligent discussions a few levels above the tripe on Slashdot, and I realize that a certain amount this behavior is probably necessary to maintain this.

I personally feel the "off topic" standards should be relaxed a bit; as long as the post is of reasonable quality and likely to be of interest to a significant fraction of people also interested in programming languages in general, (such as backreferences and deterministic automata), then it should not be discouraged.

Can't speak for Ehud...

...but in my own barely on-topic foray into algorithms I was trying to ask the question of whether the language that is used to express algorithms can be truly PL or paradigm agnostic. Personally, I find the particulars of the Sieve to be of less interest than the fact that the language used to express the algorithm is intrinsically tied to the PL that implements it. The only formal expression of the algorithm given by the author is modified Haskell code that supposedly fits the informal description of the problem given in the natural language description. Which to me states both the importance and the failure of programming languages in expressing algorithms.

As a side question, are the mutable monadic arrays in Haskell capable of growing dynamically? I'd conjecture that the expression of the Sieve in many non-functional languages (using fixed size arrays) are not truly expressing the fundamental algorithm which has no stated upper bounds. This was one of the advantages that has been given for the discredited Haskell solution.

On the meta-level of discussion, I don't think questioning whether a particular response is off-topic is necessarily a conversation killer. In most of those posts that are borderline are implicit calls to the poster to justify why they think it should be considered as relevant to discussions of PLs. This is not to say that there is not a lot of gray area... it is to say that it is encumbent upon the person to try and shed light on the role that PLs play in the discussion of any particular item.

The classical sieve

Haha, personally, I find the particulars of the sieve far more interesting than many (most?) PL discussions, which I find tend to be rather boring and dry without real problems and compelling examples to motivate them. Or maybe they are just too metamathematical for me to easily grasp.

I'd conjecture that the expression of the Sieve in many non-functional languages (using fixed size arrays) are not truly expressing the fundamental algorithm which has no stated upper bounds.

Oh, but you are absolutely wrong. The classical sieve has no inherent upper bound, but you must pick one before you start. Once you pick that bound, it can't really be increased inside the realm of Eratosthenes' algorithm.

Without too much difficulty, one can extend the genuine algorithm to allow that bound to be increased, in the "obvious" and straightforward way. To restart the sieve, you take each prime p in your sieve and compute old_bound + ((-old_bound) `mod` p) and then cross out every pth number after that. Note that the ancients could not have performed this efficiently, as they lacked the long division algorithm necessary to compute the appropriate multiple of each prime.

However, this method of continuation results in a significant point of departure from the original sieve, and changes the time complexity. In the worst case, where you increase the upper bound by a single number each time, this extension degenerates into trial division. Perhaps this is why the impostor "sieve" went undetected for so long.

Well, there is a potential

Well, there is a potential variant that would allow increasing the bound without involving long division (at the cost of increasing the storage requirements) and thus would have been possible for "the ancients". For each prime, you record the last number that you crossed off. Then, when you want to increase the upper bound, you simply go through the list and see if the next one is in range or not.

Yup.

Yup. The paper basically presents a slightly more clever implementation of that exact idea, using a priority queue instead of a list, and storing the next number to be crossed off for each prime, instead of the last number that was crossed off. Then one can dispense with the classical algorithm entirely, if one so chooses, or attempt a hybrid approach.

Whether or not this is the "geniune" sieve is a matter of semantics, as it's definitely not the classical algorithm, but the overall effect is the same. You get the same time complexity, and not that of trial division. Notably, you relax the need for an upper bound, at the cost of increased storage requirements. :-)

Good company

Oh, but you are absolutely wrong.

Probably so, but since a lot of smart Haskell programmers were also wrong about their Sieve, I don't feel so bad. :-)

The classical sieve has no inherent upper bound, but you must pick one before you start.

Backtracking, the algorithm does imply binding a value N to the function before starting out, so that the returned list is a finite set. Out of curiosity, I'm wondering if there is an extant version of the algorithm that survived the burning of the library at Alexandria? Just wondering what the ancient PL looked like? If we take wikipedia as authoratative, we have:

  1. Create a contiguous list of numbers from two to some highest number n.
  2. Strike out from the list all multiples of two (4, 6, 8 etc.).
  3. The list's next number that has not been struck out is a prime number.
  4. Strike out from the list all multiples of the number you identified in the previous step.
  5. Repeat steps 3 and 4 until you reach a number that is greater than the square root of n (the highest number in the list).
  6. All the remaining numbers in the list are prime.
The spurious Haskell implementation is:
primes = sieve [2..]
sieve (p : xs) = p : sieve [x | x  0]
which doesn't look at all the same to any casual observer - and the corrected version doesn't look much better. The underlying premise of the paper is that we are trying to express an algorithm given in a non-formal non-PL form and express that in Haskell (or any other PL for that matter). We can't determine the algorithmic equivalence based purely on input and output, but must factor in cost as well (am reminded of the aspersion that Lisp programmers know the value of everything but the cost of nothing). The paper assumes that if the input and output of a function are equivalent and the time complexity involved is the same, then the algorithm might be the same. Of course, one can quibble on what a "list of numbers" is (the chosen data structure), and what it means to "strike out" a number (change its state). We end up recognizing that algorithms expressed in PLs are not the same thing as we are modeling. Though there is less intuition involved, we still can't formally prove that our implementation formally captures the essence of the algorithm (Haskell functions are not the same thing as functions in the mathematical sense) - though we can prove two algorithms as non-equivalent by observing their time complexity.

Haha, personally, I find the particulars of the sieve far more interesting than many (most?) PL discussions, which I find tend to be rather boring and dry without real problems and compelling examples to motivate them. Or maybe they are just too metamathematical for me to easily grasp.

In all the solutions that I develop on a day-to-day basis, there is the problem of taking incomplete (and sometime incoherent) specifications and expressing them in a PL. I agree that this can be interesting when applied to particular problems but there has to be some reflection on how it echoes to the general notion of PLs in general.

Well put

These are exactly the kind of considerations I was trying to get at in the case of quicksort (below).

Been wrestling with it

I was trying to make a similar point with MergeSort on my algorithms post where I was trying to point out that CLRS and Knuth explicitly define the algorithms with state variables. A much simpler example would be Euclid's GCD. TAOCP, which for some is the bible of algorithms, lays it out as:

Given two positive integers m and n, find the greatest common divisor, i.e., the largest positive integer which evenly divides both m and n.
  • E1. [Find remainder] Divide m by n and let r be the remainder. (We will have 0 <= r < n).
  • E2. [Is it Zero?] If r = 0 the algorithm terminates; n is the answer.
  • E3. [Interchange.] Set m <- n, n <- r, and go back to step E1.
The nearest I got to a literal translation in ML was:

fun gcd(m, n) =
   let
      val m = ref m
      val n = ref n
      val r = ref 0
   in
      while
         (
            r := !m mod !n;      (* E1 *)
            !r <> 0              (* E2 *)
         ) do (
            m := !n;             (* E3 *)
            n := !r
         );
      !n
   end

But that looks somewhat ugly compared to the recursive solution of:

fun gcd (m, 0) = m
  | gcd (m, n) = gcd(n, m mod n)

The question would be (1) whether these are the same algorithms and how would determine that?; and (2) whether we can express the algorithm using the Knuth type of language that would describe what both implementations have in common?

(1) whether these are the

(1) whether these are the same algorithms and how would determine that?

This paper discusses exactly this type of question.

(1) whether these are the

(1) whether these are the same algorithms and how would determine that?

No, but only because your first implementation has a bug. :-) Namely, the first gcd(m,0) causes a divide by zero, and the second one doesn't.

Upon re-reading Knuth's assumptions, he does specify positive integers, so it's not entirely fair to label it a "bug". This probably isn't a good assumption, but it does avoid discussion of gcd(0,0), which probably ought to return an error.

But, modulo that quibble, I daresay yes. In fact, many compilers would take the latter implementation and produce object code that would be similar to Knuth's specification in the same sense that your original code is "similar" to Knuth's.

The compiler would likely create a temporary variable (let's call it "r", pun intended) to store m mod n. As the call to gcd is a tail call, (n,r) will get passed in the same location as (m,n). The compiler shouldn't use further temporaries to set this up. At that point, the call is just a simple goto.

Ironically, your original implementation would probably result in a less direct rendering of Knuth's specification in object code. Your SML compiler might well store the references on the heap and refer to them with a pointer, especially SML/NJ as it supports call/cc.

I'm not sure I understand what question (2) is trying to ask.

Arrays are harder

The question would be (1) whether these are the same algorithms and how would determine that?

What happens to work well in this case is to translate Knuth's original description into set of tail-calling functions E1, E2 and E3 passing m, n and r between them — we know what is the Ultimate, after all. Then it's just a matter of simplification before arriving at something very close to the "functional" solution.

Now, most interesting imperative algorithms deal with mutable arrays, and a mere SSA conversion is clearly insufficient for turning Hoare's Quicksort into the clean functional version. Even using some variant of functional arrays for the CPS-conversion, one would miss the point by a mile.

Casual Observation

What's a casual observer? To somebody who doesn't know any programming whatsoever, or maybe just not any functional programming, sure, they appear to have nothing to do with each other. However, would C really be any better, other than more people have a passing familiarity with C?

However, at a glance, I *do* think that the two descriptions look alike, at least to somebody with some familiarity of Haskell and a basic understanding of the sieve. Perhaps this is another reason why the fake sieve went undetected for so long. It makes me wonder what other surprises might be lurking.

I interpret the paper's title as provocative, not prescriptive. Moreover, I don't think the author is trying to argue that her algorithm is the same as the classical one, merely that it's an acceptable substitute. Certainly, it is a "genuine sieve" of some variety.

Since sieves can be applied to other problems, e.g. factorization, and the sieves of both Melissa O'Neill and Eratosthenes deal only with primality, it seems reasonable to call it "a Sieve of Eratosthenes", even if it isn't "the Sieve of Eratosthenes."

After all, O'Neill's sieve is fully incremental: if you want all the primes between 1 and 10^9, you don't have to wait for multiples of 2, 3, 5, and 7 to get crossed off to get back 53. So by the time criteria, it certainly is a different algorithm.

It seems on-topic to me,

It seems on-topic to me, because there is widespread belief that laziness brings programming much closer to mathematics, and this paper explores this belief using a well-known example. It also explores the bias functional programmers have towards list-based algorithms, despite how other data structures are often a better fit. Most industry languages come with a broad collection libraries, and most industry programmers won't even consider a language without such collection support.

Well put.

You certainly made the case as to why this discussion is on topic. (not that I ever really disagreed.) But as "algorithms" are "off topic," somebody needed to say it. Further meta-discussion should be relegated to the Meta LTU Thread

And, Yorick is absolutely right that most of the interesting imperative algorithms involve mutable arrays. In fact, my sieve, though not incremental, performs a significant, but not unreasonable constant-factor better than Melissa O'Neill's for this exact reason. In the process of tearing down our biases toward linked lists, she also ironically confirmed it.

Now, for a bit of speculation, I've never really found a completely satisfactory approach for expressing graph algorithms in a purely functional way. The advantage of FP is relatively much greater for trees.

So, maybe finding better ways of expressing array algorithms "less imperatively" and "more declaratively" could lead to better ways of expressing graph algorithms.

Anybody have an opinion on GHC's Difference Arrays? I've looked at the source, and thought about trying to extend them to encompass union-find, but I've never really used them enough to know their strengths and weaknesses.

[edit: Yorick said that, not naasking. Still, both comments are good. :-)]

seems important

to this grashopper -- sure, fp is great in some (lots) of situations, but to be really honest and respectful one has to think about and address what happens in the situations where fp has issues.

Reminds me...

I've always had a similar feeling about the standard Haskell quicksort example, and as it happens, I was just having a conversation about it the other day... This bit in particular applies:

I would argue, however, that they are confusing a mathematical abstraction drawn from [quicksort] with the actual algorithm. The algorithmic details, such as how you [cleverly swap elements in-place using pointer arithmetic], matter.

The Haskell quicksort has the remarkable property of making the essence of the algorithmic idea much clearer, while simultaneously destroying the essential cleverness that makes it such a good idea. So what is quicksort, really?

Anyway I haven't read this paper yet, although hopefully I'll have a chance in the morning.

Implementation differences

Well, yes, the standard example of how to implement quicksort in Haskell is _not_ a good implementation. However, it is "morally" the actual quicksort algorithm. I remember one friend of mine who had implemented quicksort in C++ and Java but did not understand why it worked until he saw the Haskell solution.

The worst downfall of the quicksort example is that naive use of list concatination leads to quadratic behavior. However, it _does_ make the exact same number of comparisons as the imperative counterpart.

If you fix the concatination issue, you get something on par with quicksort, even though it is still not a great implementation. Ignoring laziness, I believe that the quicksort example has the same asymptotic running time, but uses appreciably more memory. (O(n * log n) instead of O(log n))

However, the situation with the "sieve" example is far worse. Superficially, it appears to be the same, but it's really just trial division in disguise, because filter p . filter q is the same as filter (\x -> q x && p x). (in terms of time, ignoring memory consumption.)

And of course, the genuine sieve is appreciably better than trial division. Asymptotically, trial division tests all the numbers for divisibility by 2, 1/2 of the numbers for divisibility by 3, 1/2 * 2/3 of the numbers for divisibilty by 5, 1/2 * 2/3 * 4/5 of the numbers for divisibility by 7, etc.

However, the actual sieve "tests" 1/2 of the numbers for divisibility by 2, 1/3 of the numbers for divisibility by 3, 1/5 of the numbers for divisibility by 5, 1/7 of the numbers for divisibilty by 7, etc. As Melissa O'Neill points out, the exact nature of the "test" isn't essential.

Well, yes, the standard

Well, yes, the standard example of how to implement quicksort in Haskell is _not_ a good implementation. However, it is "morally" the actual quicksort algorithm. I remember one friend of mine who had implemented quicksort in C++ and Java but did not understand why it worked until he saw the Haskell solution.

And I know people who never understood why it worked well until they saw the C solution. I very much appreciate your use of the word "morally" in the same sense as the OP. So what is the morality of quicksort? Is it really quicksort if it's quadratic? Haven't we missed the point?

As for the sieve, I'll take your word for it, as I still haven't read the paper or given it much thought.

So what is the morality of

So what is the morality of quicksort? Is it really quicksort if it's quadratic? Haven't we missed the point?

I suppose that depends on the point one is trying to make. If your point is to actually implement quicksort, then yes, the Haskell example does miss the point.

However, the example is at least useful as first approximation. The reason behind the quadratic behavior has nothing to do with sorting. In some abstract mathematical sense, it distills the essence of the algorithm.

But yes, you don't fully understand quicksort until you understand the C version as well. In-place partitioning is an important aspect of a practical implementation.

For the benefit of lurkers, the example we are discussing appears several places, such as A Gentle Introduction to Haskell

For example, here is a concise definition of everybody's favorite sorting algorithm:

quicksort  []           =  []
quicksort (x:xs)        =  quicksort [y | y <- xs, y<x ]
                        ++ [x]
                        ++ quicksort [y | y <- xs, y>=x]

Huh. I seem to remember reading commentary in my early days of Haskell saying that it isn't exactly fair to compare this to an imperative version, e.g. C, due to clever use of in-place partitioning, etc. Apparently it wasn't there.

In defense of the tutorial, it does describe this as a "definition" and not an "implementation". But it's easy to see how somebody might come away with false impressions.

Wadler

The standard quicksort is one of the standard examples of using an accumulating parameter (for example Wadler's The concatenate vanishes):

qsort' [] ys = 0
qsort' (x:xs) ys = qsort' [y| y <-xs, y<= x] (x: (qsort' [y|y<-xs, y> x] ys))

I think in the end a quicksort is the quickie Haskell version and a good quick sort involves the median of 3, the avoiding small sorts and pushing the final to insertion sort.... All depends on what you are trying to explain.

Good example.

It's a good example of how to fix concatination; but certainly not as useful for explaining quicksort, IMO.

Does this tickle anybody?

Not quite so neat as the "fake" sieve, but I would call this genuine (some optimizations at the expense of clarity are obviously possible):

sieve (_:ns) (False:ps) = sieve ns ps
sieve (n:ns) (True:ps) = n : (sieve ns $ zipWith (&&) ps $ cycle $ replicate (n-1) True ++ [False])
primes = sieve [2..] $ repeat True

Of course, it's still lists, and as O'Neill says, data structures matter...

It's still naive trial divisibility

I'll try to come up with a nice, convincing algebraic argument later, but this algorithm has the same issues as the fake sieve, dressed up a bit differently.

You avoid the "division" part of trial division; but the use of division is not what makes that algorithm slow, nor is avoiding it what makes Eratosthenes Sieve fast. It's the "trial" part that is costly.

The key is that, like trial division and unlike the Sieve of Eratosthenes, you are still testing every number n for divisibility by every prime less than n. Exactly how that is achieved is not so important.

Hmmm. Alright.

The algorithm feels more genuine because I am using a "cross off every n" approach, but you are right that I'm still doing an asymptotically large number of &&s.

Avoiding the log-log term

The O'Neill paper notes:

The (Eratosthenes) algorithm is efficient because each composite number, c, gets crossed off f times, where f is the number of unique factors of c less than square-root(c). The average value for f increases slowly, being less than 3 for the first 10^12 composites, and less than 4 for the first 10^34

This f is what contributes the log-log term to the time complexity of the Genuine Sieve - O(n log log n). The optimizations which sieve only odd numbers, or wheel-generated numbers can be seen as a way to reduce the value of f in common cases, although the complexity remains unchanged.

But it is possible to strike off each composite number exactly once, and so remove the log-log term entirely. (And at the same time switch to a list-based implementation):

primes = 2 : diff [3..] (composites primes)

composites (p:ps) = cs
    where cs = (p*p) : merge (map (p*) (merge ps cs)) (composites ps)

merge (x:xs) (y:ys) | x<y           = x : merge xs (y:ys)
                    | otherwise     = y : merge (x:xs) ys

diff (x:xs) (y:ys)  | x==y          = diff xs ys
                    | x<y           = x : diff xs (y:ys)
                    | x>y           = diff (x:xs) ys

I believe the above algorithm has time complexity O(n). But can it still be regarded as the Sieve of Eratosthenes?

[edited to fix bug in diff - "<" and ">" were swapped - thanks Sjoerd!]

Computing every composite

Computing each composite number by multiplying is prime factors isn't part of the genuine Sieve, but this is very close to the spirit and... is a pretty cool algorithm.

Analysis?

It looks that at the end of the day, you will have evaluated the thunks for each of these lists, at least up to N:

composites [2,3,5...]
composites [3,5,7...]
composites [5,7,11,...]
 ...

So you're saying the sum of the sizes of these lists as a fraction of N converges in the limit. Is this obvious?

Think Lazier

You're not thinking lazily enough. The 'composites' function only ever depends, in any immediate sense, on the first prime in the list and the first composite in the list, both of which Nick is careful to provide. The primes themselves are computed lazily. The 'sum of the sizes of these lists' never gets involved (which is good, because each list is infinite).

I'm not entirely convinced it is O(N) yet, since the time complexity of an algorithm can never be better than the space complexity of the same algorithm and I haven't examined how much space is being consumed by all the thunks (though I suspect the total space cost is less than O(N)). But the algorithm does compute each composite, and each prime, exactly once.

Too lazy

I think I was actually being too lazy - with my choice of words. I realize each of the lists computed by composite is infinite and when I referred to their "sizes", I was referring to the size of the prefix actually computed (the portion up to the first element N or greater). Have you thought through the analysis of the algorithm? How many lazy cells get evaluated?

A simple bound

So for example, we can obtain the following bound easily. The number of elements in the list of composites made from Kth or greater primes is less than N/K. Since the number of primes P is asymptotically N/ln N, and since the harmonic series grows like ln N, we have the number of evaluated thunks bounded at:

*edit: removed incorrect computation*

I think the operations of the algorithm can be put into correspondence with the thunks in those lists such that each thunk corresponds to a bounded number of operations, so this would put the runtime of this algorithm at as least as good as the Sieve, asymptotically.

Lower Bound

  1. We know this computes each composite up to N exactly once.
  2. Each composite is produced by a number of multiplications based on its prime factorization
  3. Let c(N) be the sum of exponents in a prime factorization: 1 for prime numbers, 2 or more for composites
  4. Let sum_c(N) is equal to (SUM c(N) from 1 to N)
  5. And average_c(N) is sum_c(N)/N
  6. So the average number of multiplies to produce composite numbers up to N is (average_c(N)-1)

This produces a lower bound on complexity (multiplies plus merge/diff operations) of N*average_c(N).

Now, I don't know much math-trivia, so I don't know average_c(N) with regards to N, but what I do know is that average_c(N) must be greater than the average number of prime factors for N (since the sum of exponents is greater than the count of prime factors). According to Nick, the prime factor duplication is, by itself, sufficient to result in numbers being crossed off an average of O(log log N).

So I suppose this algorithm is actually 'worse' than an addition-based solution. It's at least O(N log log N)

Poking a hole

Each composite is produced by a number of multiplications, but some of those multiplications are shared. For example, 5*(2*3) and 7*(2*3) both reuse the computation of 2*3. Thus it's not enough to merely count the average number of factors - you must consider how frequently they are reused.

(Also, if you can establish the Nlog log N lower bound, I think I've shown a log log N upper bound in another part of the thread. That bound is based on what feel like pretty conservative estimates, though, so I'd find it a little bit surprising if the cost were actually N log log N)

Point granted. I didn't

Point granted. I didn't consider sharing of multiplications.

Based on sharing, and examining the algorithm again, it seems every composite number involves multiplying exactly once from a prior prime or composite number. Given that every composite is also reached only once, the number of multiplies is thus less than N to produce all primes up to N.

So that leaves only consideration of how many 'merges' the production of a composite number will suffer, but this also seems to exhibit the same sort of 'sharing'. It does seem that each (p*c) will be merged a number of times equal to the number of primes smaller than p, so the cost of computing a composite is: a multiply, plus a number of merge events proportional to O(log(p)) for the smallest prime factor p. That doesn't look to contribute much, if anything, to the time complexity, but I lack the skill to further explore it.

So, O(N) it could be.

Agreed

It does seem that each (p*c) will be merged a number of times equal to the number of primes smaller than p, so the cost of computing a composite is: a multiply, plus a number of merge events proportional to O(log(p)) for the smallest prime factor p.

Agreed, and this is equivalent to asking how many elements less than N are in those composites lists, since the Kth list consists of elements whose smallest prime factor is the Kth prime or greater. It looks like potentially hard to me, so I'm done. It'd be easy to measure and graph performance, but that probably wouldn't do much good since, as it's been said, log log N has been proven to diverge but has never been observed doing so.

Avoiding the log-log term, but at what cost?

The complexity of O(n log log n) stated for the Sieve seems to be based on being able to perform the additions need to iterate through the multiples of each prime in constant time. But for a large prime p, it will actually take O(log p) space to represent p and O(log p) time to add p to a multiple. I suspect that if you assume this for additions, O(n log log n) isn't even right. And since multiplication isn't a linear operation like addition, this algorithm would actually make the situation worse.

Operation Cost

I was actually thinking about this earlier then thinking: but it still beats the division algorithm, and we get infinite lists in the bargain!

The cost for grade-school integer multiply algorithm is O(log A * log B) where A and B are the operands. In Nick's algorithm, we can guarantee that one operand (A) is a prime number, and that at least half of all results will be multiples of 2, a third of all remaining composites results will be multiples of 3, etc. I'm not quite certain how all this adds up, but it probably is far less than O(log^2 N) for result N.

The cost for a grade-school integer addition is O(log A + log B) for operands A and B, given the need to produce a new representation. You should be able to do better than that by sharing representation when adding small numbers to large numbers.

I'd be interested in finding out how these differences affect the sieve.

Shorter version

Nobody tried this code? Diff has a bug. And when you fix it, you'll see that merge and diff are actually the same function, symmetric difference, or xor:

primes = 2 : xor [3..] (composites primes)

composites (p:ps) = cs
    where cs = (p*p) : xor (map (p*) (xor ps cs)) (composites ps)

xor (x:xs) (y:ys)  | x==y          = xor xs ys
                   | x<y           = x : xor xs (y:ys)
                   | x>y           = y : xor (x:xs) ys

When merging, the x==y part is never used, and when diffing, the x>y part is never used.

Very nice!

Very nice!

I'll bite


lps@azarel:~/Programs/snippets $  ghc -O2 -c MaybeSieve.hs 
lps@azarel:~/Programs/snippets $  ghci MaybeSieve.hs 
GHCi, version 6.8.2: http://www.haskell.org/ghc/  :? for help
Loading package base ... linking ... done.
Ok, modules loaded: MaybeSieve.
Prelude MaybeSieve> length (take 10000 nick's_primes)
10000
(0.40 secs, 0 bytes)
Prelude MaybeSieve> length (take 10000 sjord's_primes)
10000
(0.40 secs, 0 bytes)
Prelude MaybeSieve> length (take 10000 primes)
10000
(0.03 secs, 0 bytes)   [edit:  (41.15 secs, 2160096056 bytes) ]
Prelude MaybeSieve> take 10000 nick's_primes == take 10000 primes
False  [edit: True]
(0.01 secs, 0 bytes)
Prelude MaybeSieve> take 10 (xor nick's_primes primes)
[25,35,49,55,65,77,85,91,95,115] 
(0.01 secs, 0 bytes)
Prelude MaybeSieve> take 10000 nick's_primes == take 10000 sjord's_primes
True
(0.03 secs, 0 bytes)

Looks like you applied a correct transformation to incorrect code, Sjoerd. [edit: that would be a correct transformation to correct code]

As for being fast, it's not. [edit: fast is relative. It's not great, but it's not too unreasonable.] primes is the impostor sieve that we've been discussing, and it's over 10x slower than that. [edit: I got that sieve wrong. It's over 10x slower than nick's/sjord's] As for being linear, I'm pretty sure it's not. [edit: this is one of the few correct statements I made here] In fact, it appears to be asymptotically worse than the impostor. [edit: nope. turn that statement around] If we restart ghci:

Prelude MaybeSieve> length (take 50000 nick's_primes)
50000
(7.55 secs, 166816348 bytes)
Prelude MaybeSieve> length (take 50000 primes)
50000
(0.11 secs, 0 bytes)
Prelude MaybeSieve> nick's_primes !! 10000
104743
(0.01 secs, 0 bytes)
Prelude MaybeSieve> nick's_primes !! 50000
611957
(0.01 secs, 0 bytes)

That's 19x more time to solve a problem instance 6x as big. I hate being a curmudgeon, but I also hate not being rigorous. [edit: both true] :-)

Recheck?

Can you recheck your definitions file and then post the code if you don't find a problem? I copy and pasted the code for sieve in the original post with Sjoerd's code, changing only two quotes to backquotes, and the two produce equivalent lists of primes with Sjoerd's being much faster.

Oops, how embarrassing.

I thought that was kind of fast for the fake sieve, but I subconciously assumed it was too simple to mess up. (haha, right.) I forgot to recurse, so it only filters out the factors of 3 out of the odd numbers.

But the fact remains, there is no way this is linear. It's very similar to Richard Bird's code at the end of the paper, but it does improve on it, so not bad! I suspect it's asymptotically the same as Bird's, which Melissa O'Neill does discuss as being suboptimal.

So, Nick/Sjord's code: (performs identically)

ghci> let test f n = length (takeWhile (< n) (f ()))

ghci> test nick's_primes 100000
9592
(0.42 secs, 0 bytes)
ghci> test nick's_primes 200000
17984
(0.88 secs, 0 bytes)
ghci> test nick's_primes 400000
33860
(3.85 secs, 105371172 bytes)
ghci> test nick's_primes 800000
63951
(11.37 secs, 239707676 bytes)

Richard Bird's code from the end of the paper:

ghci> test bird's_primes 100000
9592
(1.03 secs, 50939444 bytes)
ghci> test bird's_primes 200000
17984
(2.33 secs, 122054156 bytes)
ghci> test bird's_primes 400000
33860
(5.47 secs, 304108996 bytes)
ghci> test bird's_primes 800000
63951
(13.11 secs, 785552660 bytes)

And, just for fun, my own sieve that I use from time to time. The expression sieve n finds the smallest prime factor of every number less than n, so it's a bit more general than Eratosthenes. It's not incremental in any way, shape, or form, but there are ways to extend the algorithm.

ghci> factor (sieve 100000) 3
[(3,1)]
(0.03 secs, 0 bytes)
ghci> factor (sieve 1000000) 3
[(3,1)]
(0.15 secs, 2002168 bytes)
ghci> factor (sieve 10000000) 3
[(3,1)]
(1.65 secs, 11002884 bytes)
ghci> factor (sieve 100000000) 360
[(2,3),(3,2),(5,1)]
(19.57 secs, 101004300 bytes)

Flaw in Test

I suspect that a lot of the difference you're seeing is due to heap-reallocation and associated garbage collection. The prior tests are affecting the latter tests by consuming heap and leaving garbage upon it that is only collected when running the latter tests.

Memory Management

I suspect that a lot of the difference you're seeing is due to heap-reallocation and associated garbage collection.

That's certainly part of it, but that's also an intrinsic part of the respective algorithms.

Nick's code appears to allocate about 1/3 of the memory of Bird's code, but probably has basically the same runtime overall. That's probably why it does comparatively well in the beginning, but quickly catches up. Even so, reducing memory consumption is a good thing, even if you don't save much time.

My sieve doesn't create any garbage. The only heap allocation is a single unboxed ST array. Then after it's done filling it, it uses runSTUArray to convert that to an immutable array without copying. Two simple tricks are used to reduce the size of the array by 75%.

But even disregarding the garbage avoidance and the significant constant-factor advantages of arrays, the algorithm my code uses has the same asymptotic complexity as Eratosthenes' sieve. Bird's code only does a little bit better than trial division.

The prior tests are affecting the latter tests by consuming heap and leaving garbage upon it that is only collected when running the latter tests.

Doubt it. In fact, the first expression run during a session will often take a bit longer, as GHCi might need to load a package.

Unlike Hugs, GHC's garbage collection is fast, and GHCi does not offer :gc to force a collection. Since humans type slow, there is almost always at least a partial GC done between expressions. I haven't exactly figured out bytes reported by ghci mean, but I believe it's the amount of garbage collected while the algorithm was running. (There may be a race condition in there to confuse things a bit.)

Besides, on the latter examples, there was far more memory collected than the standard 64M heap that I was working on. The collector ran multiple times, and the times were pretty consistent.

"There is no way this is linear"

I was going to scoff that you couldn't distinguish ln ln N from a constant, but I was getting similar numbers to yours (measured in operation count rather than just runtime), so I revisited the proof of my bound above and realized it has a big hole (my turn for oops!). It actually looks likely to me that one could work out the exact asymptotic runtime, but I suppose the lesson I should take away from this is that math should be done carefully or not at all.

I suspect you're right that it's not linear or even N ln ln N.

Someone ...

should really build a library on top of Haskell, like HUnit, which will experimentally supply a good estimate on the complexity of a given algorithm.

I concur

It's a wonderful idea

Maybe

You should contact the author Melissa.

O(n log^2 n)?

The complexity of this function is O(n) times the average value of pi(lpf(n)) restricted to the composite numbers, since lpf(n) determines how many different (composites (p:...)) lists n occurs in. Here lpf(n) is the least prime factor of n. Unfortunately I don't know how to estimate that analytically. A numeric estimate out to 20000000 makes me think it's between O(n log^2 n) and O(n log^3 n). Here's a few more details: http://naml.us/blog/2009/01/sieving-primes.

Great analysis

the total cost due to k is roughly π(lpf(k)) where lpf(k) is the least prime factor of k.

Why? Is it an estimate of the length [of the composite list where k was generated]? If so, I think you might not take sharing into account in your summation.

[My, very simple, rough estimate was something like O(π(n) + 2 * (n - π(n))) is O(2n - π(n)).] [Just counting the number of unique multiplications] [I guess 2 should be a constant C]

O(n)? Another argument?

Rationale:

The list of prime numbers up to n is the list of numbers (O(n), trivially) minus the list of composites.

The composite list of multiples is generated recursively from the list of primes. That composite list is the merger of three strictly increasing lists where the head of the list is known. Thus?, to determine the n-th composite O(n) comparisons are needed, or, it takes constant time to determine the next composite [if the list of primes would be known].

Let c0 be the constant to generate the next natural number.

Let c1 be the constant to generate the next composite. [c1 > c0]

Then it takes approximately c0*π(n) time to generate all primes, and c1*(n-π(n)) time to generate all composites below n.

Thus c0*π(n) + c1*(n-π(n)) or approximately c0*n/ln(n) + c1*n - c1*n/ln(n) = c1*n + (c0-c1)*n/ln(n) time to generate all primes below n. Thus O(n).

[Uh, edited the last formula substantially.] [Hmpf, I am not taking into account that π(n) lists should be merged; but it might be that it doesn't matter.]

[?It's O(n), trivially, since if n grows, the average distance between prime numbers grows, therefore the complexity 'grows' to the complexity of generating natural numbers.]

I don't think it's O(n)

I think the sequence plotting that's been done in this thread all points to it not being O(n). David, Geoffery, and I have each posted an analysis to this thread (Geoffery's, by link, is the best presentation) and all come to the same conclusion that it's the sum of a certain series. I suspect you'll need to do a bit of harder math to convert that to a simple closed form answer. The problems I see with your argument are:

That composite list is the merger of three strictly increasing lists where the head of the list is known.

Except you need to account for how long it takes to compute the three lists - one of those is a recursive call.

It's O(n), trivially, since if n grows, the average distance between prime numbers grows, therefore the complexity 'grows' to the complexity of generating natural numbers.

I don't really understand this reasoning. Is this possibly the fallacy that if a sequence is ever increasing, then it is unbounded? (which isn't true since you can have a sequence like 1, 1.5, ... ,2 - 1/n, ..., which approaches but never reaches 2). [Well, probably not, since the average separation between primes actually does increase without bound. So I just don't understand this reasoning.]

I think the sequence

I think the sequence plotting that's been done in this thread all points to it not being O(n). David, Geoffery, and I have each posted an analysis to this thread (Geoffery's, by link, is the best presentation) and all come to the same conclusion that it's the sum of a certain series. I suspect you'll need to do a bit of harder math to convert that to a simple closed form answer. The problems I see with your argument are:

That composite list is the merger of three strictly increasing lists where the head of the list is known.

I have to admit, I didn't read all the analysis.

Except you need to account for how long it takes to compute the three lists - one of those is a recursive call.

It shouldn't (ok, might not) matter, with sufficient sharing, it is only relevant how many terms are generated: which is at most one term for every number below n, which is either a prime or a composite. Since you cannot evaluate more, and every term is evaluated only once, [and every composite number is generated uniquely?], the above argument might hold. But I do admit that I don't trust the argument fully myself.

It's O(n), trivially, since if n grows, the average distance between prime numbers grows, therefore the complexity 'grows' to the complexity of generating natural numbers.

I don't really understand this reasoning. Is this possibly the fallacy that if a sequence is ever increasing, then it is unbounded? (which isn't true since you can have a sequence like 1, 1.5, ... ,2 - 1/n, ..., which approaches but never reaches 2). [Well, probably not, since the average separation between primes actually does increase without bound. So I just don't understand this reasoning.]

I tried to find an intuition why it might be O(n). Since a new list of composites is generated for each prime, and the density of primes goes to zero asymptotically, I though that that might explain that you end op with O(n). But, again, I have to admit I don't trust the argument myself. There was a question mark ;o).

Hm, this actually might be very wrong reasoning ;-).

In the end it all boils down to: does it take constant time to generate the next composite number. Which might not be true since you end up merging pi(squareroot(n)) lists.

[Nah, too much BS. Will read up on that Bird analysis again and come back...]

How many terms are generated

It shouldn't (ok, might not) matter, with sufficient sharing, it is only relevant how many terms are generated: which is at most one term for every number below n, which is either a prime or a composite. Since you cannot evaluate more, and every term is evaluated only once, the above argument might hold. But I do admit that I don't trust the argument fully myself.

The algorithm doesn't just generate a list of primes and a list of composites. If you look at it, 'composites' is a function that generates a list of composites using only a given list of prime factors, and it is invoked by the algorithm for every tail of the list of primes. That is, the algorithm (eventually, if a large enough N is demanded) invokes composites [2, 3, 5, 7, ...], composites [3,5,7, 11, ...], etc. For every natural K, there is some composite that appears in at least K merged lists.

Thus, it certainly does not take constant time to generate the next composite number. But this does not settle whether or not the whole thing is O(n) - the equivalent question is whether it takes some fixed constant time on average.

I might be stupid

But...

I don't fully follow your line of reasoning. If I evaluate the length of a given list [1..10] it will invoke length [1..10], length [2..10], etc. But that still means it is O(n) right?

*scratch*

You are right. Uh, maybe.

How silly of me

I misread your statement. I didn't understand that primes is invoked twice in every invocation of composites.

[Forget it, again, will read up on the Bird analysis, then come back.]

I reread the analysis by Geoffrey. I think he is right. And by now I am certain it cannot be O(n) since an increasing number of lists is merged.

However, there is an image of evaluated comparisons here.

**************************************************************
The haskell program
**************************************************************
module Main where

import System.IO as IO
import System
import Debug.Trace

primes = 2 : xor [3..] (composites primes)

composites (p:ps) = cs
    where cs = (p*p) : xor (map (p*) (xor ps cs)) (composites ps)

xor (x:xs) (y:ys) | x==y = trace "=" ( xor xs ys )
                  | x<y  = trace "<" ( x : xor xs (y:ys) )
                  | x>y  = trace ">" ( y : xor (x:xs) ys )

l n = length (take n primes)

main =
    do
        args <- getArgs
        if length (args) == 0
            then do
                    print "usage: primes n"
            else
                 let n = read (args !! 0)::Int in
                 do
                    print ("arg+1 = " ++ show (l n))

**************************************************************
The bash program, you can cat it to 'out.txt' for example:
**************************************************************

#!/bin/bash

for (( I = 3 ; I < 1000000000 ; I = (I * 3) / 2  ));
do
J=`a.out $I | wc -l`;
echo $I $J;
done

**************************************************************
To build the picture:
**************************************************************

graph --output-format "gif" out.txt > comparisons2.gif

**************************************************************

[ Fixed some <s] [And even more]

And by now I am certain it

And by now I am certain it cannot be O(n) since an increasing number of lists is merged.

Actually, an unbounded number of lists being merged doesn't imply that it's not O(n). Here's a counter-example:

evens = multiplesof 2
multiplesof k = k : merge [k*n | n <- [2..]] (multiplesof 2*k)

Hehe

So we're still out in the open on this one? My computer lacks processing power to really evaluate a large number of terms. From the experiments it seems it can go both ways.

[Is that really O(n)?] [Again, I give up for now. Need to read up on this stuff.]

Is that really O(n)? Yes.

Is that really O(n)?

Yes. There are N/2 evens, N/4 multiples of 4, N/8 multiples of 8, etc. If you add up these terms up to the first power of 2 greater than N, you get something less than N (and if you round up when you divide that only gets you log_2(N) more). Note: This is just because 1/2 + 1/4 + 1/8 + ... is a geometric series that converges to 1.

Guess so

The length argument I got, but I wasn't sure you could discard the cost of merging the log(N) lists.

At the moment, I am slightly more than a bit baffled by the other problem.

[Reread the other posts. I am replaying the arguments. So I'll look at the cost of merging.]

Oops: O(n^1.5 / log^3 n)

My previous complexity guess was wrong (though in fairness it was only a numerical guess). Here's an analytic analysis resulting in O(n^1.5 / log^3 n):

http://naml.us/blog/2009/01/sieving-primes-2

In case others are still confused about the sharing analysis: it's impossible to share the cons cells from different calls to (composites (p::...)), because they belong to _different infinite lists_. If they were shared, it would imply that if you go out far enough, the set of numbers that can be constructed from all primes > p is the same as the set of numbers that can be constructed from all primes > q. If p != q, this is false. Apologies if someone else already made this argument.

Thanks

Nice job

A minimalist Eratosthenes Sieve

I'd like to point out another implementation of the Sieve of
Eratosthenes: in the true spirit of the original algorithm, the
implementation uses neither division nor multiplication operations. In
fact, the algorithm doesn't even use general addition or subtractions,
relying only on successor, predecessor, and zero comparison. The
algorithm easily generalizes to other number sieves (e.g., computing
lucky numbers).


http://okmij.org/ftp/Algorithms.html#Eratosthenes-sieve



http://okmij.org/ftp/Haskell/number-sieve.lhs

Thanks Oleg

Thank you for calling this to my attention. At some point, I definitely want to take a careful look at this. Off the cuff, I'm guessing it has rather high constant-factor performance issues, but that's not why this appears to be very interesting to me.

[I'm very much reminded of the wisecrack in The Evolution of a Haskell Programmer that "graduate education tends to liberate one from petty concerns about, e.g., the efficiency of hardware-based integers"]

Sorry Oleg, it's just naive trial divisibility

Ok, I took a good look at it. Your primality sieve suffers from the same problem as Tim Band's comment above. You say that you don't do any operations on the composite numbers, but you do: you traverse them repeatedly.

However, you are right to point out that certain aspects of either the Sieve of O'Neill or the Sieve of Eratosthenes don't generalize in a nice way. I'll have to think about how to implement a lucky number sieve well.

Wikipedia says that the progenitors of the lucky numbers suggested calling it the Sieve of Josephus Flavius, which recalls Chris Rathman's post here. Some interesting stuff to think about!

Different motivation

I thought his post was a little bizarre too when I first encountered it in the context of this thread (avoiding addition through repeated increment isn't much of an optimization), but I don't think he was tossing this out as an exemplar of a speed. I think the motivating factors here are simplicity in a from-first-principles sense and generality. Take the last sentence of his description from the link:

Thus the algorithm can be used with Church and Peano numerals, or members of Elliptic rings, where zero comparison and successor take constant time but other arithmetic operations are more involved.

Re: different motivation.

I don't entirely disagree.

However I think Oleg is wrong on this count. My off-the-cuff guess was wrong, which I have to admit was based on Oleg's history of being correct rather than the content of that particular comment.

The alleged Sieve of Oleg is in the same asymptotic ballpark as the impostor sieve of David Turner, as well as Tim Band's sieve, none of which come close to reasonably-implemented trial division, let alone a geniune sieve.

I have looked at elliptic rings, however briefly. Lucky numbers are new to me, and I'm not aware of the best known algorithms in this area.

The complicating factor with the lucky number sieve compared to the Sieve of Eratosthenes is that you have to know the locations of the existing lucky candidates to cross off your target. The "obvious" solution is mighty similar to Oleg's. However... maybe it's not too hard to modify the Sieve of O'Neill to handle the lucky numbers?

On what count?

However I think Oleg is wrong on this count.

On what count? I agree Oleg's algorithm has more than constant factor issues, but I don't see where Oleg claimed that it didn't.

Context

Did David Turner ever come out and explicitly state that his classic example was within a constant factor of the genuine Sieve of Eratosthenes? I'm not aware that he did.

Oleg might not have intended to imply such a statement, but without it, the surrounding context feels especially bizarre. The subject starts with the phrase "Even better Eratosthenes sieve", and explicitly mentions Melissa O'Neill's paper, but by the criteria of that paper, proceeds to commit the same fallacies.

Admittedly, the only concrete, definitive thing I can say that Oleg got wrong was his point (i), that his code performs no operations on the composite numbers. But that seems to support my overall thesis.

It's an interesting, if flawed, article. I don't think Oleg's implementation of the lucky number sieve is particularly profound, but on that count, his overall thought process is correct. I am definitely still grateful that he brought the subject to my attention.

I think the productive path for this conversation to take is to focus on the lucky number sieve and how it might be implemented.