The Genuine Sieve of Eratosthenes

Melissa E. O’Neill, The Genuine Sieve of Eratosthenes.

A much beloved and widely used example showing the elegance and simplicity of lazy functional programming represents itself as "The Sieve of Eratosthenes." This paper shows that this example is not the sieve and presents an implementation that actually is.

Starting with the classic one-liner sieve (p : xs) = p : sieve [x | x <- xs, x `mod` p > 0], O'Neill proceeds to show why this standard rendition of the Sieve of Eratosthenes does not in fact "cross off" the multiples of each prime in the same way the "real" Sieve does.

She notes that "Some readers may feel that despite all of these concerns, the earlier algorithm is somehow “morally” the Sieve of Eratosthenes. I would argue, however, that they are confusing a mathematical abstraction drawn from the Sieve of Eratosthenes with the actual algorithm. The algorithmic details, such as how you remove all the multiples of 17, matter."

A fun read.

In one word...

Breathtaking. If they don't accept this paper, I'll personally get on a plane and hit someone on the head ;-).

I also liked that she asked Bird, Meertens, and Peyton Jones for comments. It somehow evokes that naive 70's FP feeling (apart from being a quality paper, I really liked the performance analysis).

[that might have been 80's, I am not sure]

[As a side note: what I also liked is that she discusses how FP programmers prefer lists for a lot of operations, whereas many algorithms are better implemented using other data structures. Personally, I always felt that 'the' data structure for FP should not be the list but the 'map'.]

list vs map

Functional programming depends heavily on persistent immutable data structures. While a persistent list is trivially implemented, a persistent map is much more complex. I daresay that's the main reason for this distinction.

On the other hand, I'd rather have tree-like structures, which give both good indexed access and good parallelism -- and maps can be built over trees.

Very interesting. Thanks!

Very interesting. Thanks!

Off topic?

How is this not off topic, Ehud? ;-)

Not that I'm complaining about the choice of topic. I read this paper a few months ago, and it is definitely worth reading. The fact that the classic example cited by the paper is not actually the Sieve of Eratosthenes is so blindingly obvious in retrospect that it is really quite embarrassing that functional programmers have let it slide for so long. Kudos to Melissa O'Neill!

Myself, I never really picked up on the purported "sieve" as particularly interesting, and never thought about it deeply. What really stood out (in my mind) from my early days of FP was fold-left, fold-right, and the corecursive definition of the Fibonacci sequence.

I know I did implement the genuine sieve in Haskell years ago, but I used mutable monadic arrays. I might as well have been writing in an imperative language. I know I had seen the purported "sieve", but I don't recall why I didn't use that. Maybe it didn't come to mind, maybe I had tried it and found it too slow, or maybe I thought it was a silly implementation from the outset.

I don't want to start a big meta-discussion about LtU in this particular thread, but I've been troubled by the militancy of "off topic" accusations lately. However, as this seems to me to be OT by your own standards, it seems worth pointing out.

Don't get me wrong, some of the accusations were justified. I appreciate having reasonably intelligent discussions a few levels above the tripe on Slashdot, and I realize that a certain amount of this behavior is probably necessary to maintain this.

I personally feel the "off topic" standards should be relaxed a bit; as long as a post is of reasonable quality and likely to be of interest to a significant fraction of people interested in programming languages in general (such as backreferences and deterministic automata), it should not be discouraged.

Can't speak for Ehud...

...but in my own barely on-topic foray into algorithms I was trying to ask whether the language that is used to express algorithms can be truly PL- or paradigm-agnostic. Personally, I find the particulars of the Sieve to be of less interest than the fact that the language used to express the algorithm is intrinsically tied to the PL that implements it. The only formal expression of the algorithm given by the author is modified Haskell code that supposedly fits the informal description of the problem given in natural language. Which, to me, speaks to both the importance and the failure of programming languages in expressing algorithms.

As a side question, are the mutable monadic arrays in Haskell capable of growing dynamically? I'd conjecture that the expression of the Sieve in many non-functional languages (using fixed-size arrays) is not truly expressing the fundamental algorithm, which has no stated upper bound. This was one of the advantages claimed for the discredited Haskell solution.

On the meta-level of discussion, I don't think questioning whether a particular response is off-topic is necessarily a conversation killer. Most of those borderline posts contain implicit calls for the poster to justify why they think the topic should be considered relevant to discussions of PLs. This is not to say that there is not a lot of gray area... it is to say that it is incumbent upon the person to try and shed light on the role that PLs play in the discussion of any particular item.

The classical sieve

Haha, personally, I find the particulars of the sieve far more interesting than many (most?) PL discussions, which I find tend to be rather boring and dry without real problems and compelling examples to motivate them. Or maybe they are just too metamathematical for me to easily grasp.

I'd conjecture that the expression of the Sieve in many non-functional languages (using fixed-size arrays) is not truly expressing the fundamental algorithm, which has no stated upper bound.

Oh, but you are absolutely wrong. The classical sieve has no inherent upper bound, but you must pick one before you start. Once you pick that bound, it can't really be increased inside the realm of Eratosthenes' algorithm.

Without too much difficulty, one can extend the genuine algorithm to allow that bound to be increased, in the "obvious" and straightforward way. To restart the sieve, you take each prime p in your sieve and compute old_bound + ((-old_bound) `mod` p) and then cross out every pth number after that. Note that the ancients could not have performed this efficiently, as they lacked the long division algorithm necessary to compute the appropriate multiple of each prime.
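
As a concrete illustration (my own sketch, not from the paper, relying on Haskell's `mod` returning a non-negative remainder for a positive divisor), the restart point for each known prime could be computed like this:

-- For each prime p already sieved, the first multiple of p at or past
-- oldBound; crossing off resumes from there, every p-th number, up to
-- the new bound. (restartPoints is a hypothetical helper name.)
restartPoints :: Int -> [Int] -> [(Int, Int)]
restartPoints oldBound ps =
  [ (p, oldBound + ((-oldBound) `mod` p)) | p <- ps ]
-- e.g. restartPoints 10 [2,3,5,7] == [(2,10),(3,12),(5,10),(7,14)]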

However, this method of continuation results in a significant point of departure from the original sieve, and changes the time complexity. In the worst case, where you increase the upper bound by a single number each time, this extension degenerates into trial division. Perhaps this is why the impostor "sieve" went undetected for so long.

Well, there is a potential

Well, there is a potential variant that would allow increasing the bound without involving long division (at the cost of increasing the storage requirements) and thus would have been possible for "the ancients". For each prime, you record the last number that you crossed off. Then, when you want to increase the upper bound, you simply go through the list and see if the next one is in range or not.

Yup.

Yup. The paper basically presents a slightly more clever implementation of that exact idea, using a priority queue instead of a list, and storing the next number to be crossed off for each prime, instead of the last number that was crossed off. Then one can dispense with the classical algorithm entirely, if one so chooses, or attempt a hybrid approach.

Whether or not this is the "genuine" sieve is a matter of semantics, as it's definitely not the classical algorithm, but the overall effect is the same. You get the same time complexity, and not that of trial division. Notably, you relax the need for an upper bound, at the cost of increased storage requirements. :-)
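
For the curious, here is a minimal sketch of that scheme (my own code, not the paper's), using Data.Map as a stand-in for the priority queue. The map sends each upcoming composite to the primes that will cross it off, and each prime is first scheduled at its square:

import qualified Data.Map as Map

primesPQ :: [Integer]
primesPQ = go [2..] Map.empty
  where
    go []     _     = []
    go (x:xs) table =
      case Map.lookup x table of
        -- x is not scheduled to be crossed off, so it is prime;
        -- its first crossing-off happens at x*x
        Nothing      -> x : go xs (Map.insert (x * x) [x] table)
        -- x is composite: reschedule each prime that crosses it off
        Just factors -> go xs (foldl reinsert (Map.delete x table) factors)
      where
        reinsert t p = Map.insertWith (++) (x + p) [p] t

take 10 primesPQ gives [2,3,5,7,11,13,17,19,23,29]; swapping the Map for a real priority queue and adding the wheel optimizations gets you much closer to the version in the paper.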

Good company

Oh, but you are absolutely wrong.

Probably so, but since a lot of smart Haskell programmers were also wrong about their Sieve, I don't feel so bad. :-)

The classical sieve has no inherent upper bound, but you must pick one before you start.

Backtracking a bit, the algorithm does imply binding a value N to the function before starting out, so that the returned list is a finite set. Out of curiosity, I'm wondering if there is an extant version of the algorithm that survived the burning of the library at Alexandria? Just wondering what the ancient PL looked like. If we take Wikipedia as authoritative, we have:

  1. Create a contiguous list of numbers from two to some highest number n.
  2. Strike out from the list all multiples of two (4, 6, 8 etc.).
  3. The list's next number that has not been struck out is a prime number.
  4. Strike out from the list all multiples of the number you identified in the previous step.
  5. Repeat steps 3 and 4 until you reach a number that is greater than the square root of n (the highest number in the list).
  6. All the remaining numbers in the list are prime.
The spurious Haskell implementation is:
primes = sieve [2..]
sieve (p : xs) = p : sieve [x | x <- xs, x `mod` p > 0]
which doesn't look at all the same to any casual observer - and the corrected version doesn't look much better. The underlying premise of the paper is that we are trying to take an algorithm given in a non-formal, non-PL form and express it in Haskell (or any other PL for that matter). We can't determine algorithmic equivalence based purely on input and output, but must factor in cost as well (I am reminded of the aspersion that Lisp programmers know the value of everything but the cost of nothing). The paper assumes that if the input and output of a function are equivalent and the time complexity involved is the same, then the algorithm might be the same.

Of course, one can quibble about what a "list of numbers" is (the chosen data structure), and what it means to "strike out" a number (change its state). We end up recognizing that algorithms expressed in PLs are not the same thing as the algorithms we are modeling. Though there is less intuition involved, we still can't formally prove that our implementation captures the essence of the algorithm (Haskell functions are not the same thing as functions in the mathematical sense) - though we can prove two algorithms non-equivalent by observing their time complexity.

Haha, personally, I find the particulars of the sieve far more interesting than many (most?) PL discussions, which I find tend to be rather boring and dry without real problems and compelling examples to motivate them. Or maybe they are just too metamathematical for me to easily grasp.

In all the solutions that I develop on a day-to-day basis, there is the problem of taking incomplete (and sometimes incoherent) specifications and expressing them in a PL. I agree that this can be interesting when applied to particular problems, but there has to be some reflection on how it echoes the notion of PLs in general.

Well put

These are exactly the kind of considerations I was trying to get at in the case of quicksort (below).

Been wrestling with it

I was trying to make a similar point with MergeSort on my algorithms post, pointing out that CLRS and Knuth explicitly define their algorithms with state variables. A much simpler example would be Euclid's GCD. TAOCP, which for some is the bible of algorithms, lays it out as:

Given two positive integers m and n, find the greatest common divisor, i.e., the largest positive integer which evenly divides both m and n.
  • E1. [Find remainder] Divide m by n and let r be the remainder. (We will have 0 <= r < n).
  • E2. [Is it Zero?] If r = 0 the algorithm terminates; n is the answer.
  • E3. [Interchange.] Set m <- n, n <- r, and go back to step E1.
The nearest I got to a literal translation in ML was:

fun gcd(m, n) =
   let
      val m = ref m
      val n = ref n
      val r = ref 0
   in
      while
         (
            r := !m mod !n;      (* E1 *)
            !r <> 0              (* E2 *)
         ) do (
            m := !n;             (* E3 *)
            n := !r
         );
      !n
   end

But that looks somewhat ugly compared to the recursive solution of:

fun gcd (m, 0) = m
  | gcd (m, n) = gcd(n, m mod n)

The question would be (1) whether these are the same algorithm, and how we would determine that; and (2) whether we can express the algorithm in a Knuth-style language that would describe what both implementations have in common?

(1) whether these are the

(1) whether these are the same algorithm, and how we would determine that?

This paper discusses exactly this type of question.

(1) whether these are the

(1) whether these are the same algorithm, and how we would determine that?

No, but only because your first implementation has a bug. :-) Namely, gcd(m,0) causes a divide by zero in the first version, but not in the second.

Upon re-reading Knuth's assumptions, he does specify positive integers, so it's not entirely fair to label it a "bug". This probably isn't a good assumption, but it does avoid discussion of gcd(0,0), which probably ought to return an error.

But, modulo that quibble, I daresay yes. In fact, many compilers would take the latter implementation and produce object code that would be similar to Knuth's specification in the same sense that your original code is "similar" to Knuth's.

The compiler would likely create a temporary variable (let's call it "r", pun intended) to store m mod n. As the call to gcd is a tail call, (n,r) will get passed in the same location as (m,n). The compiler shouldn't use further temporaries to set this up. At that point, the call is just a simple goto.

Ironically, your original implementation would probably result in a less direct rendering of Knuth's specification in object code. Your SML compiler might well store the references on the heap and refer to them with a pointer, especially SML/NJ as it supports call/cc.

I'm not sure I understand what question (2) is trying to ask.

Arrays are harder

The question would be (1) whether these are the same algorithms and how would determine that?

What happens to work well in this case is to translate Knuth's original description into a set of tail-calling functions E1, E2 and E3, passing m, n and r between them — we know what is the Ultimate, after all. Then it's just a matter of simplification before arriving at something very close to the "functional" solution.
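
A sketch of that translation (mine, in Haskell rather than ML, since most of the code in this thread is Haskell): one small tail-calling function per step.

gcdKnuth :: Integer -> Integer -> Integer
gcdKnuth = e1
  where
    e1 m n = e2 m n (m `mod` n)   -- E1. [Find remainder.]
    e2 m n r
      | r == 0    = n             -- E2. [Is it zero?] n is the answer.
      | otherwise = e3 m n r
    e3 _m n r = e1 n r            -- E3. [Interchange.] m <- n, n <- r, back to E1.

Like the ML version, it assumes n > 0. Inlining e2 and e3 collapses it into something very close to the two-line recursive definition.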

Now, most interesting imperative algorithms deal with mutable arrays, and a mere SSA conversion is clearly insufficient for turning Hoare's Quicksort into the clean functional version. Even using some variant of functional arrays for the CPS-conversion, one would miss the point by a mile.

Casual Observation

What's a casual observer? To somebody who doesn't know any programming whatsoever, or maybe just not any functional programming, sure, the two appear to have nothing to do with each other. However, would C really be any better, other than that more people have a passing familiarity with C?

However, at a glance, I *do* think that the two descriptions look alike, at least to somebody with some familiarity with Haskell and a basic understanding of the sieve. Perhaps this is another reason why the fake sieve went undetected for so long. It makes me wonder what other surprises might be lurking.

I interpret the paper's title as provocative, not prescriptive. Moreover, I don't think the author is trying to argue that her algorithm is the same as the classical one, merely that it's an acceptable substitute. Certainly, it is a "genuine sieve" of some variety.

Since sieves can be applied to other problems, e.g. factorization, and the sieves of both Melissa O'Neill and Eratosthenes deal only with primality, it seems reasonable to call it "a Sieve of Eratosthenes", even if it isn't "the Sieve of Eratosthenes."

After all, O'Neill's sieve is fully incremental: if you want all the primes between 1 and 10^9, you don't have to wait for every multiple of 2, 3, 5, and 7 to get crossed off before you get 53 back. So by the timing criterion, it certainly is a different algorithm.

It seems on-topic to me,

It seems on-topic to me, because there is a widespread belief that laziness brings programming much closer to mathematics, and this paper explores this belief using a well-known example. It also explores the bias functional programmers have towards list-based algorithms, even though other data structures are often a better fit. Most industry languages come with a broad set of collection libraries, and most industry programmers won't even consider a language without such collection support.

Well put.

You certainly made the case as to why this discussion is on topic. (Not that I ever really disagreed.) But as "algorithms" are "off topic," somebody needed to say it. Further meta-discussion should be relegated to the Meta LtU Thread.

And Mattias is absolutely right that most of the interesting imperative algorithms involve mutable arrays. In fact, my sieve, though not incremental, outperforms Melissa O'Neill's by a significant, but not unreasonable, constant factor for this exact reason. In the process of tearing down our biases toward linked lists, she also ironically confirmed them.

Now, for a bit of speculation, I've never really found a completely satisfactory approach for expressing graph algorithms in a purely functional way. The advantage of FP is relatively much greater for trees.

So, maybe finding better ways of expressing array algorithms "less imperatively" and "more declaratively" could lead to better ways of expressing graph algorithms.

Anybody have an opinion on GHC's Difference Arrays? I've looked at the source, and thought about trying to extend them to encompass union-find, but I've never really used them enough to know their strengths and weaknesses.

[edit: Mattias Engdegård said that, not naasking. Still, both comments are good. :-)]

seems important

to this grasshopper -- sure, fp is great in some (lots of) situations, but to be really honest and respectful one has to think about and address what happens in the situations where fp has issues.

Reminds me...

I've always had a similar feeling about the standard Haskell quicksort example, and as it happens, I was just having a conversation about it the other day... This bit in particular applies:

I would argue, however, that they are confusing a mathematical abstraction drawn from [quicksort] with the actual algorithm. The algorithmic details, such as how you [cleverly swap elements in-place using pointer arithmetic], matter.

The Haskell quicksort has the remarkable property of making the essence of the algorithmic idea much clearer, while simultaneously destroying the essential cleverness that makes it such a good idea. So what is quicksort, really?

Anyway I haven't read this paper yet, although hopefully I'll have a chance in the morning.

Implementation differences

Well, yes, the standard example of how to implement quicksort in Haskell is _not_ a good implementation. However, it is "morally" the actual quicksort algorithm. I remember one friend of mine who had implemented quicksort in C++ and Java but did not understand why it worked until he saw the Haskell solution.

The worst downfall of the quicksort example is that naive use of list concatenation leads to quadratic behavior. However, it _does_ make the exact same number of comparisons as the imperative counterpart.

If you fix the concatenation issue, you get something on par with quicksort, even though it is still not a great implementation. Ignoring laziness, I believe that the quicksort example has the same asymptotic running time, but uses appreciably more memory. (O(n * log n) instead of O(log n))

However, the situation with the "sieve" example is far worse. Superficially, it appears to be the same, but it's really just trial division in disguise, because filter p . filter q is the same as filter (\x -> q x && p x). (in terms of time, ignoring memory consumption.)
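
To make the nesting visible, here is the classic definition again with the comprehension written as an explicit filter (just a restatement for illustration, not code from the paper):

unfaithful :: [Integer]
unfaithful = sieve [2..]
  where
    sieve (p:xs) = p : sieve (filter (\x -> x `mod` p /= 0) xs)
-- Unfolding a couple of steps:
--   2 : 3 : sieve (filter (\x -> x `mod` 3 /= 0) (filter (\x -> x `mod` 2 /= 0) [4..]))
-- so every surviving candidate is pushed through one filter per prime
-- found so far; this is exactly trial division.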

And of course, the genuine sieve is appreciably better than trial division. Asymptotically, trial division tests all the numbers for divisibility by 2, 1/2 of the numbers for divisibility by 3, 1/2 * 2/3 of the numbers for divisibility by 5, 1/2 * 2/3 * 4/5 of the numbers for divisibility by 7, etc.

However, the actual sieve "tests" 1/2 of the numbers for divisibility by 2, 1/3 of the numbers for divisibility by 3, 1/5 of the numbers for divisibility by 5, 1/7 of the numbers for divisibility by 7, etc. As Melissa O'Neill points out, the exact nature of the "test" isn't essential.

Well, yes, the standard

Well, yes, the standard example of how to implement quicksort in Haskell is _not_ a good implementation. However, it is "morally" the actual quicksort algorithm. I remember one friend of mine who had implemented quicksort in C++ and Java but did not understand why it worked until he saw the Haskell solution.

And I know people who never understood why it worked well until they saw the C solution. I very much appreciate your use of the word "morally" in the same sense as the OP. So what is the morality of quicksort? Is it really quicksort if it's quadratic? Haven't we missed the point?

As for the sieve, I'll take your word for it, as I still haven't read the paper or given it much thought.

So what is the morality of

So what is the morality of quicksort? Is it really quicksort if it's quadratic? Haven't we missed the point?

I suppose that depends on the point one is trying to make. If your point is to actually implement quicksort, then yes, the Haskell example does miss the point.

However, the example is at least useful as first approximation. The reason behind the quadratic behavior has nothing to do with sorting. In some abstract mathematical sense, it distills the essence of the algorithm.

But yes, you don't fully understand quicksort until you understand the C version as well. In-place partitioning is an important aspect of a practical implementation.

For the benefit of lurkers, the example we are discussing appears in several places, such as A Gentle Introduction to Haskell

For example, here is a concise definition of everybody's favorite sorting algorithm:

quicksort  []           =  []
quicksort (x:xs)        =  quicksort [y | y <- xs, y<x ]
                        ++ [x]
                        ++ quicksort [y | y <- xs, y>=x]

Huh. I seem to remember reading commentary in my early days of Haskell saying that it isn't exactly fair to compare this to an imperative version, e.g. C, due to clever use of in-place partitioning, etc. Apparently it wasn't there.

In defense of the tutorial, it does describe this as a "definition" and not an "implementation". But it's easy to see how somebody might come away with false impressions.

Wadler

The standard quicksort is one of the classic examples of using an accumulating parameter (for example, Wadler's The concatenate vanishes):

qsort' [] ys     = ys
qsort' (x:xs) ys = qsort' [y | y <- xs, y <= x] (x : qsort' [y | y <- xs, y > x] ys)

I think in the end a quicksort is the quickie Haskell version, and a good quicksort involves the median of 3, avoiding small sorts, and pushing the final pass to insertion sort.... All depends on what you are trying to explain.

Good example.

It's a good example of how to fix the concatenation, but certainly not as useful for explaining quicksort, IMO.

Does this tickle anybody?

Not quite so neat as the "fake" sieve, but I would call this genuine (some optimizations at the expense of clarity are obviously possible):

sieve (_:ns) (False:ps) = sieve ns ps
sieve (n:ns) (True:ps) = n : (sieve ns $ zipWith (&&) ps $ cycle $ replicate (n-1) True ++ [False])
primes = sieve [2..] $ repeat True

Of course, it's still lists, and as O'Neill says, data structures matter...

It's still naive trial divisibility

I'll try to come up with a nice, convincing algebraic argument later, but this algorithm has the same issues as the fake sieve, dressed up a bit differently.

You avoid the "division" part of trial division; but the use of division is not what makes that algorithm slow, nor is avoiding it what makes Eratosthenes Sieve fast. It's the "trial" part that is costly.

The key is that, like trial division and unlike the Sieve of Eratosthenes, you are still testing every number n for divisibility by every prime less than n. Exactly how that is achieved is not so important.

Hmmm. Alright.

The algorithm feels more genuine because I am using a "cross off every n" approach, but you are right that I'm still doing an asymptotically large number of &&s.

Avoiding the log-log term

The O'Neill paper notes:

The (Eratosthenes) algorithm is efficient because each composite number, c, gets crossed off f times, where f is the number of unique factors of c less than square-root(c). The average value for f increases slowly, being less than 3 for the first 10^12 composites, and less than 4 for the first 10^34

This f is what contributes the log-log term to the time complexity of the Genuine Sieve - O(n log log n). The optimizations which sieve only odd numbers, or wheel-generated numbers, can be seen as a way to reduce the value of f in common cases, although the complexity remains unchanged.
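
For instance, a 2,3-wheel simply never generates the multiples of 2 or 3 in the first place (an illustrative sketch of the idea, not code from the paper):

-- After 2 and 3, candidates repeat with gaps +2, +4: 5,7,11,13,17,19,23,25,...
wheel23 :: [Integer]
wheel23 = 5 : 7 : map (+ 6) wheel23

Feeding a sieve from this list means two-thirds of the numbers (all of them multiples of 2 or 3) are never considered at all, which shrinks f in practice without changing the asymptotic complexity.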

But it is possible to strike off each composite number exactly once, and so remove the log-log term entirely. (And at the same time switch to a list-based implementation):

primes = 2 : diff [3..] (composites primes)

composites (p:ps) = cs
    where cs = (p*p) : merge (map (p*) (merge ps cs)) (composites ps)

merge (x:xs) (y:ys) | x<y           = x : merge xs (y:ys)
                    | otherwise     = y : merge (x:xs) ys

diff (x:xs) (y:ys)  | x==y          = diff xs ys
                    | x<y           = x : diff xs (y:ys)
                    | x>y           = diff (x:xs) ys

I believe the above algorithm has time complexity O(n). But can it still be regarded as the Sieve of Eratosthenes?

[edited to fix bug in diff - "<" and ">" were swapped - thanks Sjoerd!]

Computing every composite

Computing each composite number by multiplying its prime factors isn't part of the genuine Sieve, but this is very close to the spirit and... is a pretty cool algorithm.

Analysis?

It looks like, at the end of the day, you will have evaluated the thunks for each of these lists, at least up to N:

composites [2,3,5...]
composites [3,5,7...]
composites [5,7,11,...]
 ...

So you're saying the sum of the sizes of these lists as a fraction of N converges in the limit. Is this obvious?

Think Lazier

You're not thinking lazily enough. The 'composites' function only ever depends, in any immediate sense, on the first prime in the list and the first composite in the list, both of which Nick is careful to provide. The primes themselves are computed lazily. The 'sum of the sizes of these lists' never gets involved (which is good, because each list is infinite).

I'm not entirely convinced it is O(N) yet, since the time complexity of an algorithm can never be better than the space complexity of the same algorithm and I haven't examined how much space is being consumed by all the thunks (though I suspect the total space cost is less than O(N)). But the algorithm does compute each composite, and each prime, exactly once.

Too lazy

I think I was actually being too lazy - with my choice of words. I realize each of the lists computed by composites is infinite, and when I referred to their "sizes", I was referring to the size of the prefix actually computed (the portion up to the first element N or greater). Have you thought through the analysis of the algorithm? How many lazy cells get evaluated?

A simple bound

So for example, we can obtain the following bound easily. The number of elements in the list of composites made from the Kth or greater primes is less than N/K. Since the number of primes P is asymptotically N/ln N, and since the harmonic series grows like ln N, we have the number of evaluated thunks bounded at:

*edit: removed incorrect computation*

I think the operations of the algorithm can be put into correspondence with the thunks in those lists such that each thunk corresponds to a bounded number of operations, so this would put the runtime of this algorithm as at least as good as the Sieve's, asymptotically.

Lower Bound

  1. We know this computes each composite up to N exactly once.
  2. Each composite is produced by a number of multiplications based on its prime factorization
  3. Let c(N) be the sum of the exponents in N's prime factorization: 1 for prime numbers, 2 or more for composites
  4. Let sum_c(N) be equal to (SUM of c(k) for k from 1 to N)
  5. And average_c(N) is sum_c(N)/N
  6. So the average number of multiplies to produce composite numbers up to N is (average_c(N)-1)

This produces a lower bound on complexity (multiplies plus merge/diff operations) of N*average_c(N).
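
For anyone who wants a feel for average_c(N), here is a small, naive way to tabulate it (my own illustration, not part of the argument above); bigOmega is the usual Omega function, the sum of exponents in the prime factorization, matching c(N):

bigOmega :: Int -> Int
bigOmega = go 2
  where
    go _ 1 = 0
    go d k
      | d * d > k      = 1                     -- k itself is prime
      | k `mod` d == 0 = 1 + go d (k `div` d)  -- peel off one factor of d
      | otherwise      = go (d + 1) k

-- average of c(k) over 1..n (c(1) = 0 contributes nothing)
averageC :: Int -> Double
averageC n = fromIntegral (sum (map bigOmega [2 .. n])) / fromIntegral n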

Now, I don't know much math trivia, so I don't know how average_c(N) grows with N, but what I do know is that average_c(N) must be at least the average number of distinct prime factors (since the sum of exponents is at least the count of distinct prime factors). According to Nick, the prime factor duplication is, by itself, sufficient to result in numbers being crossed off an average of O(log log N) times.

So I suppose this algorithm is actually 'worse' than an addition-based solution. It's at least O(N log log N)

Poking a hole

Each composite is produced by a number of multiplications, but some of those multiplications are shared. For example, 5*(2*3) and 7*(2*3) both reuse the computation of 2*3. Thus it's not enough to merely count the average number of factors - you must consider how frequently they are reused.

(Also, if you can establish the N log log N lower bound, I think I've shown an N log log N upper bound in another part of the thread. That bound is based on what feel like pretty conservative estimates, though, so I'd find it a little bit surprising if the cost were actually N log log N)

Point granted. I didn't

Point granted. I didn't consider sharing of multiplications.

Based on sharing, and examining the algorithm again, it seems every composite number involves multiplying exactly once from a prior prime or composite number. Given that every composite is also reached only once, the number of multiplies is thus less than N to produce all primes up to N.

So that leaves only consideration of how many 'merges' the production of a composite number will suffer, but this also seems to exhibit the same sort of 'sharing'. It does seem that each (p*c) will be merged a number of times equal to the number of primes smaller than p, so the cost of computing a composite is: a multiply, plus a number of merge events proportional to O(log(p)) for the smallest prime factor p. That doesn't look to contribute much, if anything, to the time complexity, but I lack the skill to further explore it.

So, O(N) it could be.

Agreed

It does seem that each (p*c) will be merged a number of times equal to the number of primes smaller than p, so the cost of computing a composite is: a multiply, plus a number of merge events proportional to O(log(p)) for the smallest prime factor p.

Agreed, and this is equivalent to asking how many elements less than N are in those composites lists, since the Kth list consists of elements whose smallest prime factor is the Kth prime or greater. It looks potentially hard to me, so I'm done. It'd be easy to measure and graph performance, but that probably wouldn't do much good since, as it's been said, log log N has been proven to diverge but has never been observed doing so.

Avoiding the log-log term, but at what cost?

The complexity of O(n log log n) stated for the Sieve seems to be based on being able to perform the additions needed to iterate through the multiples of each prime in constant time. But for a large prime p, it will actually take O(log p) space to represent p and O(log p) time to add p to a multiple. I suspect that if you account for this cost of additions, O(n log log n) isn't even right. And since multiplication isn't a linear operation like addition, this algorithm would actually make the situation worse.

Operation Cost

I was actually thinking about this earlier, and then thought: but it still beats the division algorithm, and we get infinite lists in the bargain!

The cost of the grade-school integer multiplication algorithm is O(log A * log B) where A and B are the operands. In Nick's algorithm, we can guarantee that one operand (A) is a prime number, and that at least half of all results will be multiples of 2, a third of all remaining results will be multiples of 3, etc. I'm not quite certain how all this adds up, but it is probably far less than O(log^2 N) for result N.

The cost of grade-school integer addition is O(log A + log B) for operands A and B, given the need to produce a new representation. You should be able to do better than that by sharing representations when adding small numbers to large numbers.

I'd be interested in finding out how these differences affect the sieve.

Shorter version

Nobody tried this code? Diff has a bug. And when you fix it, you'll see that merge and diff are actually the same function, symmetric difference, or xor:

primes = 2 : xor [3..] (composites primes)

composites (p:ps) = cs
    where cs = (p*p) : xor (map (p*) (xor ps cs)) (composites ps)

xor (x:xs) (y:ys)  | x==y          = xor xs ys
                   | x<y           = x : xor xs (y:ys)
                   | x>y           = y : xor (x:xs) ys

When merging, the x==y part is never used, and when diffing, the x>y part is never used.

Very nice!

Very nice!

I'll bite


lps@azarel:~/Programs/snippets $  ghc -O2 -c MaybeSieve.hs 
lps@azarel:~/Programs/snippets $  ghci MaybeSieve.hs 
GHCi, version 6.8.2: http://www.haskell.org/ghc/  :? for help
Loading package base ... linking ... done.
Ok, modules loaded: MaybeSieve.
Prelude MaybeSieve> length (take 10000 nick's_primes)
10000
(0.40 secs, 0 bytes)
Prelude MaybeSieve> length (take 10000 sjord's_primes)
10000
(0.40 secs, 0 bytes)
Prelude MaybeSieve> length (take 10000 primes)
10000
(0.03 secs, 0 bytes)   [edit:  (41.15 secs, 2160096056 bytes) ]
Prelude MaybeSieve> take 10000 nick's_primes == take 10000 primes
False  [edit: True]
(0.01 secs, 0 bytes)
Prelude MaybeSieve> take 10 (xor nick's_primes primes)
[25,35,49,55,65,77,85,91,95,115] 
(0.01 secs, 0 bytes)
Prelude MaybeSieve> take 10000 nick's_primes == take 10000 sjord's_primes
True
(0.03 secs, 0 bytes)

Looks like you applied a correct transformation to incorrect code, Sjoerd. [edit: that would be a correct transformation to correct code]

As for being fast, it's not. [edit: fast is relative. It's not great, but it's not too unreasonable.] primes is the impostor sieve that we've been discussing, and it's over 10x slower than that. [edit: I got that sieve wrong. It's over 10x slower than nick's/sjord's] As for being linear, I'm pretty sure it's not. [edit: this is one of the few correct statements I made here] In fact, it appears to be asymptotically worse than the impostor. [edit: nope. turn that statement around] If we restart ghci:

Prelude MaybeSieve> length (take 50000 nick's_primes)
50000
(7.55 secs, 166816348 bytes)
Prelude MaybeSieve> length (take 50000 primes)
50000
(0.11 secs, 0 bytes)
Prelude MaybeSieve> nick's_primes !! 10000
104743
(0.01 secs, 0 bytes)
Prelude MaybeSieve> nick's_primes !! 50000
611957
(0.01 secs, 0 bytes)

That's 19x more time to solve a problem instance 6x as big. I hate being a curmudgeon, but I also hate not being rigorous. [edit: both true] :-)

Recheck?

Can you recheck your definitions file and then post the code if you don't find a problem? I copy and pasted the code for sieve in the original post with Sjoerd's code, changing only two quotes to backquotes, and the two produce equivalent lists of primes with Sjoerd's being much faster.

Oops, how embarrassing.

I thought that was kind of fast for the fake sieve, but I subconsciously assumed it was too simple to mess up. (Haha, right.) I forgot to recurse, so it only filters the multiples of 3 out of the odd numbers.

But the fact remains, there is no way this is linear. It's very similar to Richard Bird's code at the end of the paper, but it does improve on it, so not bad! I suspect it's asymptotically the same as Bird's, which Melissa O'Neill does discuss as being suboptimal.

So, Nick/Sjord's code: (performs identically)

ghci> let test f n = length (takeWhile (< n) (f ()))

ghci> test nick's_primes 100000
9592
(0.42 secs, 0 bytes)
ghci> test nick's_primes 200000
17984
(0.88 secs, 0 bytes)
ghci> test nick's_primes 400000
33860
(3.85 secs, 105371172 bytes)
ghci> test nick's_primes 800000
63951
(11.37 secs, 239707676 bytes)

Richard Bird's code from the end of the paper:

ghci> test bird's_primes 100000
9592
(1.03 secs, 50939444 bytes)
ghci> test bird's_primes 200000
17984
(2.33 secs, 122054156 bytes)
ghci> test bird's_primes 400000
33860
(5.47 secs, 304108996 bytes)
ghci> test bird's_primes 800000
63951
(13.11 secs, 785552660 bytes)

And, just for fun, my own sieve that I use from time to time. The expression sieve n finds the smallest prime factor of every number less than n, so it's a bit more general than Eratosthenes. It's not incremental in any way, shape, or form, but there are ways to extend the algorithm.

ghci> factor (sieve 100000) 3
[(3,1)]
(0.03 secs, 0 bytes)
ghci> factor (sieve 1000000) 3
[(3,1)]
(0.15 secs, 2002168 bytes)
ghci> factor (sieve 10000000) 3
[(3,1)]
(1.65 secs, 11002884 bytes)
ghci> factor (sieve 100000000) 360
[(2,3),(3,2),(5,1)]
(19.57 secs, 101004300 bytes)

Flaw in Test

I suspect that a lot of the difference you're seeing is due to heap-reallocation and associated garbage collection. The prior tests are affecting the latter tests by consuming heap and leaving garbage upon it that is only collected when running the latter tests.

Memory Management

I suspect that a lot of the difference you're seeing is due to heap-reallocation and associated garbage collection.

That's certainly part of it, but that's also an intrinsic part of the respective algorithms.

Nick's code appears to allocate about 1/3 of the memory of Bird's code, but probably has basically the same runtime overall. That's probably why it does comparatively well in the beginning, but quickly catches up. Even so, reducing memory consumption is a good thing, even if you don't save much time.

My sieve doesn't create any garbage. The only heap allocation is a single unboxed ST array. Then after it's done filling it, it uses runSTUArray to convert that to an immutable array without copying. Two simple tricks are used to reduce the size of the array by 75%.
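
For reference, the skeleton of that approach looks something like this (a minimal sketch with my own names, not the poster's actual code, and omitting the two array-shrinking tricks):

import Data.Array.ST (runSTUArray, newArray, readArray, writeArray)
import Data.Array.Unboxed (UArray, assocs)

-- One unboxed mutable array, filled inside ST and frozen in place by
-- runSTUArray; no intermediate lists, no garbage.
sieveUpTo :: Int -> UArray Int Bool
sieveUpTo n = runSTUArray $ do
  arr <- newArray (2, n) True
  let loop p
        | p * p > n = return arr
        | otherwise = do
            isPrime <- readArray arr p
            if isPrime
              then mapM_ (\m -> writeArray arr m False) [p * p, p * p + p .. n]
              else return ()
            loop (p + 1)
  loop 2

primesUpTo :: Int -> [Int]
primesUpTo n = [p | (p, True) <- assocs (sieveUpTo n)]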

But even disregarding the garbage avoidance and the significant constant-factor advantages of arrays, the algorithm my code uses has the same asymptotic complexity as Eratosthenes' sieve. Bird's code only does a little bit better than trial division.

The prior tests are affecting the latter tests by consuming heap and leaving garbage upon it that is only collected when running the latter tests.

Doubt it. In fact, the first expression run during a session will often take a bit longer, as GHCi might need to load a package.

Unlike Hugs, GHC's garbage collection is fast, and GHCi does not offer :gc to force a collection. Since humans type slowly, there is almost always at least a partial GC done between expressions. I haven't exactly figured out what the bytes reported by GHCi mean, but I believe it's the amount of garbage collected while the expression was running. (There may be a race condition in there to confuse things a bit.)

Besides, on the latter examples, there was far more memory collected than the standard 64M heap that I was working with. The collector ran multiple times, and the times were pretty consistent.

"There is no way this is linear"

I was going to scoff that you couldn't distinguish ln ln N from a constant, but I was getting similar numbers to yours (measured in operation count rather than just runtime), so I revisited the proof of my bound above and realized it has a big hole (my turn for oops!). It actually looks likely to me that one could work out the exact asymptotic runtime, but I suppose the lesson I should take away from this is that math should be done carefully or not at all.

I suspect you're right that it's not linear or even N ln ln N.

Someone ...

should really build a library on top of Haskell, like HUnit, that would experimentally supply a good estimate of the complexity of a given algorithm.

I concur

It's a wonderful idea

Maybe

You should contact the author, Melissa O'Neill.

O(n log^2 n)?

The complexity of this function is O(n) times the average value of pi(lpf(n)) restricted to the composite numbers, since lpf(n) determines how many different (composites (p:...)) lists n occurs in. Here lpf(n) is the least prime factor of n. Unfortunately, I don't know how to estimate that analytically. A numeric estimate out to 20000000 makes me think it's between O(n log^2 n) and O(n log^3 n). Here are a few more details: http://naml.us/blog/2009/01/sieving-primes.
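
A naive way to spell out exactly the quantity being averaged (my own throwaway check, not from the linked post):

-- least prime factor of k (k >= 2)
lpf :: Int -> Int
lpf k = head [d | d <- [2 ..], k `mod` d == 0]

-- prime-counting function pi(x), done naively via lpf
primePi :: Int -> Int
primePi x = length [p | p <- [2 .. x], lpf p == p]

-- sum of pi(lpf(k)) over the composites k <= n
costEstimate :: Int -> Int
costEstimate n = sum [primePi (lpf k) | k <- [4 .. n], lpf k /= k]

It is far too slow for serious measurement, but it makes the quantity concrete.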

Great analysis

the total cost due to k is roughly π(lpf(k)) where lpf(k) is the least prime factor of k.

Why? Is it an estimate of the length [of the composite list where k was generated]? If so, I think you might not take sharing into account in your summation.

[My, very simple, rough estimate was something like O(π(n) + 2 * (n - π(n))) is O(2n - π(n)).] [Just counting the number of unique multiplications] [I guess 2 should be a constant C]

O(n)? Another argument?

Rationale:

The list of prime numbers up to n is the list of numbers (O(n), trivially) minus the list of composites.

The composite list of multiples is generated recursively from the list of primes. That composite list is the merger of three strictly increasing lists whose heads are known. Thus(?), to determine the n-th composite, O(n) comparisons are needed; or, it takes constant time to determine the next composite [if the list of primes were known].

Let c0 be the constant to generate the next natural number.

Let c1 be the constant to generate the next composite. [c1 > c0]

Then it takes approximately c0*π(n) time to generate all primes, and c1*(n-π(n)) time to generate all composites below n.

Thus c0*π(n) + c1*(n-π(n)), or approximately c0*n/ln(n) + c1*n - c1*n/ln(n) = c1*n + (c0-c1)*n/ln(n), time to generate all primes below n. Thus O(n).

[Uh, edited the last formula substantially.] [Hmpf, I am not taking into account that π(n) lists should be merged; but it might be that it doesn't matter.]

[?It's O(n), trivially, since if n grows, the average distance between prime numbers grows, therefore the complexity 'grows' to the complexity of generating natural numbers.]

I don't think it's O(n)

I think the sequence plotting that's been done in this thread all points to it not being O(n). David, Geoffrey, and I have each posted an analysis to this thread (Geoffrey's, by link, is the best presentation), and all come to the same conclusion: it's the sum of a certain series. I suspect you'll need to do a bit of harder math to convert that to a simple closed-form answer. The problems I see with your argument are:

That composite list is the merger of three strictly increasing lists where the head of the list is known.

Except you need to account for how long it takes to compute the three lists - one of those is a recursive call.

It's O(n), trivially, since if n grows, the average distance between prime numbers grows, therefore the complexity 'grows' to the complexity of generating natural numbers.

I don't really understand this reasoning. Is this possibly the fallacy that if a sequence is ever increasing, then it is unbounded? (which isn't true since you can have a sequence like 1, 1.5, ... ,2 - 1/n, ..., which approaches but never reaches 2). [Well, probably not, since the average separation between primes actually does increase without bound. So I just don't understand this reasoning.]

I think the sequence

I think the sequence plotting that's been done in this thread all points to it not being O(n). David, Geoffrey, and I have each posted an analysis to this thread (Geoffrey's, by link, is the best presentation), and all come to the same conclusion: it's the sum of a certain series. I suspect you'll need to do a bit of harder math to convert that to a simple closed-form answer. The problems I see with your argument are:

That composite list is the merger of three strictly increasing lists where the head of the list is known.

I have to admit, I didn't read all the analysis.

Except you need to account for how long it takes to compute the three lists - one of those is a recursive call.

It shouldn't (ok, might not) matter; with sufficient sharing, it is only relevant how many terms are generated: at most one term for every number below n, each of which is either a prime or a composite. Since you cannot evaluate more, and every term is evaluated only once [and every composite number is generated uniquely?], the above argument might hold. But I do admit that I don't trust the argument fully myself.

It's O(n), trivially, since if n grows, the average distance between prime numbers grows, therefore the complexity 'grows' to the complexity of generating natural numbers.

I don't really understand this reasoning. Is this possibly the fallacy that if a sequence is ever increasing, then it is unbounded? (which isn't true since you can have a sequence like 1, 1.5, ... ,2 - 1/n, ..., which approaches but never reaches 2). [Well, probably not, since the average separation between primes actually does increase without bound. So I just don't understand this reasoning.]

I tried to find an intuition for why it might be O(n). Since a new list of composites is generated for each prime, and the density of primes goes to zero asymptotically, I thought that might explain why you end up with O(n). But, again, I have to admit I don't trust the argument myself. There was a question mark ;o).

Hm, this actually might be very wrong reasoning ;-).

In the end it all boils down to: does it take constant time to generate the next composite number? Which might not be true, since you end up merging pi(sqrt(n)) lists.

[Nah, too much BS. Will read up on that Bird analysis again and come back...]

How many terms are generated

It shouldn't (ok, might not) matter, with sufficient sharing, it is only relevant how many terms are generated: which is at most one term for every number below n, which is either a prime or a composite. Since you cannot evaluate more, and every term is evaluated only once, the above argument might hold. But I do admit that I don't trust the argument fully myself.

The algorithm doesn't just generate a list of primes and a list of composites. If you look at it, 'composites' is a function that generates a list of composites using only a given list of prime factors, and it is invoked by the algorithm for every tail of the list of primes. That is, the algorithm (eventually, if a large enough N is demanded) invokes composites [2, 3, 5, 7, ...], composites [3,5,7, 11, ...], etc. For every natural K, there is some composite that appears in at least K merged lists.

Thus, it certainly does not take constant time to generate the next composite number. But this does not settle whether or not the whole thing is O(n) - the equivalent question is whether it takes some fixed constant time on average.

I might be stupid

But...

I don't fully follow your line of reasoning. If I evaluate the length of a given list [1..10], it will invoke length [1..10], length [2..10], etc. But that still means it is O(n), right?

*scratch*

You are right. Uh, maybe.

How silly of me

I misread your statement. I didn't understand that primes is invoked twice in every invocation of composites.

[Forget it, again, will read up on the Bird analysis, then come back.]

I reread the analysis by Geoffrey. I think he is right. And by now I am certain it cannot be O(n) since an increasing number of lists is merged.

However, there is an image of evaluated comparisons here.

**************************************************************
The haskell program
**************************************************************
module Main where

import System.IO as IO
import System
import Debug.Trace

primes = 2 : xor [3..] (composites primes)

composites (p:ps) = cs
    where cs = (p*p) : xor (map (p*) (xor ps cs)) (composites ps)

xor (x:xs) (y:ys) | x==y = trace "=" ( xor xs ys )
                  | x<y  = trace "<" ( x : xor xs (y:ys) )
                  | x>y  = trace ">" ( y : xor (x:xs) ys )

l n = length (take n primes)

main =
    do
        args <- getArgs
        if length (args) == 0
            then do
                    print "usage: primes n"
            else
                 let n = read (args !! 0)::Int in
                 do
                    print ("arg+1 = " ++ show (l n))

**************************************************************
The bash program, you can cat it to 'out.txt' for example:
**************************************************************

#!/bin/bash

for (( I = 3 ; I < 1000000000 ; I = (I * 3) / 2  ));
do
J=`a.out $I | wc -l`;
echo $I $J;
done

**************************************************************
To build the picture:
**************************************************************

graph --output-format "gif" out.txt > comparisons2.gif

**************************************************************

[ Fixed some <s] [And even more]

And by now I am certain it

And by now I am certain it cannot be O(n) since an increasing number of lists is merged.

Actually, an unbounded number of lists being merged doesn't imply that it's not O(n). Here's a counter-example:

evens = multiplesof 2
multiplesof k = k : merge [k*n | n <- [2..]] (multiplesof (2*k))

Hehe

So we're still out in the open on this one? My computer lacks processing power to really evaluate a large number of terms. From the experiments it seems it can go both ways.

[Is that really O(n)?] [Again, I give up for now. Need to read up on this stuff.]

Is that really O(n)? Yes.

Is that really O(n)?

Yes. There are N/2 evens, N/4 multiples of 4, N/8 multiples of 8, etc. If you add up these terms up to the first power of 2 greater than N, you get something less than N (and rounding up when you divide only adds about log_2(N) more). Note: this is just because 1/2 + 1/4 + 1/8 + ... is a geometric series that converges to 1.

Guess so

The length argument I got, but I wasn't sure you could discard the cost of merging the log(N) lists.

At the moment, I am slightly more than a bit baffled by the other problem.

[Reread the other posts. I am replaying the arguments. So I'll look at the cost of merging.]

Oops: O(n^1.5 / log^3 n)

My previous complexity guess was wrong (though in fairness it was only a numerical guess). Here's an analytic analysis resulting in O(n^1.5 / log^3 n):

http://naml.us/blog/2009/01/sieving-primes-2

In case others are still confused about the sharing analysis: it's impossible to share the cons cells from different calls to (composites (p:...)), because they belong to _different infinite lists_. If they were shared, it would imply that if you go out far enough, the set of numbers that can be constructed from all primes > p is the same as the set of numbers that can be constructed from all primes > q. If p != q, this is false. Apologies if someone else already made this argument.

Thanks

Nice job

A minimalist Eratosthenes Sieve

I'd like to point out another implementation of the Sieve of Eratosthenes: in the true spirit of the original algorithm, the implementation uses neither division nor multiplication operations. In fact, the algorithm doesn't even use general addition or subtraction, relying only on successor, predecessor, and zero comparison. The algorithm easily generalizes to other number sieves (e.g., computing lucky numbers).

http://okmij.org/ftp/Algorithms.html#Eratosthenes-sieve
http://okmij.org/ftp/Haskell/number-sieve.lhs

Thanks Oleg

Thank you for calling this to my attention. At some point, I definitely want to take a careful look at this. Off the cuff, I'm guessing it has rather high constant-factor performance issues, but that's not why this appears to be very interesting to me.

[I'm very much reminded of the wisecrack in The Evolution of a Haskell Programmer that "graduate education tends to liberate one from petty concerns about, e.g., the efficiency of hardware-based integers"]

Sorry Oleg, it's just naive trial divisibility

Ok, I took a good look at it. Your primality sieve suffers from the same problem as Tim Band's comment above. You say that you don't do any operations on the composite numbers, but you do: you traverse them repeatedly.

However, you are right to point out that certain aspects of either the Sieve of O'Neill or the Sieve of Eratosthenes don't generalize in a nice way. I'll have to think about how to implement a lucky number sieve well.

Wikipedia says that the progenitors of the lucky numbers suggested calling it the Sieve of Josephus Flavius, which recalls Chris Rathman's post here. Some interesting stuff to think about!

Different motivation

I thought his post was a little bizarre too when I first encountered it in the context of this thread (avoiding addition through repeated increment isn't much of an optimization), but I don't think he was tossing this out as an exemplar of speed. I think the motivating factors here are simplicity, in a from-first-principles sense, and generality. Take the last sentence of his description from the link:

Thus the algorithm can be used with Church and Peano numerals, or members of Elliptic rings, where zero comparison and successor take constant time but other arithmetic operations are more involved.

Re: different motivation.

I don't entirely disagree.

However I think Oleg is wrong on this count. My off-the-cuff guess was wrong, which I have to admit was based on Oleg's history of being correct rather than the content of that particular comment.

The alleged Sieve of Oleg is in the same asymptotic ballpark as the impostor sieve of David Turner, as well as Tim Band's sieve, none of which come close to reasonably implemented trial division, let alone a genuine sieve.

I have looked at elliptic rings, however briefly. Lucky numbers are new to me, and I'm not aware of the best known algorithms in this area.

The complicating factor with the lucky number sieve compared to the Sieve of Eratosthenes is that you have to know the locations of the existing lucky candidates to cross off your target. The "obvious" solution is mighty similar to Oleg's. However... maybe it's not too hard to modify the Sieve of O'Neill to handle the lucky numbers?

On what count?

However I think Oleg is wrong on this count.

On what count? I agree Oleg's algorithm has more than constant factor issues, but I don't see where Oleg claimed that it didn't.

Context

Did David Turner ever come out and explicitly state that his classic example was within a constant factor of the genuine Sieve of Eratosthenes? I'm not aware that he did.

Oleg might not have intended to imply such a statement, but without it, the surrounding context feels especially bizarre. The subject starts with the phrase "Even better Eratosthenes sieve" and explicitly mentions Melissa O'Neill's paper, but then, by the criteria of that paper, proceeds to commit the same fallacies.

Admittedly, the only concrete, definitive thing I can say that Oleg got wrong was his point (i), that his code performs no operations on the composite numbers. But that seems to support my overall thesis.

It's an interesting, if flawed, article. I don't think Oleg's implementation of the lucky number sieve is particularly profound, but on that count, his overall thought process is correct. I am definitely still grateful that he brought the subject to my attention.

I think the productive path for this conversation to take is to focus on the lucky number sieve and how it might be implemented.

No, it's still a sieve, just with a (little) bug

No, it's a sieve nevertheless, with a little bug easy enough to fix. :)

The problem is that the naive code sieves prematurely. The sieve on each prime that is found should be delayed until its square is seen: not only must its work start after that square, the sieve itself must be started there. That is the key point missing from the article.

The fixed code is plenty fast and clear enough for an introductory code:

 primes   = 2 : primes'     
 primes'  = 3 : sieve primes' (map (^2) primes') [5,7..] 
 sieve (p:ps) (q:qs) xs  
          = let (h,t) = span (< q) xs 
            in h ++ (sieve ps qs   
                     . filter ((/=0).(`rem`p)) . tail) t  

Before you point out the `rem` issue: on today's CPUs it's very fast and local, so it's a non-issue, at least while we're inside the Int32 range. The exact nature of the check of whether to scratch the number out or not - be it done by counting from the previous composite, comparing it to the PriorityQueue's head, or by checking the remainder - is immaterial as long as it is O(1).

The use of 'filter' has its benefits in that it culls its input sequence so that each sieve leaves less work to be done by all its successor sieves.

The real issue - that which the article does improve upon - is that implicitly at run-time all the sieves form a linear, nested structure, and each is consulted for each prime number produced.

I think the article's point is in realizing this linearity and explicating this control structure (as e.g. a list passed as a function argument - the article skips this step): as if turning the nested chain of filters, each with its own predicate, into one filter driven by a list of those predicates - and then transforming this explicit list control structure into a priority queue, so that only a small, near-constant number of top entries is consulted for each prime to be produced.

It is as if a person who scratches the numbers by hand were to keep the list of all active sieves as a table on the side, and for each incoming number consult each of its entries to see whether to scratch the number or not.

Now the incremental improvement step is immediately apparent: keep this table partially sorted, as a PriorityQueue, so that we only need to consult a small near-constant amount of its top entries each time (or even better, using this PQ to generate the next number, possibly skipping over several adjacent composites at once). That's the key to the article's sieve optimization. It makes it much better, asymptotically, but much less suitable as a piece of introductory code.
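As a rough illustration of that "one filter by a list of predicates" idea (a sketch of my own, not code from the article or from the posts above): the nested chain filter p1 (filter p2 (... xs)) can be flattened into a single pass over an explicit list of predicates, which is the control structure that then gets turned into a priority queue.

-- a single filter driven by a list of predicates, instead of a nest of filters
sieveBy :: [Integer -> Bool] -> [Integer] -> [Integer]
sieveBy preds = filter (\x -> all ($ x) preds)

-- e.g.  sieveBy [ (\x -> x `rem` p /= 0) | p <- [2,3,5,7] ] [9..]
-- yields the numbers from 9 upward that none of 2, 3, 5, 7 divide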

As for the O(n) issue, I'm confused: is it supposed to be 'n' as in n-th prime, or N as in its value? I think it's the former in O(n log log n) of the PQ algorithm. As for Nick's code, it produces all the composites and primes before any given prime, so is at least O(N), which is guaranteed to be worse than O(n*log n). Is it not?

I realize a long time has passed since this discussion ended, but maybe someone is watching...

Still just an unfaithful sieve. ;-)

What you've implemented is trial division, but not a grossly naive trial division like the classic example.

Fast is relative; it's true that remainder is pretty fast on modern processors, but it's still quite a bit slower than addition or multiplication. But this is all quite irrelevant, because the trial part of trial division is what is slow; exactly how the trials are performed is not so important.

You are definitely thinking along the correct lines with regard to the algorithms being discussed, but you also seem to be missing some of the subtleties. For example:

The use of 'filter' has its benefits in that it culls its input sequence so that each sieve leaves less work to be done by all its successor sieves.

If this were truly important, then starting each filter earlier would be a benefit, would it not? But as you correctly realize, you can substantially improve the run-time performance of the sieve by delaying the start of each filter. This saves time because each prime number gets copied once per prime less than the square root of the number, instead of getting copied once per prime less than the number. The Sieve of O'Neill completely eliminates the copying introduced by the use of filter.

Now the incremental improvement step is immediately apparent: keep this table partially sorted, as a PriorityQueue, so that we only need to consult a small near-constant amount of its top entries each time (or even better, using this PQ to generate the next number, possibly skipping over several adjacent composites at once). That's the key to the article's sieve optimization. It makes it much better, asymptotically, [...]

You don't seem to fully appreciate Melissa O'Neill's contribution on this count. It's much more than "searching": she examines those numbers that are prime divisors of a given candidate, and only those numbers.

[...] but much less suitable as a piece of introductory code.

You might be right. I do think that the original unfaithful sieve and the (imperative!) Sieve of Eratosthenes are excellent introductory material. Honestly the imperative algorithm is a bit easier to grasp than the Sieve of O'Neill, and I think it's a good motivating example for imperative programming.

In my honest opinion, your code is quite elegant, though a little unpolished. I approve of the use of circular programming introduced by primes'. In fact, I've used a similar trick to implement a Lucky Number Sieve here. The definition, on the other hand, could be cleaned up in a few ways. You would probably eke out a bit more performance in the process.

As for the complexity of Nick Chapman's code, most of the discussion above is bogus. I recommend reading Geoffrey Irving's blog posts, as I do believe his analysis is correct.

Lazy Analysis


As for the complexity of Nick Chapman's code, most of the discussion above is bogus. I recommend reading Geoffrey Irving's blog post, as I do believe his analysis is correct.

The most enlightening aspect of this discussion is that analysis of the performance of lazy algorithms can be very hard. This is what keeps me an imperative language programmer.

Strictness is not exclusive

Strictness is not exclusive to imperative programming.

Where are the strict non-imperative languages?

Where are the strict non-imperative languages? I'm sure that some of these languages exist, but are they industrial strength, and widely used or supported? (I consider Lisp and SML and CAML imperative -- these are languages that I do use.)

(I consider Lisp and SML and

(I consider Lisp and SML and CAML imperative -- these are languages that I do use.)

How are SML and OCaml imperative languages? They are strict functional languages, with some extensions which permit imperative programming, but by no means encourage it.

Strict version of Sieve of Eratosthenes

If I were coding the Sieve of Eratosthenes in SML or CAML, I would use mutable arrays and the code would look very much like any imperative language implementation.
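For concreteness, here is roughly what such a mutable-array sieve looks like. This is only a sketch, and it uses Haskell's ST arrays rather than SML or CAML, but the shape would be essentially the same in any strict language with mutable arrays:

import Control.Monad (forM_, when)
import Data.Array.ST (runSTUArray, newArray, readArray, writeArray)
import Data.Array.Unboxed (UArray, assocs)

-- the classic imperative Sieve of Eratosthenes over a mutable boolean array
primesUpTo :: Int -> [Int]
primesUpTo n = [i | (i, True) <- assocs flags]
  where
    flags :: UArray Int Bool
    flags = runSTUArray (do
      a <- newArray (2, n) True
      forM_ [2 .. floor (sqrt (fromIntegral n :: Double))] $ \i -> do
        isPrime <- readArray a i
        when isPrime $
          forM_ [i*i, i*i + i .. n] $ \j ->
            writeArray a j False
      return a)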

I would use mutable arrays

I would use mutable arrays and the code would look very much like any imperative language implementation.

For an efficient implementation, perhaps it would, but so what? You implied strict languages and imperative languages are synonymous, but they are not. That's all I was pointing out when I replied to you, so I will now take your statement to mean, "This is what keeps me a strict language programmer," which makes much more sense.

Strict Unfaithful Sieve in R

To quote Alan Perlis: "A Lisp programmer knows the value of everything, and the cost of nothing" ;-)

I'm not yet convinced that algorithmic analysis is *that* hard in functional programming; it's more that FP allows your mind to gloss over the details of how something is computed, which means it's much easier to think you are doing it one way when in fact you are doing it completely differently. Certainly we could use some more insight into reasoning about the "how", but this is somewhat complicated by the fact that FP implementations have much more freedom in how things are ultimately executed than a C compiler does.

One of my best friends is an avid R programmer. Recently he solved the first few Euler problems, and in the process he independently re-invented the unfaithful sieve, except in a strict setting. When he complained how badly his "sieve" was performing, I explained why, and told him that even though R implements imperative programming poorly (which has the interesting sociological benefit of encouraging people to learn how to program functionally), this really was a case where he wanted to use it.

I predict that the unfaithful sieve is going to be one of those ideas that gets re-invented many, many times over the future history of computation. :-)

Any sequential sieve is trial division, and manual sieve is sequ

Any sequential sieve is trial division, and a manual sieve is naturally sequential.

When a person sets out to scratch out all the multiples of primes from a table of numbers, he must count them all - even when falling on an already scratched-out number he still stops at it, and continues counting from there (or else his counting will get out of sync). The only real difference with the Priority Queue version is that it can cut the testing short: it is structured so as to guarantee the validity of all the numbers with only a few checks performed (by having all the next composites present in the queue, so that the first hole is guaranteed to be prime).

In fact I doubt it can be considered a sieve in a traditional sense, because it replaces counting along an increasing sequence of numbers with maintaining, in a sophisticated way, the table of next composite values and comparing those with each new number's value. The original sieve was even oblivious to the numbers' values. It could have empty boxes in place of them, and still work.

The person would go on scratching each third number starting from 3, and each fifth number starting from 5, on the same table, thus naturally repeating the tests for all primes found so far. So I don't actually think O'Neill's code is a sieve in that sense. A normal person 2300 years ago wouldn't work that way, but instead in a sequential manner. And any sequential sieve is a trial division, reorganized.

That was not the central critique of the naive implementation at the start of the article; rather, it was its horrendous over-working: starting each sieve horribly early, eons in advance, and thus making it very slow. My code addresses that issue in a simple and clear manner, making apparent the delaying of work to start just at the right moment. It is short and simple enough to be used as an introductory example of "the functional code" - to replace the very wrong naive version. I expected to see something similar in the article before it goes on to the more sophisticated optimizations. This would also make the performance comparisons more sensible.

BTW I was quite certain someone wrote something like it already when I tinkered with it some 6 or 7 years ago. I didn't even bother to check.

And it is on par with many other versions in efficiency, when actually tested. Here's some data:

   1000s:  8    16   32   64  128   256   512   1024
delayed   0.07 0.25 0.76 1.95 5.40 14.98 43.07 123.24
Bird's    0.08 0.25 0.53 1.36 3.42  9.19 23.32  66.93
PQ        0.10 0.22 0.53 1.28 2.95  6.40 14.84  35.84
Nick's_3  0.08 0.24 0.67 1.58 5.24 11.77 40.51  ----
                                        2.24G  2.5G swap, 8% CPU

Here Nick's_3 is Nick's code improved to work on odds only (making it 4/3 faster):

  primes  = 2 : primes'  where 
    primes' = 3 : diff [5,7..] (composites primes') 
    composites (p:ps) = cs
       where
          cs = (p*p) : merge (map (p*) (merge ps cs)) (composites ps)

and the rest of it the same. (somehow it takes up much more memory than the others, and so eventually stops working).

BTW I don't know how to make my code better. It took me quite some time to get at its current version, and it is only "easy and apparent" in retrospect to me. :)

I tried counting (as in "true" sieve) and marking the scratched-out numbers as zeroes, but it was slower than the filter version - because I had to count on the non-culled sequence, and thus many numbers were marked as 0s multiple times. In the filtered version any composite that's struck out isn't checked for anymore by all the other filters - that gives it some boost. All the primes do go through all the filters, of course, and that's what the PQ version improves upon so drastically (in asymptotic sense).

I tried to turn it into explicit filtering, as I mentioned in the first post above. It was slower. I tried to count with an ever growing gob of explicit counters (in a list, sadly!). It was slower still. PQ version beats all that, after a certain point. :)

The use of 'filter' has its benefits in that it culls its input sequence so that each sieve leaves less work to be done by all its successor sieves.

If this were truly important, then starting each filter earlier would be a benefit, would it not?

No, it would be much worse, because each filter filters out composites first and foremost. So each composite is dismissed only once from the input sequence, unlike the counting versions, which must count over the non-culled input sequence, so many composites get marked multiple times (once for each of their prime factors); and starting a filter prematurely means that it will work hard for nothing - much earlier than its turn to contribute to the elimination of composites.

you can substantially improve the run-time performance of the sieve by delaying the start of each filter. This saves time because each prime number gets copied once per prime less than the square root of the number, instead of getting copied once per prime less than the number.

Yes, for example, to get at the prime 7927, which has 1000 primes before it, we only need 25 sieves, not a 1000 that the unfaithful version starts. It is specifically unfaithful to the original algorithm in its disregard for the square root and in starting all the sieves immediately - which issues my code addresses fully, of course.

I'm not sure there must be any "copying" going on though, evaluation is triggered by access and filter just gets to the original storage eventually. The real problem is that all these filters are organized linearly, implicitly, since they are created by the nested recursive calls. They (can) all sit on the same input storage though. They are all working hard, too, so we better keep their number to a minimum.

keep this table partially sorted, as a PriorityQueue, so that we only need to consult a small near-constant amount of its top entries each time [...]. That's the key to the article's sieve optimization. [...]

You don't seem to fully appreciate Melissa O'Neill's contribution on this count. It's much more than "searching": she examines those numbers that are prime divisors of a given candidate, and only those numbers.

That is exactly what I said, only not in the same words. I don't think I was referring to any "searching". The key is the abandonment of the sieve-like counting, and summing up the values instead, maintaining those values in the PQ and thus utilizing the natural order of things to only examine the top entries of the PQ, which correspond to the prime factors of the incoming number -- and only those prime factors. The rest being (at the moment) kept deeper inside the Queue, each with its next composite that it produces as its key.

Sorry for being so verbose. If it is a faux-pas here please tell me so that I won't do that again. :) I realize this is all pretty trivial stuff, but very much appreciate the possibility to discuss it still.

For testing I ran ghc -c -O3 primes.hs and then loaded it into GHCi.

Not really

Well, the obvious improvement I see with your code would be to eliminate the use of qs, which only increases memory allocation. However, this should be short-lived memory, and so it might not substantially impact performance.
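For illustration only (my rewrite of the code under discussion, not something benchmarked above), dropping qs and computing the square inline would look like this:

primes   = 2 : primes'
primes'  = 3 : sieve primes' [5,7..]
sieve (p:ps) xs
         = let (h,t) = span (< p*p) xs
           in h ++ (sieve ps . filter ((/=0).(`rem`p)) . tail) t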

Any sequential sieve is trial division [...]

Definitely not. There is no trial in the Sieve of Eratosthenes. When you visit a number, you are definitely removing it. Although many numbers will get "removed" multiple times, once for each of its prime divisors less than the square root of the number, we will see this is not terribly important.

For comparison, with the unfaithful sieve, you visit numbers that are not crossed off, and decide to leave them alone.

You might use division in segmented sieves, because if p is less than the size of your segment, then you know that the resulting number gets crossed off. It might make sense to keep primes larger than your segment size in a priority queue.
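A rough sketch of sieving one segment, just to make this concrete (the names and structure are mine, not taken from any code discussed here): division is used once per small prime to find where its multiples enter the segment, and from then on it is plain crossing-off.

import Control.Monad (forM_)
import Data.Array.ST (runSTUArray, newArray, writeArray)
import Data.Array.Unboxed (UArray, assocs)

-- sieve one segment [lo..hi], given basePrimes = all the primes up to sqrt hi
sieveSegment :: [Int] -> Int -> Int -> [Int]
sieveSegment basePrimes lo hi = [i | (i, True) <- assocs marks]
  where
    marks :: UArray Int Bool
    marks = runSTUArray (do
      a <- newArray (lo, hi) True
      forM_ basePrimes $ \p -> do
        -- one division per prime, to find where its multiples enter the segment
        let start = max (p*p) (((lo + p - 1) `div` p) * p)
        forM_ [start, start + p .. hi] $ \j ->
          writeArray a j False
      return a)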

In fact I doubt it can be considered a sieve in a traditional sense [...]

I would definitely call it a sieve. See my comment above

[...] So each composite is dismissed only once from the input sequence [...]

This is not important. The number of distinct prime divisors of a number grows much more slowly than the inverse of the factorial function, which itself is sub-logarithmic. By contrast, the number of trial divisions needed to establish primality is O((n / log n)^(1/2)) worst-case, which occurs when investigating prime numbers and their squares. Even on composites, the expected number of trial divisions grows much faster than the number of distinct prime divisors. For example, certifying a prime near 10^6 by trial division takes 168 divisions (one per prime below 1000), while no number below 10^6 has more than 7 distinct prime factors, since 2*3*5*7*11*13*17*19 already exceeds 10^6.

I'm not sure there must be any "copying" going on though, evaluation is triggered by access and filter just gets to the original storage eventually. [...]

Under a lazy regime, evaluation of a filter triggers copying. Read chapters one and two of Chris Okasaki's Purely Functional Data Structures. His thesis of the same name is available online, but the thesis does not contain those chapters.
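To see where the copying comes from, recall how filter itself is defined (the standard textbook definition, shown here only to make the allocation explicit):

-- every element that is kept is re-wrapped in a freshly allocated (:) cell,
-- so the retained spine of the list is effectively copied
filter' :: (a -> Bool) -> [a] -> [a]
filter' _ []     = []
filter' p (x:xs)
  | p x          = x : filter' p xs    -- new cons cell allocated here
  | otherwise    =     filter' p xs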

You might be able to find an interesting combination of lazy and call-by-name evaluation that would eliminate much of this copying, but that shouldn't improve the asymptotic performance of this example. You are still paying a heavy price for the mere act of traversal. If you are interested in exploring this idea, I suggest writing proof-of-concept implementations in Scheme, or possibly ML or Factor. Those choices should work well for exploring combinations of both lazy evaluation and call-by-name.

And it is on par with many other versions in efficiency [...]

On my computer, Melissa's code takes 0.2 seconds to find all the primes up to 1,024,000, and your code takes 0.6 seconds. To find all the primes up to 102,400,000, Melissa's code takes 10 seconds, and your code takes 500 seconds. On this larger problem, Melissa's code had a maximum of 54 megabytes allocated at any one time, while your code used a total of 710 megabytes.

For comparison, running the shell command primes 2 102400000 | wc takes 3.5 seconds, where primes is the venerable prime sieve distributed with BSD. I'm sure the bottleneck is converting from binary to decimal and doing input/output. Running factor (sieve 102400000) 360 with my code finishes in 1.6 seconds. This is not exactly an apples-to-apples comparison, because creating a list of primes from my sieve is only slightly faster than Melissa O'Neill's code.

This demonstrates that as computers get faster, algorithmic efficiency becomes more important, not less.

i'm not saying my code's in any way better

I'm not saying my code's in any way better than the PQ version; I would have to be insane to claim that. :) You shouldn't compare my code with Melissa's; that's not its intention. Instead, compare it with trial division, or Richard Bird's code, or Nick's. Don't try it for very big numbers of primes either. It's obviously only good for the first quarter of a million primes or so. But it's not only good for 19 of them either, that's for sure.

I'm merely saying it's a step between the terrible unfaithful sieve, and the optimized sophisticated one, a code which is also illuminating in some ways, suggesting the improvement of explication of list control structure as a list data structure, which is then turned into PQ data structure. An introductory device, if you will. But if we were to look for the candidate for "that functional look-and-feel", simple, yet not-terribly-inefficient code, it'd be as good a candidate as any.

It helped me to understand what's going on, as far as I do.

Under a lazy regime, evaluation of a filter triggers copying.

OK, my bad. The copying can only vanish if the produced value stream isn't reused, and it is reused in my code. Bummer. :) But no, actually, only the one resulting re-usable sequence needs to be allocated; each filter here doesn't reuse anything. All it needs to do is get candidates from its producer and pass its resulting value out to its caller. A filter only remembers its next candidate, and that can always be just a pointer into the original storage - because what it passes out is itself a pointer into the original storage. So each nested filter should remember its next value inside its frame, as a pointer into the original storage, all of them, and let the caller worry about the produced stream's re-use and copying.

There is no trial in the Sieve of Eratosthenes. When you visit a number, you are definitely removing it. Although many numbers will get "removed" multiple times,

I find it hard to see it that way. When you set out to eliminate all the composites up to a certain upper limit (say for the first 1000 natural numbers) you proceed to count and eliminate all the multiples for each prime less than 32 - and each time you count over the whole table from 0 to 1000 (well, not from 0 but from the square of each of the first primes). It means that for each number from 0 to 1000 you have in effect tested its divisibility by _all_ the primes less than 32. ALL of them.

That is the definition of the trial division algorithm. Trial and delayed-filter are indeed asymptotically the same, because of the reasoning you provide, but the delayed-filter is better by a constant factor because (I think) of my reasoning - it eliminates some of the work on composites which trial division always performs. At least that's what my empirical data show, I think. E.g. the number 925 will get eliminated by the very early prime sieve of '5' and thus won't even be considered by the remaining 7 of them, but in the counting sieve we definitely will go over the number 925 several times, once for each of 2,3,5,7,11,13,17,19,23, and 29. Whether we scratch it out or not is immaterial. We went through it, so we have tested for its divisibility by all of them.

So it is better than trial division, but in a less important area, with no influence on its asymptotic behavior. IOW it only has an advantage at first. But that's alright, because its only real value is educational, as introductory code. It's not as terrible as the original one, is all, and is on par with the rest of the unsophisticated ones. :)

BTW in my tables I refer to the number of primes produced, not their top value.

It is not a sieve. It is better.

In fact I doubt it can be considered a sieve in a traditional sense [...]

I would definitely call it a sieve. See my comment above

You say there

Certainly, it is a "genuine sieve" of some variety. [...]

After all, O'Neill's sieve is fully incremental: if you want all the primes between 1 and 10^9, you don't have to wait for multiples of 2, 3, 5, and 7 to get crossed off to get back 53. So by the time criteria, it certainly is a different algorithm.

so you seem to contradict your own statement. The genuine sieve does go over the whole table, counting off and crossing out each p-th number, as if "in rows", but Melissa O'Neill's code achieves its breakthrough in abandoning this counting and instead computing and comparing the values of the numbers to be scratched. Thus it is not a sieve at all.

It is better.

(it could've maintained "phase" counters, but then it would have to update each one's value on each step forward, thus falling back from its use of addition to pred/succ).

BTW I think O'Neill inserts primes immediately

BTW I think O'Neill herself inserts primes into the PQ immediately as they are encountered, unfaithfully to the "genuine sieve". :) She relies then on the PQ's mechanics to keep them away until needed, but it still makes the Queue blow up hard in size, prematurely.

In my half-baked primitive PQ implementation as a binary tree with ad-hoc insertion and without even a pop-and-reinsert, it does pay to insert them as late as possible (below, emptyIts, addIt and updIts are that implementation's empty queue, insertion, and "check the top and advance any matching entries" operations; updIts returns Nothing when the top doesn't match, i.e. when x is a prime):

 qprimes   = 2 : primes'   where
   primes' = 3 : sieve primes' (map (^2) primes') emptyIts [5,7..]

   sieve ps@(p:pt) qs@(q:qt) its (x:xt)             -- ASSUMES all qs are PRESENT in the feed stream
    | x == q
           = let its' = addIt its (2*p,q+2*p)       -- add (val,key)
             in                sieve pt qt its' xt
    | True   
           = case updIts its x of
              Just its' ->     sieve ps qs its' xt  -- match on Q found: a composite
              _         -> x : sieve ps qs its  xt  -- x is a prime

This also makes clear the existence of two kinds of queue insertions: one, the rarer, distant insertion of a new entry - when the square-of-prime is reached; and the more frequent, local one, involving mostly the top of the queue, when it is consulted to see if there's a match, so that we know this specific composite is reached in the input stream, and so its entries for each of its prime factors are popped off from the head of the queue, and are reinserted with their next-to-produce composites as keys (the step value - here, twice the original prime - is stored as a value in PQ's entry). If no match was found, the queue stays unchanged and we know we've encountered our next prime. And for that, of course, we only had to consult the very top of the queue.

To make our testing data a bit more representative, here it is again with added data for the unfaithful sieve and trial division version, exactly as it is in the article, only with a standard "on-odds" improvement:


   1000s:         1    2    4    8    16   32   64  128   256   512   1024
unfaithful       0.05 0.21 0.99 6.66 46.71
trial_3               0.02 0.06 0.16 0.43 1.16 3.04 8.23 22.00 60.19
delayed-filter             0.02 0.07 0.25 0.76 1.95 5.40 14.98 43.07 123.24
Bird's_3                   0.03 0.07 0.25 0.53 1.36 3.42  9.19 23.32  66.93
PQ                         0.04 0.10 0.22 0.53 1.28 2.95  6.40 14.84  35.84
Nick's_3                   0.03 0.07 0.23 0.65 1.57 5.11 11.56 40.51  ----

Each data point should be taken with an error margin of about 10%.

We can get a glimpse of the asymptotic performance locally by checking the empirical time sequences at doubling problem sizes with the function tm_check tms = map ((/log 2).log) $ zipWith (/) (tail tms) tms, which lets us see the exponent of the power function locally approximating our data:

   1000s:         1    2    4    8    16   32   64  128   256   512   1024
unfaithful            2.07 2.23 2.75 2.81
trial_3                    1.58 1.41 1.42 1.43 1.38 1.43  1.41  1.45          -- same
delayed-filter                  1.80 1.83 1.60 1.35 1.46  1.47  1.52  1.51    -- same
Bird's_3                        1.41 1.64 1.08 1.35 1.33  1.42  1.34  1.52    -- a bit better
PQ                              1.32 1.13 1.26 1.27 1.20  1.11  1.21  1.27    -- WINNER
Nick's_3                        1.22 1.71 1.49 1.27 1.70  1.17  1.80          -- memory blowup
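As a quick sanity check of tm_check, applying it to the PQ timings from the table above reproduces the exponents row to within rounding (remember the stated ~10% error on the data points):

tm_check :: [Double] -> [Double]
tm_check tms = map ((/ log 2) . log) $ zipWith (/) (tail tms) tms

-- PQ timings (seconds) for 4,8,...,1024 thousand primes, from the table above:
-- tm_check [0.04,0.10,0.22,0.53,1.28,2.95,6.40,14.84,35.84]
--   ==> [1.32,1.14,1.27,1.27,1.20,1.12,1.21,1.27]   (approximately)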

The additional code used in this testing was:


 uprimes        = sieve [2..]  where  -- "unfaithful sieve" 
   sieve (p:xs) = p : sieve (filter ((/=0).(`rem`p)) xs)

 tprimes        = 2 : primes'  where  -- trial divisions code 
   primes'      = 3 : filter isprime' [5,7..]
   isprime' x   = all ((/=0).(x`mod`)) 
                      $ takeWhile ((<=x).(^2)) primes'

 dprimes        = 2 : primes'  where  -- "delayed-filter"
   primes'      = 3 : sieve primes' (map (^2) primes') [5,7..]
   sieve (p:ps) (q:qs) xs  
                = let (h,t) = span (< q) xs
                  in h ++ (sieve ps qs 
                           . filter ((/=0).(`rem`p)) . tail) t

 bprimes        = 2 : primes'  where  -- Richard Bird's code 
   primes'      = 3 : [5,7..] `minus` composites
   composites   = foldr merge []
                        [ map (p*) [p,p+2..] | p <- primes' ]
   merge  (x:xs)  ys    = x:merge' xs ys        -- produce the number 9 forthwith
   merge' (x:xs) (y:ys) 
          | x < y       = x:merge' xs (y:ys)
          | x == y      = x:merge' xs ys
          | x > y       = y:merge' (x:xs) ys
   a@(x:xs) `minus` b@(y:ys) 
          | x < y       = x:(xs `minus` b)
          | x == y      =    xs `minus` ys
          | x > y       =    a  `minus` ys


We're getting off topic

I have been using the phrase "unfaithful sieve" in the exact same sense that the paper that we should be discussing uses the phrase. Yes, the paper describes the algorithm in the way you say, but if you read the source, Melissa O'Neill's full implementation already encompasses the ideas you present. And no, the paper is not "unfaithful"; this optimization does not affect the asymptotic behavior of the algorithm.

In actuality, if you were to use an array based, imperative binary heap, this optimization does not make one iota of difference. The particular usage of the heap results in O(1) inserts anyway. Only in the presence of persistent priority queues do you even need to be concerned with such details.

You can download the Sieve of O'Neill and more on hackage. You can even browse haddocks and a colorized source in your browser. Be warned that the factor sieve performs badly on GHC 6.10 and (to a lesser extent) GHC 6.12; I recommend GHC 6.8 for that bit of code. Go do your homework.

And, while you are at it, you might try this implementation. On my computer it runs ever so slightly faster than yours, with much improved patterns of memory consumption but basically the same time behavior:

primes  = 2 : primes'
primes' = 3 : filter isPrime [5,7..]

isPrime n = all (\p -> n `mod` p /= 0)
                (takeWhile (<= flsqrt n) primes')

flsqrt = floor . sqrt . fromIntegral

Now honestly, we are getting rather off topic, and I daresay the bogosity fraction has been a little high in your lengthy posts, which we try to discourage here. Lengthy posts are ok, but when they start to lose objectivity, quibble over the meanings of words, rest too much on unsubstantiated opinion, or when we lose the back-and-forth exchange of ideas, it's not good for the community.

Let me be clear: I think that circular programming is pretty cool, and I've written about that at some length in the latest Monad Reader, and hinted at on LtU here, here and here. I like your original code. Your first few "original" circular programs will almost certainly be very mind bending and can fill one with a heady mixture of excitement and befuddlement.

If you are really interested in benchmarking, take a look at complexity, criterion, or even do as I did and roll your own crude benchmarking library that will run something a given number of times and then calculate the mean and standard deviation.

First and foremost LtU exists to exchange ideas and come to a better mutual understanding of programming, and so far despite a lot of words, you haven't told me anything I don't know, and I don't seem to have said anything that's made an impact on you. So if you want to set up say, a wordpress blog, and write something up, please be my guest. I will take a look at it. But if this conversation is to continue on LtU, something has got to change.

I've just posted that other

I've just posted that other code for reference. I guess these test data had no other point than to show that my code's performance was on par with the other non-PQ versions, on my computer. They also show the usieve's O(n^3) behavior and the delayed-filter code's O(n^1.5), where n is the number of primes produced, so I do believe that my data disprove the article's claim (on pg. 2) that

This optimization [starting from a prime's square - wn] does not affect the time complexity of the sieve, however, so its absence from the code in Section 1 is not our cause for worry.

I do believe the article is in error in not recognizing the need to start each filter delayed, at the prime's square only, as the central problem of the naive code. Again, the original's behavior is O(n^3), which gets improved to about O(n^1.5) by my code, which then gets improved to about O(n^1.2) by the article's PQ algorithm (or perhaps O(n*log^k n)).

The article claims that the sieve is better than the filter because, in the example it uses (on pg. 2), it "crosses off 15 multiples of 17 from 289 to 527", whereas the filter "examines 45 non-multiples of 2,3,5,7,11,13" (emphasis mine). But what's important, IMO, is not the crossing-off, but the very counting. The sieve counts 238 times from 289 to 527, while the filter only examines 45 numbers on the sequence culled by the preceding filters, so this reasoning doesn't seem valid to me. The true origin of the article's optimization is that it changes counting (pred/succ) into leaping forward at once (using addition). It stops counting, so I thought that means it stops being a sieve. Is that "semantics"? What's wrong with semantics - is it not "meaning"? This _does_ matter for very big numbers, because then comparison acquires a cost. Counting alone costs the same on big numbers as on small ones, but the article abandons counting.

Is it not a valid point to make?

The premature insertion of big primes' squares will indeed be O(1) in an array-based PQ, but it won't be in other implementations, and it will take up memory.

Your code exchanges the p^2 <= x test for p <= sqrt x in the trial version. It's a nice trick to speed things up.

Thank you for your attention, and discussion, and pointers you gave me.

I don't think I made any bogus claims, not consciously anyway. But we don't have to spend even more time on these minutiae.

Optimizing trial division vs Optimizing a "geniune sieve"

Is it not a valid point to make?

It's true that stopping at sqrt n makes a huge difference when working with trial division. But Melissa O'Neill is 100% correct in her claim: given a genuine Sieve of Eratosthenes, starting at p*p is not a crucial optimization. It does not change the asymptotics. This is true of both the traditional Sieve of Eratosthenes and the Sieve of O'Neill, but not of trial division.

In general, an optimization that is crucial to one algorithm might not mean that much to another. It just goes to show that the "unfaithful" sieve and the "genuine" sieve are fundamentally different at the highest level.

The other optimization that you mention, of delaying the insertion of primes into the PQ, is also already done in the full implementation but not mentioned in the paper. Again, this is not a crucial optimization for an implementation to be considered a "genuine sieve" by the criteria set forth by the paper, as it does not affect the asymptotic time complexity of the algorithm. It also doesn't affect space complexity much, as you have to store them *somewhere* unless you want to recompute them later. (Which may in fact be worthwhile.)

So most of your observations are ok, but the conclusions tend to be bogus.

I am not trying to discourage or disparage you, but it is past time to take this off of LtU. You are more than welcome to email me if you'd like to continue this conversation. I do have a few ideas that we could explore that might actually turn into something interesting and/or useful. You can find my email address on my Monad Reader article.

Is it not a valid point to

(deleted)

Fastest non-PQ code yet?

Just to put it on the record, the nested-filters code can be much improved in its efficiency by collecting all the primes found so far to test primality by into an explicit list, and eliminating their multiples in one go, each time for the corresponding span up to the next prime's square:

 primes   = 2: primes'           
 primes'  = 3: sieve primes' 0 [5,7..]      
 sieve (p:ps) k xs            -- with k primes', filter xs
          = noDivs k h ++ sieve ps (k+1) (tail t)
                where (h,t) = span (< q) xs ; q=p*p
 noDivs k = filter (\x-> all ((/=0).(x`rem`)) $ take k primes')

It actually doesn't matter, speed-wise, whether we collect this list explicitly or just reuse the primes sequence's prefix as it is getting built.

I've arrived at the above code after more private discussion with Leon P. Smith, who had the great insight of fusing the numbers supply with the span to directly generate the needed numbers on each iteration step:

 primes   = 2: primes'           
 primes'  = 3: sieve primes' 0 5
 sieve (p:ps) k x             -- with k primes', filter from x
          = noDivs k h ++ sieve ps (k+1) (q+2)
                where (h,q) = ([x,x+2..q-2], p*p)
 noDivs k = filter (\x-> all ((/=0).(x`rem`)) $ take k primes')

This code is the fastest of all I've seen here or in the article (it's about 25% faster than the one above it) at producing the first million primes.

The path for further improvement is also clear - maintain the first k primes explicitly in a priority queue instead of a list, and iterate over the generated numbers' span while updating the queue, thus skipping the primes - exactly the article's approach. So it would have to be compared with the above code then, for it to be a fair comparison.

The relative performance of the "span" and "generate" code vs. my ad-hoc tree-based PQ code, tested by loading GHC -O3-compiled code into GHCi, is:

 
 generated:  100,000   300,000   1,000,000    primes 
 
 span/PQ       0.85      1.0       1.2
 gene/PQ       0.64      0.8       1.0

Yes, it is faster than PQ-based code, for the first MILLION primes. That's not primes up to one million; that's a million primes generated.

I take these data to prove that the Sieve of Eratosthenes can be written in a simple clear functional lazy style and be reasonably fast at that, without resorting to arcane data structures, which are better left for far-end optimizations.

Thus this disproves the central case of the article. The author presents astronomical gains in speed over the naive case, and attributes it all to her use of PQ, but the above data disprove that claim.

The real sieve crosses off composites and thus gets its primes for free because it is able to work with the spans of numbers at once (whether performed by humans, using our vision, or imperative sieves, using mutable arrays). Plain functional versions work with one number at a time, imitating counting by recalculating the remainders, and thus end up working hard for primes as well (and asymptotically primes matter the most, because most of the composites will have small prime factors most of the time, as the article rigorously shows).

The usual descriptions of the algorithm overlook this aspect of counting and traversal, ascribing no cost at all to them. The author does that too when on pgs. 2 and 3 she compares 45 numbers tested by trial-division with 15 numbers crossed off by a "faithful sieve", not realizing that in order to cross off those 15 numbers it had first to visit 238 of them. This has no cost on imperative arrays, or with human vision, but it is still there.

The priority queue code achieves its improvement by imitating this spatial aspect, being able to jump to the next composite directly, as it calculates its value, thus skipping over the primes. The original sieve, though, works by counting, not by calculating and comparing values.

That using a value-comparing priority queue is a great far-end optimization nobody can dispute (as is the wheel), but whether it is in any way "faithful", while a version that imitates counting by recalculating remainders is not faithful, to an ill-described algorithm devised by an ancient Greek 2300 years ago, is open to anyone's interpretation. IMO.

{edit:}
... apparently list comprehensions are compiled into even faster (by 5-10%) code:

primes   = 2: 3: sieve 0 primes' 5 
primes'  = tail primes
sieve k (p:ps) x         -- sieve by k first odd primes the odds from x upto p*p
         = [x | x<-[x,x+2..p*p-2], and [(x`rem`p)/=0 | p<-take k primes']]
           ++ sieve (k+1) ps (p*p+2)  

Modified Eratosthenes

To work on an unlimited list of primes, you should use a slightly modified sieve. It requires a lot of scroll but that shouldn't be an issue for the average computer ;)

  1. Create a contiguous list of numbers from two.
  2. Annotate with 2 the next multiple of two.
  3. If the next number in the list has not been annotated then it is a prime number.
  4. Annotate with this prime number the next multiple of the number you identified in the previous step.
  5. If the next number in the list is annotated with a number, annotate the next multiple of that number with it.
  6. Repeat previous step for all numbers in the annotation.
  7. Repeat from step 3 until you reach a number that is greater than the square root of n.
  8. Repeat steps 5 and 6 until you reach n
  9. All the remaining non-annotated numbers in the list are prime.

Instead of annotating with just "is prime" or "is not prime", the algorithm annotates with all the prime factors of the number. The computer only needs to store an annotation for numbers that are not yet visited. The x2 optimisation and pre-annotate optimisation can be used too.
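A minimal sketch of this scheme in Haskell (my own illustration, not the Java implementation mentioned below; this is the unbounded variant, so the square-root cutoff of steps 7-9 does not apply, and the x2 and pre-annotate optimisations are left out):

import qualified Data.Map.Strict as M

-- the map plays the role of the annotations: for each not-yet-visited
-- composite it stores the primes that annotate it
annotPrimes :: [Integer]
annotPrimes = go 2 M.empty
  where
    go x m = case M.lookup x m of
      Nothing ->                                    -- not annotated: x is prime;
        x : go (x+1) (M.insertWith (++) (2*x) [x] m)   -- annotate its next multiple with it
      Just ps ->                                    -- annotated: x is composite; move each
        go (x+1) (foldr (\p -> M.insertWith (++) (x+p) [p])   -- annotation one multiple forward
                        (M.delete x m) ps)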

Some results for a quick Java implementation

Some results for a quick Java implementation using ints and a Hashmap for prime factors:

$ java -Xprof Primes 500000000
2  3  5 ... 499999909  499999931  499999993  

Flat profile of 692.62 secs (40541 total ticks): main

  Interpreted + native   Method                        
  0.3%   103  +     0    java.lang.Integer.hashCode
  0.0%     0  +     9    java.util.zip.ZipFile.read
  0.0%     5  +     0    Primes.lazyAppend
  0.0%     4  +     0    Primes.
  0.0%     0  +     3    java.util.zip.ZipFile.open
  0.0%     0  +     3    java.lang.Object.getClass
  0.0%     2  +     0    java.math.BigInteger.addOne
  0.0%     2  +     0    java.util.Arrays.asList
  0.0%     2  +     0    java.util.Arrays.copyOf
  0.0%     0  +     2    java.lang.System.arraycopy
  0.0%     2  +     0    java.util.ArrayList.get
  0.0%     1  +     0    sun.security.x509.X509CertInfo.
  0.0%     1  +     0    sun.security.provider.X509Factory.engineGenerateCertificate
  0.0%     1  +     0    sun.net.www.ParseUtil.decode
  0.0%     1  +     0    java.math.BigInteger.multiplyToLen
  0.0%     1  +     0    java.util.regex.Pattern.compile
  0.0%     1  +     0    sun.security.pkcs.PKCS9Attribute.
  0.0%     1  +     0    java.security.Provider.parseLegacyPut
  0.0%     0  +     1    java.util.zip.Inflater.init
  0.0%     1  +     0    sun.security.provider.SHA.implCompress
  0.0%     0  +     1    java.lang.Class.forName0
  0.0%     0  +     1    sun.misc.Unsafe.getInt
  0.0%     1  +     0    java.util.AbstractCollection.
  0.0%     1  +     0    java.lang.Double.toString
  0.0%     1  +     0    java.nio.HeapByteBuffer.
  0.4%   146  +    20    Total interpreted (including elided)

     Compiled + native   Method                        
 59.0% 23917  +     1    Primes.
 30.4%  8294  +  4024    Primes.lazyAppend
  2.7%     0  +  1098    java.lang.Integer.valueOf
  2.3%   950  +     0    java.util.HashMap.containsKey
  1.5%   616  +     0    java.util.HashMap.remove
  1.3%   529  +     0    java.util.HashMap.get
  1.1%     0  +   460    java.util.AbstractList.iterator
  1.1%     0  +   446    java.util.ArrayList.ensureCapacity
  0.0%     3  +     0    java.math.BigInteger.squareToLen
 99.5% 34309  +  6029    Total compiled

         Stub + native   Method                        
  0.0%     0  +     3    java.lang.Object.getClass
  0.0%     0  +     3    Total stub

  Thread-local ticks:
  0.1%    34             Class loader


Flat profile of 0.04 secs (4 total ticks): DestroyJavaVM

  Interpreted + native   Method                        
 33.3%     1  +     0    java.io.DeleteOnExitHook.
 33.3%     1  +     0    Total interpreted

  Thread-local ticks:
 25.0%     1             Blocked (of total)
 66.7%     2             Class loader


Global summary of 692.68 seconds:
100.0% 44697             Received ticks
  7.9%  3540             Received GC ticks
  0.1%    30             Compilation
  1.3%   584             Other VM operations
  0.1%    36             Class loader

$ java -Xprof Primes 50000000
2  3  5 ... 49999903  49999921  49999991  

Global summary of 49.04 seconds:
100.0%  3204             Received ticks
  5.3%   171             Received GC ticks
  0.8%    26             Compilation
  1.5%    48             Other VM operations
  0.2%     5             Class loader

$ java -Xprof Primes 5000000
2  3  5 ... 4999913  4999933  4999949  4999957  4999961  4999963  4999999  

Global summary of 4.62 seconds:
100.0%   305             Received ticks
  4.6%    14             Received GC ticks
 13.1%    40             Compilation
  1.6%     5             Other VM operations
  1.6%     5             Class loader

$ java -Xprof Primes 500000
2  3  5 ... 499903  499927  499943  499957  499969  499973  499979  

Global summary of 0.92 seconds:
100.0%    68             Received ticks
  5.9%     4             Received GC ticks
 42.6%    29             Compilation
  5.9%     4             Class loader

$ java -Xprof Primes 50
2  3  5 ... 11  13  17  19  23  29  31  37  41  43  47  

Global summary of 0.45 seconds:
100.0%    33             Received ticks
 51.5%    17             Compilation
  3.0%     1             Class loader

$ java -Xprof Primes 2000000000
2  3  5 ... 1999999913  1999999927  1999999943  1999999973  

Global summary of 2496.85 seconds:
100.0% 166844            Received ticks
 15.1% 25155             Received GC ticks
  0.0%    37             Compilation
  1.3%  2242             Other VM operations
  0.0%     3             Class loader

Mmh, the time complexity looks roughly linear...

$ java -version
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03-211)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02-83, mixed mode)

For the sake of completeness (old sieve of Eratosthenes thread)

The genuine, sufficiently efficient, unbounded, list-based sieve of Eratosthenes in Haskell turned out to be

primes = (2:) . _Y $ (3:) . diff [5,7..] . _U . map (\p-> [p*p, p*p+2*p..]) 
  
_Y g = g (_Y g)        -- non-sharing fixpoint combinator
_U ((x:xs):t) = (x:) . union xs . _U $ unfoldr (\(a:b:c)->Just(union a b,c)) t

with the usual definitions of diff and union for ordered, increasing, infinite lists of numbers.
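For reference, the usual definitions might be spelled out as follows (they mirror the minus/merge in Bird's code quoted earlier in the thread; the [] cases are omitted since all the lists involved are infinite):

-- set difference and union of ordered, increasing, infinite lists
diff  a@(x:xs) b@(y:ys) = case compare x y of
        LT -> x : diff  xs b
        EQ ->     diff  xs ys
        GT ->     diff  a  ys
union a@(x:xs) b@(y:ys) = case compare x y of
        LT -> x : union xs b
        EQ -> x : union xs ys
        GT -> y : union a  ys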

Runs in practically constant (very slowly growing) space, with run time's empirical orders of growth on par with the article's code (as later improved in the accompanying ZIP file), which is ~ n^1.20..1.25 (in n primes produced). For comparison, an array-based code runs at about ~ n^1.10..1.05 (Richard Bird's code from the article is equivalent to defining   _U ((x:xs):t) = (x:) . union xs . _U $ t).

With wheel factorization added,

primesW = [2,3,5,7] ++ _Y ( (11:) . tail . diff (scanl (+) 11 wh11) 
                          . _U . map (\(w,p)-> scanl (\c d-> c + p*d) (p*p) w)
                          . meetBy snd (tails wh11 `zip` scanl (+) 11 wh11) )
wh11 = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:  
       4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wh11

with the obvious semantics for meetBy f xs ys ~ meet (map f xs) ys. Fusing diff and meetBy with their surroundings gives it a bit of an additional constant factor improvement.

(later remark:) So lists are perfectly capable of expressing a proper sieve of Eratosthenes; moreover, the main problem with the original "naive" sieve is that it is too hasty, firing up a new filter for each produced prime. The paper rightly dismisses having each filter delay its operations until the prime's square, but overlooks the optimization of postponing the creation of the filter itself — like so:

primesT = sieve (4,primesT) [2..] 
    where
      sieve (q,ps) (x:xs) | x < q = x : sieve (q,ps) xs
                          | otherwise = next ps xs
      next (p:t) xs = sieve (head t * head t, t) [n | n <- xs, rem n p > 0]

This runs at about ~ n^1.35..1.45, compared to the original's ~ n^2, just as the paper shows the optimal trial division should.

Wow

That's impressive. Did you personally come up with it, or was it already described somewhere else? I would be interested in information about how you came up with this code.

If I understand correctly, the point of the "unsharing fixpoint" is to avoid a memory leak due to the recursion marking all intermediate lists as reachable forever. What would the memory cost of that be? The simple array-using imperative sieve implementations use an array of all the elements (primes and non-primes) in the interval of interest. You would store more because of the duplication in multiples to remove.

On the other hand, what is the time overhead of recomputing primes recursively? I don't have a clear picture of at which speed the unionAll implementation forces the list, for example.

Haskellwiki page about prime numbers

Thanks. :) This code is a product of discussions on haskell-cafe starting December 2009 and then 2010, mainly with contributions from Daniel Fischer, about the code by Heinrich Apfelmus, which came after the code by Dave Bayer from July 2007 (and then his later "venturi" version). I've just simplified the formulation later, forfeiting the sophistication of the prior versions, which strove to force the tree as little as possible, for the unrestrained simplicity of forcing it in bunches of 2^k streams - which still works, since primes' squares are sufficiently far apart (the square of the (2^k)-th prime exceeds the (2^(k+1))-th prime: 9 > 7, 49 > 19, 361 > 53, ...; asymptotically, 2*log(2^k) > log(2^(k+1)), i.e. 2k > k+1). The tree can be seen here and here.

Yes, it is about achieving the near-constant memory footprint, to allow the calculated primes to be forgotten (and thus gc-ed) as soon as possible; the initial idea was by Melissa O'Neill in her ZIP file code (with two stages). There's an SO answer about this, too. We had it with two stages for a long time on haskellwiki; I recently (a year?) realized the multistage production (with _Y) is even more forgetful. And that is just simple recursion, after all.

My interest was initially mainly about the idea of postponement, which I found was lacking in the article. It can be seen in Python as well. Here are some variations on that.