Relating FFTW and Split-Radix. Proc. of ICESS'04, the First International Conference on Embedded Software and System, December 9-10 2004, Hangzhou (Zhejiang), China.

This ongoing attempt to reproduce an efficient implementation of FFT using staging and abstract interpretation attempts to answer the question "How can we get the raw performance of hardware without giving up the expressivity and clarity of software?"

Here's how Oleg describes the contribution of this paper,

One may think that generating truly optimal power-of-two FFT is straightforward: we generate the naive radix-2 FFT code, and then optimize it, removing trivial multiplications (x*1), trivial additions (x+0), etc. That was the approach demonstrated previously.

And the bottom line is: the improved code is better than the original, but still worse than the best (FFTW) code. Here's why: if we consider an expression, (sin(pi/4) * x + cos(pi/4) * y), we may think that it can be optimized to sin(pi/4)*(x+y) thus saving one multiplication. Alas, computed sin(pi/4) and cos(pi/4) are not _exactly_ equal. Therefore, no optimizer is able to detect that they are the common factor. The previous paper BTW did handle the case of sin(pi/4) -- and still fell short of FFTW. To achieve optimum, more complex trigonometric transformations have to be applied.

This paper shows that to achieve the best quality of the generated code, we have to use domain knowledge. In case of FFT, we use the fact that it is a linear transform whose factors are roots of unity. This gives us an abstract representation of terms amenable to exact trigonometry. We can essentially symbolically manipulate the terms, and then concretize our abstract representation into the code. In the end, we generated FFT code that exactly matches the number of FP operations of FFTW. Somewhat unexpectedly, when we changed the function that generates the code for general complex multiplication, we obtained split-radix FFT' -- another well-known FFT technique.

Oleg points out that the point isn't that they managed to reproduced the FFTW results. The crucial point is that we know exactly which identities (i.e., axioms) contributed to the optimum. The search was principled rather heuristic, and the code is generated in only one pass. There are no manipulations on the code: it is generated just right.

## Comment viewing options

### Nice article + random thoughts

It seems I have only questions on this blog; ah well, I am not a mathematician ;-)

Does anyone remember the area/time, or bit-, complexity results on FFT? I Googled but only found "stuff I may not read, courtesy of ACM."

And, uhm, anyone know if anyone ever tried to get rid of all the multiplications in FFT -or NTTs, who cares- say by using Karatsuba multiplication? And if so, are there any AT results on that?

### The number of multiplications

The following paper is a good reference regarding the FFT complexity:

@article{Heideman:1986,
author = "M.T.Heideman and C.S.Burrus",
title = "On the number of multiplications
necessary to compute a length-$2^n$ {DFT}",
journal = {{IEEE} Trans. ASSP},
volume = "ASSP-34",
number = "1",
month = {February},
pages = "91--95",
year = "1986"
}


The paper proves that the minimum number of floating-point (not
rational!) multiplications needed to compute 2^n complex-valued DFT is
2^{n+2} - 2n^2 - 2n -4. For real-valued inputs, only half that many
are needed.

For split-radix DFT, the number of FP multiplication is (n-3)*2^n + 4.

The algorithm with the minimum number of multiplications is not
practical beyond n = 4 (because the number of additions
sky-rockets). But for n = 4 and below, the split-radix is the minimum
multiplication DFT algorithm. For n=5, the algorithm in the Heideman and
Burrus paper has 64 FP multiplications and 504 additions (whereas
only lower boundaries were proven but no algorithm was actually
written (so the number of additions is not known, but it is certainly
very large).

### Great

Thanks for the answers; couldn't find your article yet but stumbled on this one: S.G. Johnson, M. Frigo, A modified split-radix FFT with reduced arithmetic complexity. Hmpf, their tables only list general flop's.

Hmm, when counting bit operations there is a "philosophical" argument that you can get rid of all the multiplications by recursive Karatsuba but I must admit that this is probably impractical and stretching the truth, well, a bit... ;-)

I just posted an updated version of the paper that gives formulas for the separate counts of real additions and multiplies (no table, though). (See Eqs. 4-5 and 13.)

(Since the savings are purely in the number of real multiplications, as described in the paper, these are pretty easy to figure out yourself if you know the counts for standard split-radix assuming general complex multiplies are done in 4/2 mults/adds.)

### Combinatorial Reasoning Systems?

Thanks for your comments and compliments on your paper. I reread some stuff on FFT and NTTs and, ah well, decided that it would take too much of my time to grasp it really (I personally only looked at the NTT to find minimal terms for factorization, but went for Karatsuba for now, so irrational FFT I don't really get); same goes for about 50% of your paper, I grasp your line of reasoning but don't have time to go into the technicalities of it :(

Anyway, I had three, somewhat rambling, thoughts you (others) might find of interest. The last one is a question which rambles the least and which I am interested in.

First, by wikipedia, Winograd, who else, proved that the DFT can be computed with O(n) multiplications at the expense of too many additions. Wikipedia does not link to an argument. (prime length?)

Second, [decided to kill this not-even half-baked thought]

Third thought, which is the interesting thought for LtU readers -I guess- is: there is some "mechanical" feeling to the combinatorial/algorithmic argument. You state a function which unfolds to a given term graph and then you start searching for sharing on the graph, or rewrite rules which maximize sharing, after which you look for another recurrence relation which implies another function the graph could be folded to (and then you hope that you can apply the master theorem). This seems aimenable to automation so my question is: is there any software (more general than for FFT use) which helps automating these mundane combinatorial tasks?

[Hmm, actually, I guess I am going to try this. Hey, if I am no Winograd maybe my computer can become one ;-) ]

### SPIRAL signal-processing code generation system

marco wrote:
there is some "mechanical" feeling to the combinatorial/algorithmic argument. You state a function which unfolds to a given term graph and then you start searching for sharing on the graph, or rewrite rules which maximize sharing, after which you look for another recurrence relation which implies another function the graph could be folded to (and then you hope that you can apply the master theorem). This seems amenable to automation so my question is: is there any software (more general than for FFT use) which helps automating these mundane combinatorial tasks?

That has been done in the SPIRAL project, which `entirely autonomously, generates platform-tuned implementations of signal processing transform such as the discrete Fourier transform, discrete cosine transform, and many others.''

SPIRAL can also search for signal processing algorithms with the minimal operation count (btw, on modern super-scalar CPU the code with the minimal (FP) operations count is not necessarily the fastest one).

### I disagree until I made my decision. Then I agree.

[in the hope that some folks here share my appreciation of the blatant nonsensical (is it?) - the above is yet another translation of a quote by soccer guru Cruyff]

Good reference, I am now reading M. PÃ¼schel, B. Singer, J. Xiong, J. M .F. Moura, J. Johnson, D. Padua, M. M. Veloso, and R. W. Johnson SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms, it is a nice system.

[The results on comparing algorithms on various OSes/compilers I found very amusing. I read a few papers on satisfiability (solvers) and a number of results I always dumbed down to "studies on the caching behavior of the machine the experiment was executed on." Maybe I was right, who knows?]

Anyway, as always, it seems, I was thinking in terms of a more unconventional system which helps discovering combinatorial results.

An engineer told me once that he was amazed by the "human manner" in which Maple (?) solves difficult integrals. I.e, mechanized reasoning of what he did for thirty years in a way he thought was not possible.

In another thread, I made a somewhat -I guess discomfortable to some- philosophical remark on the nature of functional programming. In hindsight, another formulation of what I said there is that functional programming is based on combinatorial reasoning which primarily involves human spatial and temporal insight.

This sparked my original question: I was thinking about a system which facilitates the discovery of (pure) combinatorial results. I.e. presents the unfolded term graph of a function, finds/shows recurrence relations, rewrites according to some algebraic law... maybe folds the term graph... etc. etc. etc.

[I personally would love to see some system which would facilitate abductive reasoning too in some manner. Maybe something which shows different solutions of a similar problem, maybe tricks the mind between associative/disassociative states. But that, I guess, really is a dream. Ah well.]

Then again, I would settle for some text-based system for rapid prototyping of ideas; and, then again, maybe some stuff is better done with a pencil and some paper; at least Spiral solves a problem ;-)