Note: This article is a bit dated, and as the relative speeds of memory access, floating point calculations, and integer/pointer calculations (for array indexing) change, it may be less relevant. However, I believe some of the techniques used here (especially the 1-multiply, 1-add trig generation) may still be relevant to modern FFTs as well.
As shown on the excellent FFTW benchmark page, today there are a number of algorithms that beat this one for complex-valued FFTs (except for the kinda special case of N=2^11 on a Power Mac ;-)). I haven't seen a comprehensive benchmark for real-valued FFTs or Hartley transforms yet.
Newsgroups: comp.dsp
Subject: Real FFT comparison
References: <128341@netnews.upenn.edu>
Reply-To: mayer@acuson.com [changed to rmayer@digitalmarket.com]
Distribution: world
Organization: ACUSON, Mountain View, CA
Keywords: FFT, Real valued.
[Date: Sometime in 1993, I believe.]

Summary: The real-valued FFT Henrik Sorensen posted here a few days ago is considerably faster than any of the other FFT routines I've seen posted recently. I have an FHT which seems faster on some architectures. This posting explains my results; the following posting will contain my source code, a C translation of Sorensen's code, and timing routines to compare them.
The code Henrik posted does indeed require less floating point computation than my FHT routine, but it requires about 1.5 times as many memory accesses. Systems with fast floating point arithmetic and slower memory access, like a Sun Sparc, give my FHT an advantage. We also used different methods of calculating trig values, but on a Sparc his algorithm was faster. Another difference may have been my translating his code from Fortran to C; I just did a line-by-line translation, and I'm certain it could have been cleaned up to better suit gcc's optimizer. My apologies if the performance overhead is due to a poor C translation on my part.
I'd guess a radix-4 real-valued FFT should beat them both.
----------------------------------------------------------------------
Timing Results
The tables below contain timing results for various array sizes for a number of different FFT algorithms. The timing values are in units of "microseconds/n/log(n)/number_of_ffts_run", i.e. the total time in microseconds divided by the transform size, by log(n), and by the number of transforms run. An explanation of each algorithm follows the tables.
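To make the units concrete, here is a minimal sketch of the normalization (illustrative only; the log base is not stated above, so base 2 is assumed here, and the actual timing harness is in the following posting):

/* Illustrative reconstruction of the tables' units:
 * "microseconds / n / log(n) / number_of_ffts_run".
 * The log base (2 here) is an assumption; the real timing code
 * is in the following posting.  Compile with -lm. */
#include <math.h>
#include <stdio.h>

static double normalized_us(double total_seconds, int n, int nffts)
{
    return total_seconds * 1e6 /
           ((double)n * (log((double)n) / log(2.0)) * (double)nffts);
}

int main(void)
{
    /* made-up numbers: 100 transforms of size 1024 taking 0.5 s total */
    printf("%.3g us/n/log(n)/fft\n", normalized_us(0.5, 1024, 100));
    return 0;
}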
The following table, with columns in roughly decreasing order of run time (slowest algorithm first), contains timing results on a Sparcstation 2, compiled with gcc version 2.0.2 with the flag -O4.
   fft_algorithm | trig_algorithm
   A: fourea-real  ;c_lib
   B: fourea       ;c_lib
   C: duhamel      ;c_lib
   D: mayer-rad4   ;c_lib
   E: numrec       ;numrec
   F: singleton    ;sing
   G: mayer-rad4   ;Buneman2
   H: mayer-rad4   ;Simple
   I: nrec_real    ;numrec    (real valued fft)
   J: sing_real    ;sing      (real valued fft)
   K: mayer-real   ;Buneman2  (real valued fft)
   L: sorensen-r   ;n-rec     (real valued fft)
   M: mayer-fht    ;Buneman2  (hartley transform)

   #       A    B    C    D    E    F    G    H    I    J    K    L    M
    4 :   5.9  5.5  2.3  1.5  5.6  9.4  1.6  1.6  4.0  11.  0.82 0.82 0.63
    8 :   5.1  5.0  2.7  1.1  3.3  6.2  1.1  1.1  2.5  3.5  0.58 0.65 0.41
   16 :   4.5  4.3  2.8  2.0  2.2  2.6  1.2  1.1  1.6  2.8  0.61 0.71 0.50
   32 :   4.0  3.9  2.7  2.0  1.8  1.9  1.0  1.0  1.2  1.4  0.53 0.61 0.42
   64 :   3.6  3.5  2.6  2.1  1.5  1.4  1.1  1.0  0.96 1.1  0.53 0.56 0.45
  128 :   3.2  3.2  2.4  2.0  1.4  1.6  0.96 0.92 0.81 0.87 0.49 0.48 0.42
  256 :   3.0  2.9  2.2  2.0  1.3  1.3  0.98 0.91 0.74 0.89 0.50 0.49 0.43
  512 :   2.9  2.8  2.1  1.9  1.2  1.2  0.92 0.90 0.71 0.70 0.46 0.44 0.40
 1024 :   2.7  2.8  2.2  1.8  1.2  1.3  0.94 0.92 0.69 0.65 0.46 0.45 0.41
 2048 :   2.6  2.9  2.2  1.7  1.2  1.5  0.91 0.87 0.68 0.68 0.44 0.44 0.39
 4096 :   2.6  3.0  2.3  1.7  1.2  1.4  0.94 0.91 0.68 0.81 0.46 0.45 0.41
 8192 :   2.8  2.9  2.2  1.6  2.2  1.4  0.89 0.86 0.66 0.73 0.44 0.44 0.39
16384 :   3.5  3.5  2.6  2.1  2.3  1.7  1.5  1.5  0.99 0.75 0.74 0.80 0.69
32768 :   3.5  3.5  2.6  2.1  2.3  1.9  1.5  1.5  1.0  0.88 0.74 0.85 0.69
65536 :   3.5  3.5  2.5  2.1  2.4  1.7  1.5  1.5  1.0  0.99 0.77 0.85 0.71

[In the original posting, rows of asterisks under the columns marked which algorithms were real-valued, Hartley, radix-2, radix-4, split-radix, and which used C-library trig; the alignment did not survive reformatting, but most of that information is in the key above and the descriptions below.]

Note:
 * There's at least a factor of 3 difference in the time taken by the various algorithms.
 * Real-valued transforms are significantly faster than complex FFTs.
 * FFTs which rely on C library calls for all trig values are very slow.
 * My workstation apparently started paging at about N=16384, which slowed down memory accesses. It is interesting to note that the split-radix real-valued FFT was more affected by paging than the radix-4 FHT. My complex FFT is also severely affected by memory paging.
 * Order(N) and Order(log(N)) components are very significant for transforms smaller than about 256. If FFTs were a purely N*log(N) process, all the numbers in any given column would be constant.
 * I suspect these results are highly architecture dependent. On a 486 system, Henrik's FFT was faster than my FHT for most 2^K with even K, and slower for most 2^K with odd K. The main reason is probably that the 486 has relatively few registers, so my algorithm can't effectively decrease the memory accesses.

-----------------
Description of FFT algorithms compared.

numrec     A straightforward radix-2 FFT.
           Source: "Numerical Recipes in C"
duhamel    "A Duhamel-Hollman split-radix dif fft" by Dave Edelblute,
           edelblut@cod.nosc.mil, 05 Jan 1993 [...]
           Ref: Electronics Letters, Jan. 5, 1984
           Source: periodic posting on comp.benchmarks
fourea     "adapted from subroutine FOUREA listed in Programs for Digital
           Signal Processing [...] edited by Digital Signal Processing
           Committee [...] IEEE Acoustics Speech and Signal Processing
           Committee [...] Chapter 1 Section 1.1 Page 1.1-4,5"
           Source: comp.dsp 02/04/93; by pe93009@black.ox.ac.uk (Ted Wong)
singleton  "Singleton's mixed radix algorithm"
           Source: edu/math/msdos/modelling on wuarchive.wustl.edu
           Translated to C by Javier Soley, FJSOLEY@UCRVM2.BITNET
sorensen   My C translation of the code Henrik Sorensen posted here.
           Source: "Real-valued fast Fourier transform algorithms",
           IEEE Trans. ASSP, Vol. ASSP-35, No. 6, pp. 849-864, June 1987
           (My C translation is included in the following posting.)
mayer-*    All my routines are wrappers around a fast Hartley transform.
           Source: the following posting.

---------------------------------------------------------------------------
Trig algorithms
A few people asked me via email about the trig algorithm I used. I'll try to look up the reference, but I think I lost it somewhere in storage. The algorithm works by manipulating a log2(n)-sized table of trig values. Each newly generated value is for an angle which lies exactly halfway between two angles whose trig values are already known. I think the identity is something like
    sin( (w+v)/2 ) = (sin(w) + sin(v)) * sec( (w-v)/2 ) / 2

which follows from the sum-to-product identity sin(w) + sin(v) = 2 sin((w+v)/2) cos((w-v)/2). You can start with a table of cos(pi), cos(pi/2), cos(pi/4), cos(pi/8), ..., and every time you generate a new value you can overwrite one which you no longer need (to avoid excessive memory usage). This takes less floating point math than any of the other trig generating algorithms I know of, but it requires a bunch of integer math to determine which value can be overwritten. It is probably only faster on systems which emulate floating point math; but it is considerably more stable than other common methods.
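For concreteness, here is a minimal sketch of the half-angle idea (illustrative only, not the exact Buneman2 routine in the following posting; for clarity it fills a whole sine table instead of keeping only a log2(n)-sized working set and overwriting values that are no longer needed):

/* Illustrative half-angle sine-table generation, not the posting's
 * Buneman2 code.  Each pass halves the gap between known angles;
 * every new value costs exactly one add and one multiply:
 *   sin((w+v)/2) = (sin(w) + sin(v)) * sec((w-v)/2) / 2
 * Only the log2(N) per-level constants come from the C library. */
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define M 4                 /* table covers angles k*pi/2^M */
#define N (1 << M)

int main(void)
{
    double s[N + 1];        /* s[k] approximates sin(k*pi/N) */
    double halfsec[M + 1];  /* 0.5 / cos(pi/2^level)         */
    int level, k, step;

    s[0]     = 0.0;         /* sin(0)    */
    s[N / 2] = 1.0;         /* sin(pi/2) */
    s[N]     = 0.0;         /* sin(pi)   */

    for (level = 2; level <= M; level++)
        halfsec[level] = 0.5 / cos(M_PI / (double)(1 << level));

    for (level = 2; level <= M; level++) {
        step = N >> level;  /* index distance to the new midpoints */
        for (k = step; k < N; k += 2 * step)
            s[k] = (s[k - step] + s[k + step]) * halfsec[level];
    }

    for (k = 0; k <= N; k++)
        printf("s[%2d] = %+.15f   (error %.2e)\n",
               k, s[k], s[k] - sin(k * M_PI / N));
    return 0;
}

Every generated value costs one add and one multiply, which is where the "1-multiply, 1-add trig generation" mentioned in the note at the top comes from.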
I'll try to find and mention the reference in a future posting.
---------------------------------------------------------------------------
Accuracy and Noise of FFT algorithms.
In an earlier discussion I had on comp.benchmarks, people were discussing possible tradeoffs between speed and accuracy. The amount of noise generated by the various FFT algorithms varies by a factor of about 1e10! However, as shown in the table below, noise is not all that closely correlated with the speed of the algorithm.
In reality, noise added by any of these FFTs probably only matters if you are doing zillions of forward and reverse transforms; but some will argue that a good FFT should have very low noise.
One significant component of the errors in an FFT is the trig algorithm used. Avoiding recursive generation of successive trig values can reduce accumulated errors by about a factor of 5. Unfortunately, C library functions are too slow for this purpose. See my code in the next posting for a number of compromise trig algorithms.
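To make the tradeoff concrete, here is a minimal sketch of the kind of recursive twiddle generation meant above (an assumption about what "recursively generated" looks like, not necessarily the "Simple" routine in the following posting); each step reuses the previous value, so rounding errors compound over the whole transform:

/* Illustrative recursive twiddle generation: rotate (cos,sin) by a
 * fixed angle each step and compare against the C library values.
 * Compile with -lm. */
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

int main(void)
{
    int n = 65536, k;
    double delta = 2.0 * M_PI / (double)n;
    double cd = cos(delta), sd = sin(delta);  /* one pair of library calls */
    double c = 1.0, s = 0.0;                  /* cos(0), sin(0)            */
    double maxerr = 0.0;

    for (k = 1; k <= n; k++) {
        double cnew = c * cd - s * sd;        /* rotate by delta */
        double snew = s * cd + c * sd;
        double err;
        c = cnew;
        s = snew;
        err = fabs(c - cos(k * delta)) + fabs(s - sin(k * delta));
        if (err > maxerr)
            maxerr = err;
    }
    printf("worst accumulated twiddle error after %d steps: %g\n", n, maxerr);
    return 0;
}

The half-angle table method described above avoids this accumulation because each value is generated from two already-known values rather than from a long chain of previous results.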
algo         trig        N        time   ssq-errors
fourea-real  ;c_lib      65536 :  3.48   2.19e-12
fourea       ;c_lib      65536 :  3.45   2.64e-11
duhamel      ;c_lib      65536 :  2.53   4.63e-21
numrec       ;numrec     65536 :  2.35   9.13e-14
mayer-rad4   ;c_lib      65536 :  2.08   1.79e-21   *(all library functions)
sing         ;sing       65536 :  1.68   5.22e-19
mayer-rad4   ;Buneman2   65536 :  1.52   2.46e-21   *(half-angle algorithm)
mayer-rad4   ;Simple     65536 :  1.51   8.16e-21   *(recursively generated)
nrec_real    ;numrec     65536 :  1.04   5.39e-19
sing_real    ;sing       65536 :  0.987  ????????
sorensen-r   ;n-rec      65536 :  0.855  1.83e-20
mayer-real   ;Buneman2   65536 :  0.765  1.78e-21
mayer-fht    ;Buneman2   65536 :  0.707  2.46e-21
Note that the weird trig algorithm (Buneman2 in the table) reduces errors by almost a factor of 4 with negligible performance cost.
----------------------------------------------------------------------------
Ron Mayer
rmayer@digitalmarket.com