Short-Vector SIMD Parallelization in Signal Processing

  • Rade Kutil


Short-vector single-instruction-multiple-data (SIMD) units have become common in signal processors. Moreover, almost all modern general-purpose processors include SIMD extensions, which makes SIMD important in high-performance computing as well. This chapter gives an overview of approaches to the vectorization of signal processing algorithms. Despite their complexity, these algorithms have a relatively regular data flow, which makes them good candidates for SIMD vectorization. They fall into two categories: filter banks that operate on streaming signal data, and Fourier-like transforms that operate on blocks of data. For the first category, simple FIR filters, IIR filters, and more complicated filter banks from the field of wavelet transforms are investigated in order to develop and present general vectorization strategies. Well-known loop transformations and novel vectorization approaches are combined and evaluated. For the second category, basic approaches to the fast Fourier transform (FFT) are shown, and the workings of automatically vectorizing performance-tuning systems are explained. The presented solutions are tested on Intel processors with SIMD extensions, and the results are compared. Wherever possible, the reasons for performance gains or losses are identified so that good vectorization strategies can be derived for arbitrary signal processing algorithms.
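
The abstract names simple FIR filters as the first case for vectorization. The sketch below is illustrative only and is not taken from the chapter: it shows one common strategy, vectorizing over the output index so that four output samples are accumulated per iteration. It assumes the GCC/Clang vector extension (`vector_size`), which is not ISO C, and the names `v4f`, `fir_scalar` and `fir_simd` are hypothetical. For brevity it also assumes that the number of outputs is a multiple of four and that the input holds `n + taps - 1` samples.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four packed floats. Assumption: GCC/Clang vector extension, not ISO C. */
typedef float v4f __attribute__((vector_size(16)));

/* Scalar reference: y[i] = sum over k of h[k] * x[i + k], for i in [0, n).
   The input x must hold n + taps - 1 samples. */
void fir_scalar(const float *x, const float *h, float *y,
                size_t n, size_t taps)
{
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; k++)
            acc += h[k] * x[i + k];
        y[i] = acc;
    }
}

/* Vectorized over the output index: each outer iteration yields 4 outputs.
   Lane j of acc accumulates y[i + j]. */
void fir_simd(const float *x, const float *h, float *y,
              size_t n, size_t taps)
{
    assert(n % 4 == 0); /* simplification for this sketch: no remainder loop */
    for (size_t i = 0; i < n; i += 4) {
        v4f acc = {0.0f, 0.0f, 0.0f, 0.0f};
        for (size_t k = 0; k < taps; k++) {
            v4f xv;
            memcpy(&xv, x + i + k, sizeof xv);   /* unaligned 4-float load */
            v4f hv = {h[k], h[k], h[k], h[k]};   /* broadcast coefficient */
            acc += hv * xv;                      /* lane-wise multiply-add */
        }
        memcpy(y + i, &acc, sizeof acc);         /* store 4 outputs */
    }
}
```

Calling `fir_simd(x, h, y, 8, 3)` with a 10-sample input should produce the same eight outputs as `fir_scalar`. Most of the loads in the inner loop are unaligned, because the read address advances by one sample per coefficient. The cost of such loads is one kind of effect that can decide whether a vectorization strategy pays off on a given processor.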


Keywords: Sequential Algorithm, Lift Scheme, Signal Processing Algorithm, Vectorization Strategy, Small Data Size





Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Department of Computer Sciences, University of Salzburg, J.-Haringer-Strasse 2, 5020 Salzburg, Austria
