Scientific Computing Kernels on the Cell Processor

  • Samuel Williams
  • John Shalf
  • Leonid Oliker
  • Shoaib Kamil
  • Parry Husbands
  • Katherine Yelick
Special Issue on High-End Computing

In this work, we examine the potential of using the recently released STI Cell processor as a building block for future high-end scientific computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key numerical kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs. Next, we validate our model by comparing results against published hardware data, as well as our own Cell blade implementations. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different kernel implementations and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.


Keywords: Cell processor, GEMM, SpMV, sparse matrix, FFT, stencil, three-level memory





Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Samuel Williams (1)
  • John Shalf (1)
  • Leonid Oliker (1)
  • Shoaib Kamil (1)
  • Parry Husbands (1)
  • Katherine Yelick (1)

  1. Lawrence Berkeley National Laboratory, CRD/NERSC, Berkeley, USA
