Scientific Computing Kernels on the Cell Processor

International Journal of Parallel Programming

In this work, we examine the potential of using the recently released STI Cell processor as a building block for future high-end scientific computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key numerical kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs. Next, we validate our model by comparing results against published hardware data, as well as our own Cell blade implementations. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different kernel implementations and demonstrates a simple and effective programming model for Cell’s unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall, our results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.

Author information

Corresponding author

Correspondence to Samuel Williams.

About this article

Cite this article

Williams, S., Shalf, J., Oliker, L. et al. Scientific Computing Kernels on the Cell Processor. Int J Parallel Prog 35, 263–298 (2007). https://doi.org/10.1007/s10766-007-0034-5

  • DOI: https://doi.org/10.1007/s10766-007-0034-5
