As CMOS feature sizes continue to shrink and traditional microarchitectural methods for delivering high performance (e.g., deep pipelining) become too expensive and power-hungry, chip multiprocessors (CMPs) have become an exciting new direction by which system designers can deliver increased performance. Exploiting parallelism in such designs is the key to high performance, and we find that parallelism must be exploited at multiple levels of the system: the thread-level parallelism that has become popular in many designs does not, on its own, capture all of the parallelism available in many CMP workloads. We describe the Cell Broadband Engine and the multiple levels at which its architecture exploits parallelism: data-level, instruction-level, thread-level, memory-level, and compute-transfer parallelism. By taking advantage of opportunities at all levels of the system, this CMP revolutionizes parallel architectures to deliver previously unattained levels of single-chip performance. We describe how the heterogeneous cores make it possible to achieve this performance by parallelizing computation-intensive application code and offloading it onto the Synergistic Processor Element (SPE) cores using a heterogeneous thread model. We also give an example of scheduling code to tolerate memory latency using software pipelining techniques on the SPE.
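The idea of overlapping data transfer with computation through software pipelining can be illustrated with a small simulation. The sketch below is not SPE code and is not taken from the paper: it is a plain Python model, and the names `process_double_buffered`, `start_get`, and the event log are hypothetical. It assumes a double-buffering scheme in which the transfer for block i+1 is issued before the computation on block i begins, which is one common way to hide memory latency; on real hardware the transfer would be an asynchronous DMA operation rather than a list copy.

```python
from typing import Callable, List, Sequence, Tuple


def process_double_buffered(
    data: Sequence[int],
    block: int,
    compute: Callable[[List[int]], int],
) -> Tuple[List[int], List[Tuple[str, int]]]:
    """Process `data` in blocks, issuing the fetch of block i+1 before
    computing on block i (a software-pipelined, double-buffered loop).

    Returns the per-block results and a log of the order in which
    transfers and computations were issued.
    """
    events: List[Tuple[str, int]] = []
    results: List[int] = []
    n_blocks = (len(data) + block - 1) // block

    def start_get(i: int) -> List[int]:
        # Hypothetical stand-in for an asynchronous transfer into local
        # storage; here it is simply a synchronous copy of one block.
        events.append(("get", i))
        return list(data[i * block:(i + 1) * block])

    if n_blocks == 0:
        return results, events

    buffers: List[List[int]] = [[], []]
    buffers[0] = start_get(0)  # prologue: fetch the first block

    for i in range(n_blocks):
        cur = i % 2
        if i + 1 < n_blocks:
            # Issue the next transfer into the other buffer first, so it
            # could proceed while the current block is being processed.
            buffers[1 - cur] = start_get(i + 1)
        events.append(("compute", i))
        results.append(compute(buffers[cur]))

    return results, events


if __name__ == "__main__":
    sums, log = process_double_buffered(list(range(10)), 4, sum)
    print(sums)
    print(log)
```

The event log makes the schedule visible: every block after the first is requested one iteration before it is consumed, so in a system with truly asynchronous transfers the fetch latency would overlap with useful computation rather than stall it.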
Additional information
This paper is based in part on “Chip multiprocessing and the Cell Broadband Engine”, ACM Computing Frontiers 2006.
About this article
Cite this article
Gschwind, M. The Cell Broadband Engine: Exploiting Multiple Levels of Parallelism in a Chip Multiprocessor. Int J Parallel Prog 35, 233–262 (2007). https://doi.org/10.1007/s10766-007-0035-4