Dependence-Based Code Generation for a CELL Processor

  • Yuan Zhao
  • Ken Kennedy
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4382)


Obtaining high performance on the STI CELL processor requires substantial programming effort because its architectural features must be explicitly managed, with separate codes required for two different types of cores (PPE and SPE). Research at IBM has developed a single source-image compiler for CELL that performs vectorization but uses OpenMP to specify cross-core parallelism. In this paper, we present and evaluate an alternative dependence-based compiler approach that automatically generates parallel and vector code for CELL from a single source program with no parallelism directives. In contrast to OpenMP, our approach can also handle loop nests that carry dependences. To preserve correct program semantics, we employ on-chip communication mechanisms to implement barrier and unidirectional synchronization primitives. We also implement strategies to boost performance by managing DMA data movement, improving data alignment, and exploiting memory reuse in the innermost loop.



Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Yuan Zhao¹
  • Ken Kennedy¹

  1. Computer Science Department, Rice University, Houston, TX, USA