
Optimization of Dense Matrix Multiplication on IBM Cyclops-64: Challenges and Experiences

  • Ziang Hu
  • Juan del Cuvillo
  • Weirong Zhu
  • Guang R. Gao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4128)

Abstract

This paper presents a study of performance optimization of dense matrix multiplication on the IBM Cyclops-64 (C64) chip architecture. Although much has been published on how to optimize dense matrix applications on shared-memory architectures with multi-level caches, little has been reported on the applicability of the existing methods to the new generation of multi-core architectures like C64. For such architectures, a more economical use of on-chip storage resources appears to discourage the use of caches, while providing tremendous on-chip memory bandwidth per storage area.

This paper presents an in-depth case study of a collection of well known optimization methods and tries to re-engineer them to address the new challenges and opportunities provided by this emerging class of multi-core chip architectures. Our study demonstrates that efficiently exploiting the memory hierarchy is the key to achieving good performance. The main contributions of this paper include: (a) identifying a set of key optimizations for C64-like architectures, and (b) exploring a practical order of the optimizations, which yields good performance for applications like matrix multiplication.

Keywords

Memory Hierarchy; Program Language Design; Memory Segment; Register Tiling; Task Pool
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Ziang Hu (1)
  • Juan del Cuvillo (1)
  • Weirong Zhu (1)
  • Guang R. Gao (1)

  1. Department of Electrical and Computer Engineering, University of Delaware, Newark, U.S.A.
