
A Performance Model of Dense Matrix Operations on Many-Core Architectures

  • Guoping Long
  • Dongrui Fan
  • Junchao Zhang
  • Fenglong Song
  • Nan Yuan
  • Wei Lin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5168)

Abstract

Current many-core architectures (MCA) have a much higher ratio of arithmetic throughput to memory bandwidth than traditional processors (vector, superscalar, multi-core, etc.). As a result, memory bandwidth has become an important performance bottleneck for MCA. Previous work has demonstrated promising performance of MCA on dense matrix operations. However, there is still little quantitative understanding of the relationship between the performance of matrix computation kernels and the limited memory bandwidth. This paper presents a performance model for dense matrix multiplication (MM), LU decomposition, and Cholesky decomposition. The input parameters are the memory bandwidth B and the on-chip SRAM capacity C; the output is the maximum core number P_max. We show that \(P_{max}=\Theta(B \cdot \sqrt{C})\). P_max indicates that, when the problem size is large enough, the given memory bandwidth is not a performance bottleneck as long as the number of cores P < P_max. The model is validated by comparing its theoretical predictions with experimental data from previous works.
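For intuition, the scaling \(P_{max}=\Theta(B \cdot \sqrt{C})\) is what one expects from the arithmetic intensity of blocked matrix multiplication: with an on-chip SRAM of C words, each word fetched from off-chip memory can participate in Θ(√C) floating-point operations. The following sketch computes the resulting core bound; the per-core flop rate f, the units (words per second, words), and the omitted constant factors are illustrative assumptions, not values taken from the paper:

```python
import math

def p_max(bandwidth, sram_capacity, flops_per_core):
    """Estimate the maximum core count before memory bandwidth saturates.

    bandwidth      -- off-chip memory bandwidth B, in words/second (assumed unit)
    sram_capacity  -- on-chip SRAM capacity C, in words (assumed unit)
    flops_per_core -- peak flop rate f of one core (illustrative assumption)
    """
    # Blocked MM with tiles sized to the on-chip SRAM achieves arithmetic
    # intensity Theta(sqrt(C)) flops per word of off-chip traffic, so P cores
    # demand roughly P * f / sqrt(C) words/s. Setting this demand equal to B
    # yields P_max = B * sqrt(C) / f = Theta(B * sqrt(C)).
    return bandwidth * math.sqrt(sram_capacity) / flops_per_core
```

For example, with B = 10^9 words/s, C = 10^6 words, and f = 10^9 flops per core, the bound works out to about a thousand cores; doubling either the bandwidth or quadrupling the SRAM doubles P_max, matching the Θ(B·√C) form.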

Keywords

performance model, many-core architecture, dense matrix, memory bandwidth


References

  1. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley (2006)
  2. Zhu, W., Sreedhar, V.C., Hu, Z., Gao, G.R.: Synchronization State Buffer: Supporting Efficient Fine-Grain Synchronization for Many-Core Architectures. In: Proceedings of the 34th International Symposium on Computer Architecture (ISCA 2007), San Diego, CA, USA, June 9-13 (2007)
  3. Vangal, S., Howard, J., Ruhl, G., Dighe, S., Wilson, H., Tschanz, J., Finan, D., Iyer, P., Singh, A., Jacob, T., Jain, S., Venkataraman, S., Hoskote, Y., Borkar, N.: An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS. In: Proceedings of the IEEE International Solid-State Circuits Conference, February 11-15 (2007)
  4. Dally, W.J., Labonte, F., Das, A., Hanrahan, P., Ahn, J.H., Gummaraju, J., Erez, M., Jayasena, N., Buck, I., Knight, T.J., Kapasi, U.J.: Merrimac: Supercomputing with Streams. In: Proceedings of the ACM/IEEE Conference on Supercomputing (SC 2003), November 15-21 (2003)
  5. Tan, G., Fan, D., Zhang, J., Russo, A., Gao, G.R.: Experience on Optimizing Irregular Computation for Memory Hierarchy in Manycore Architecture. In: The 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2008), February 20-23 (2008)
  6. Hu, Z., del Cuvillo, J., Zhu, W., Gao, G.R.: Optimization of Dense Matrix Multiplication on IBM Cyclops-64: Challenges and Experiences. In: The 12th International European Conference on Parallel Processing (Euro-Par 2006), August 29 - September 1 (2006)
  7. Venetis, I.E., Gao, G.R.: Optimizing the LU Benchmark for the Cyclops-64 Architecture. CAPSL Technical Memo 75 (February 2007)
  8. Tan, G.: Locality and Parallelism of Algorithm in Irregular Computation. Ph.D. dissertation, Institute of Computing Technology, Chinese Academy of Sciences (2007)
  9. Automatically Tuned Linear Algebra Software (ATLAS), http://math-atlas.sourceforge.net/
  10. Yotov, K., Roeder, T., Pingali, K., Gunnels, J., Gustavson, F.: An Experimental Comparison of Cache-oblivious and Cache-aware Programs. In: Proceedings of the 19th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2007), June 9-11 (2007)
  11. Bilardi, G., Pietracaprina, A., Pucci, G., Schifano, S.F., Tripiccione, R.: The Potential of On-Chip Multiprocessing for QCD Machines. In: Proceedings of the International Conference on High Performance Computing, pp. 386-397 (December 2005)

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Guoping Long (1)
  • Dongrui Fan (1)
  • Junchao Zhang (1)
  • Fenglong Song (1)
  • Nan Yuan (1)
  • Wei Lin (1)

  1. Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
