A Study on Load Imbalance in Parallel Hypermatrix Multiplication Using OpenMP

  • José R. Herrero
  • Juan J. Navarro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3911)


In this paper we present our work on the parallelization of a matrix multiplication code based on the hypermatrix data structure. We used OpenMP for the parallelization, adding OpenMP directives to a few loops and experimenting with several features available in the Intel Fortran Compiler's OpenMP implementation: scheduling algorithms, chunk sizes, and nested parallelism. We found that the load imbalance introduced by the hypermatrix structure could not be resolved by any of these OpenMP features.
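As a rough illustration only (not the authors' code), the sketch below shows what a loop-level OpenMP directive of this kind looks like in Fortran: a work-sharing directive over hypermatrix blocks, with the schedule clause selecting the scheduling algorithm and chunk size. The routine names hyper_mm and mxm_block, and the block count nblocks, are hypothetical.

```fortran
! Minimal sketch of loop-level OpenMP parallelization over hypermatrix
! blocks. All names here (hyper_mm, mxm_block, nblocks) are illustrative,
! not taken from the paper's code.
subroutine hyper_mm(nblocks)
  implicit none
  integer, intent(in) :: nblocks
  integer :: i, j, k
  external :: mxm_block   ! hypothetical per-block multiply routine

  ! Work-share the outer block loop; the schedule clause chooses the
  ! scheduling algorithm (static/dynamic/guided) and the chunk size.
  !$omp parallel do schedule(dynamic, 4) private(j, k)
  do i = 1, nblocks
     do j = 1, nblocks
        do k = 1, nblocks
           ! One submatrix product. In a hypermatrix, blocks holding
           ! only zeros are absent and skipped, so iterations carry
           ! unequal work: this is the source of the load imbalance.
           call mxm_block(i, j, k)
        end do
     end do
  end do
  !$omp end parallel do
end subroutine hyper_mm
```

Nested parallelism would additionally require enabling it (for example via omp_set_nested or the OMP_NESTED environment variable) and a second parallel do on an inner loop. A dynamic schedule with small chunks can mitigate uneven per-iteration work, but, as the paper reports, none of these features eliminated the imbalance introduced by the hypermatrix structure.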


Keywords: Schedule Algorithm · Dynamic Schedule · Memory Hierarchy · Static Schedule · Load Imbalance





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • José R. Herrero (1)
  • Juan J. Navarro (1)
  1. Computer Architecture Dept., Univ. Politècnica de Catalunya, Barcelona, Spain
