A Paradigm for Parallel Matrix Algorithms

  • David S. Wise
  • Craig Citro
  • Joshua Hursey
  • Fang Liu
  • Michael Rainey
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3648)

Abstract

A style for programming problems from matrix algebra is developed with a familiar example and new tools, yielding high performance with a couple of surprising exceptions. The underlying philosophy is to use block recursion as the exclusive control structure, down to a 2ᵖ × 2ᵖ base case, at least, where hardware favors an iterative style to fill its pipeline. Use of Morton-ordered matrices yields excellent locality within the memory hierarchy, including block sharing among distributed computers. The recursion generalizes nicely to an SPMD program where such sharing is the only communication.
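As an illustration of the Morton-ordered layout, a minimal sketch follows (the helper names `dilate` and `morton` are assumptions, not code from the paper): the offset of element (row, col) is obtained by interleaving the bits of the two indices, so every aligned 2 × 2 block, and recursively every aligned power-of-two block, occupies consecutive memory.

```c
/* A minimal sketch of Morton-order indexing (illustrative helper names,
 * not the authors' code): the offset of element (row, col) is formed by
 * interleaving the bits of the two indices ("dilated integers"). */
#include <stdint.h>
#include <stdio.h>

/* Spread the low 16 bits of x into the even bit positions. */
static uint32_t dilate(uint32_t x) {
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Morton (Z-order) offset of element (row, col) in a 2^p x 2^p matrix. */
static uint32_t morton(uint32_t row, uint32_t col) {
    return (dilate(row) << 1) | dilate(col);
}

int main(void) {
    /* The four elements of each aligned 2x2 block are contiguous
       (offsets 0 1 2 3 here), and the same holds recursively for every
       aligned 2^k x 2^k block: that recursive blocking is the source of
       the locality the abstract refers to. */
    printf("%u %u %u %u\n",
           morton(0, 0), morton(0, 1), morton(1, 0), morton(1, 1));
    return 0;
}
```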

Cholesky factorization of an n × n SPD matrix is used as a simple, nontrivial example to expose the paradigm. The program amounts to four functions, two of which are finalizers for the other two. This insight allows final blocks to be shared with inter-node communication in Θ(n²) for an algorithm that requires Θ(n³) flops.
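The quadrant recursion behind this can be sketched generically as below. This is not the paper's four-function, Morton-ordered SPMD program; it is a plain row-major, sequential block Cholesky with illustrative names (`cholesky`, `trisolve`, `update`), shown only to make the recursion shape concrete.

```c
/* A minimal sketch of recursive block Cholesky over row-major storage
 * (an assumption for brevity; the paper uses Morton order and an SPMD
 * decomposition).  Factor the leading quadrant, triangular-solve the
 * off-diagonal quadrant, update the trailing quadrant, and recur. */
#include <math.h>
#include <stdio.h>

/* Solve X * L^T = B for X, overwriting B; L is n x n lower triangular,
   B is m x n, both with leading dimension lda. */
static void trisolve(double *B, const double *L, int lda, int m, int n) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            double s = B[i * lda + j];
            for (int k = 0; k < j; k++)
                s -= B[i * lda + k] * L[j * lda + k];
            B[i * lda + j] = s / L[j * lda + j];
        }
}

/* Schur-complement update of the lower triangle: C -= B * B^T. */
static void update(double *C, const double *B, int lda, int m, int n) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j <= i; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += B[i * lda + k] * B[j * lda + k];
            C[i * lda + j] -= s;
        }
}

/* In-place Cholesky of the n x n SPD matrix A (lower triangle used). */
static void cholesky(double *A, int lda, int n) {
    if (n == 1) { A[0] = sqrt(A[0]); return; }   /* base case */
    int h = n / 2;
    double *A11 = A;                      /* leading quadrant       */
    double *A21 = A + h * lda;            /* off-diagonal quadrant  */
    double *A22 = A + h * lda + h;        /* trailing quadrant      */
    cholesky(A11, lda, h);                /* L11 = chol(A11)        */
    trisolve(A21, A11, lda, n - h, h);    /* L21 = A21 * L11^{-T}   */
    update(A22, A21, lda, n - h, h);      /* A22 -= L21 * L21^T     */
    cholesky(A22, lda, n - h);            /* L22 = chol(A22)        */
}

int main(void) {
    double A[3 * 3] = { 4, 0, 0,
                        2, 5, 0,
                        2, 3, 6 };        /* SPD, lower triangle    */
    cholesky(A, 3, 3);
    for (int i = 0; i < 3; i++)           /* prints L = [[2],[1,2],[1,1,2]] */
        printf("%.4f %.4f %.4f\n", A[i*3], A[i*3+1], A[i*3+2]);
    return 0;
}
```

The trailing-quadrant update accounts for the Θ(n³) arithmetic, while the factored blocks that ever need to be shared total only Θ(n²) entries, which is consistent with the communication bound stated above.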

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • David S. Wise¹
  • Craig Citro¹
  • Joshua Hursey¹
  • Fang Liu¹
  • Michael Rainey¹
  1. Indiana University, Bloomington