
Programming the LU Factorization for a Multicore System with Accelerators

  • Conference paper
High Performance Computing for Computational Science - VECPAR 2012 (VECPAR 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7851)


Abstract

LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid shared-memory system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
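For readers unfamiliar with the kernel itself, the textbook unblocked, right-looking form of LU factorization with partial pivoting is sketched below. This is a minimal illustration only, not the hybrid CPU/GPU implementation described in the paper. The function names `lu_partial_pivot`, `apply_pivots`, and `reconstruct` are hypothetical, and the 0-based pivot vector merely imitates the row-interchange convention of LAPACK's `ipiv`.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def lu_partial_pivot(a: Matrix) -> Tuple[Matrix, List[int]]:
    """Factor a square matrix in place so that P*A = L*U (unblocked sketch).

    On return the strict lower triangle holds L (unit diagonal implied) and
    the upper triangle holds U. piv[k] is the row interchanged with row k at
    step k (0-based, modeled loosely on LAPACK's ipiv).
    """
    n = len(a)
    piv: List[int] = []
    for k in range(n):
        # Partial pivoting: choose the row at or below k with the largest |a[i][k]|.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ZeroDivisionError(f"matrix is singular at column {k}")
        piv.append(p)
        # Swap entire rows, so earlier multipliers (L part) move as well.
        a[k], a[p] = a[p], a[k]
        # Divide the column below the pivot to form the multipliers (column k of L).
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
        # Rank-1 update of the trailing submatrix.
        for i in range(k + 1, n):
            lik = a[i][k]
            for j in range(k + 1, n):
                a[i][j] -= lik * a[k][j]
    return a, piv


def apply_pivots(a: Matrix, piv: List[int]) -> Matrix:
    """Return a copy of a with the recorded row interchanges applied (P*A)."""
    rows = [row[:] for row in a]
    for k, p in enumerate(piv):
        rows[k], rows[p] = rows[p], rows[k]
    return rows


def reconstruct(lu: Matrix) -> Matrix:
    """Multiply the packed factors back together to form L*U."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            # L[i][k] is nonzero only for k <= i, U[k][j] only for k <= j.
            for k in range(min(i, j) + 1):
                lik = 1.0 if k == i else lu[i][k]
                s += lik * lu[k][j]
            out[i][j] = s
    return out


if __name__ == "__main__":
    A = [[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]]
    lu, piv = lu_partial_pivot([row[:] for row in A])
    print(piv)  # [1, 1, 2]
    print(reconstruct(lu) == apply_pivots(A, piv))  # True
```

Step k performs about 2(n-k)^2 flops in the trailing update, for roughly 2n^3/3 flops overall. Because the unblocked loop is dominated by rank-1 updates, high-performance codes usually factor a narrow panel and then apply the trailing update as a matrix-matrix multiply (GEMM), the kind of operation that runs efficiently on GPU accelerators.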





Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kurzak, J., Luszczek, P., Faverge, M., Dongarra, J. (2013). Programming the LU Factorization for a Multicore System with Accelerators. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science - VECPAR 2012. VECPAR 2012. Lecture Notes in Computer Science, vol 7851. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38718-0_6


  • DOI: https://doi.org/10.1007/978-3-642-38718-0_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38717-3

  • Online ISBN: 978-3-642-38718-0

  • eBook Packages: Computer Science, Computer Science (R0)
