Abstract
LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid shared-memory system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
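For readers unfamiliar with the procedure named in the abstract, the following is a minimal, unblocked textbook sketch of LU factorization with partial pivoting in plain Python. It is not the implementation described in the paper, which targets multiple CPU cores and GPUs; the function name `lu_partial_pivot` and the list-of-lists matrix representation are assumptions made for illustration only.

```python
def lu_partial_pivot(a):
    """Factor a square matrix A so that P*A = L*U.

    Returns (lu, piv): lu holds U on and above the diagonal and the
    multipliers of unit-lower-triangular L below it; piv[i] is the index
    of the original row of A that ends up in row i of P*A.
    Illustrative sketch only (hypothetical helper, not from the paper).
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy, leave the input intact
    piv = list(range(n))
    for k in range(n):
        # Partial pivoting: choose the largest-magnitude entry in column k
        # among rows k..n-1 and move it to the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            piv[k], piv[p] = piv[p], piv[k]
        # Eliminate entries below the pivot, storing the multipliers in place.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv
```

Because each pivot is the largest entry in its column, every stored multiplier has magnitude at most one, which is what keeps the elimination numerically well behaved in practice. High-performance implementations typically reorganize these loops into blocked panel factorizations and matrix-matrix updates.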
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kurzak, J., Luszczek, P., Faverge, M., Dongarra, J. (2013). Programming the LU Factorization for a Multicore System with Accelerators. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science - VECPAR 2012. VECPAR 2012. Lecture Notes in Computer Science, vol 7851. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38718-0_6
DOI: https://doi.org/10.1007/978-3-642-38718-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38717-3
Online ISBN: 978-3-642-38718-0
eBook Packages: Computer Science, Computer Science (R0)