Heterogeneous Acceleration for Linear Algebra in Multi-coprocessor Environments

  • Azzam Haidar
  • Piotr Luszczek
  • Stanimire Tomov
  • Jack Dongarra
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8969)


We present an efficient and scalable programming model for the development of linear algebra software in heterogeneous multi-coprocessor environments. The model incorporates some of the current best design and implementation practices for the heterogeneous acceleration of dense linear algebra (DLA). Examples are given with the fundamental algorithms for solving linear systems: the LU, QR, and Cholesky factorizations. To generate the extreme level of parallelism needed for the efficient use of coprocessors, the algorithms of interest are redesigned and then split into well-chosen computational tasks. The execution of these tasks is scheduled over the computational components of a hybrid system of multi-core CPUs and coprocessors using a light-weight runtime system. The light-weight runtime keeps the scheduling overhead low while enabling the expression of parallelism through otherwise sequential code. This simplifies the development effort and allows the unique strengths of the various hardware components to be exploited.





This research was supported in part by the National Science Foundation under Grants OCI-1032815, ACI-1339822, and Subcontract RA241-G1 on NSF Prime Grant OCI-0910735, DOE under Grants DE-SC0004983 and DE-SC0010042, and Intel Corporation.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Azzam Haidar (1)
  • Piotr Luszczek (1)
  • Stanimire Tomov (1)
  • Jack Dongarra (1)(2)(3)

  1. University of Tennessee, Knoxville, Knoxville, USA
  2. Oak Ridge National Laboratory, Oak Ridge, USA
  3. University of Manchester, Manchester, UK
