A Fully Empirical Autotuned Dense QR Factorization for Multicore Architectures

  • Emmanuel Agullo
  • Jack Dongarra
  • Rajib Nath
  • Stanimire Tomov
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6853)


Tuning numerical libraries has become more difficult over time as systems grow more sophisticated. In particular, modern multicore machines make the behaviour of algorithms hard to forecast and model. In this paper, we tackle the problem of tuning a dense QR factorization on multicore architectures using a fully empirical approach. We exhibit a few strong empirical properties that enable us to prune the search space efficiently. Our method is automatic, fast, and reliable: the tuning process is performed entirely at install time, in less than one hour and ten minutes on five out of seven platforms. We achieve an average performance ranging from 97% to 100% of the optimum, depending on the platform. This work is a basis for autotuning the PLASMA library and enables easy performance portability across hardware systems.
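The empirical approach described above can be sketched as a benchmarking loop over candidate tuning parameters. The sketch below is purely illustrative and is not the paper's or PLASMA's actual tuner: the kernel (a crude panel-by-panel QR sweep via NumPy), the candidate tile sizes, and the `prune_factor` heuristic are all assumptions chosen to show the shape of an install-time empirical search with pruning.

```python
# Hypothetical sketch of an empirical autotuning loop: time a QR-like
# kernel for each candidate tile size NB, keep the fastest, and prune
# candidates that are clearly slower than the best seen so far.
# All names (time_tiled_qr, autotune, prune_factor) are illustrative.
import time
import numpy as np

def time_tiled_qr(n, nb, reps=3):
    """Best-of-reps wall time of a crude tile-by-tile QR sweep (stand-in kernel)."""
    a = np.random.rand(n, n)
    best = float("inf")
    for _ in range(reps):
        m = a.copy()
        t0 = time.perf_counter()
        for j in range(0, n, nb):            # factor each panel of width nb
            np.linalg.qr(m[j:, j:j + nb])
        best = min(best, time.perf_counter() - t0)
    return best

def autotune(n=512, candidates=(32, 64, 96, 128, 192, 256), prune_factor=1.5):
    """Empirically pick NB, pruning candidates much slower than the current best."""
    best_nb, best_t = None, float("inf")
    for nb in candidates:
        t = time_tiled_qr(n, nb)
        if t < best_t:
            best_nb, best_t = nb, t
        elif t > prune_factor * best_t:
            # Empirical pruning: stop once performance degrades sharply,
            # assuming the performance curve is roughly unimodal in NB.
            break
    return best_nb

print("selected NB:", autotune())
```

A real tuner would additionally sweep the inner blocking size and matrix sizes, and would use the library's own factorization kernels rather than a NumPy proxy.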


Keywords: Directed Acyclic Graph · Tuning Process · Tile Size · Multicore Architecture · Panel Factorization
(These keywords were added by machine, not by the authors; the process is experimental and the keywords may be updated as the learning algorithm improves.)





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Emmanuel Agullo (1)
  • Jack Dongarra (2)
  • Rajib Nath (2)
  • Stanimire Tomov (2)

  1. LaBRI and INRIA Bordeaux Sud-Ouest, France
  2. University of Tennessee, USA
