
Implementing LU and Cholesky factorizations on artificial intelligence accelerators


LU and Cholesky factorizations of dense matrices are among the most fundamental building blocks in many numerical applications. Because of their \(O(n^3)\) complexity, they may be the most time-consuming basic kernels in numerical linear algebra. For this reason, accelerating them on a variety of modern parallel processors has received much attention. In this paper, we implement LU and Cholesky factorizations on novel massively parallel artificial intelligence (AI) accelerators originally developed for deep neural network applications. We explore the data parallelism of the matrix factorizations and exploit the neural compute units and on-chip scratchpad memories of modern AI chips to accelerate them. Experimental results show that our optimization methods improve performance, reaching up to 41.54 and 19.77 GFlop/s in single precision and 78.37 and 33.85 GFlop/s in half precision for LU and Cholesky factorizations, respectively, on a Cambricon AI accelerator.
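To make the two kernels and the GFlop/s metric concrete, the following is a minimal pure-Python sketch of textbook unblocked Cholesky and LU (with partial pivoting) factorizations. It is not the accelerator implementation described in the paper; the function names `cholesky`, `lu`, and `gflops` are hypothetical, and the rate helper assumes the standard leading-order operation counts of roughly \(n^3/3\) flops for Cholesky and \(2n^3/3\) flops for LU.

```python
import math


def cholesky(a):
    """Return lower-triangular l with l * l^T == a (a symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: remove the contribution of columns already computed.
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def lu(a):
    """Return (perm, m): m packs unit-lower L (below diagonal) and U (on and above).

    Row i of the factored matrix corresponds to row perm[i] of the input.
    """
    n = len(a)
    m = [row[:] for row in a]  # work on a copy
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the largest-magnitude entry in column k.
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            m[i][k] /= m[k][k]  # multiplier, stored in place of the zeroed entry
            for j in range(k + 1, n):
                m[i][j] -= m[i][k] * m[k][j]  # update the trailing submatrix
    return perm, m


def gflops(n, seconds, kind):
    """Approximate rate in GFlop/s from leading-order flop counts (assumed model)."""
    flops = (2.0 if kind == "lu" else 1.0) * n ** 3 / 3.0
    return flops / seconds / 1e9
```

The triple loop nest in both routines is where the \(O(n^3)\) cost comes from. The trailing-submatrix update in `lu` applies the same multiply-subtract to many independent entries, which is the kind of data parallelism the abstract refers to.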






We thank all the reviewers for their invaluable comments. Weifeng Liu is the corresponding author of this paper. This research was supported by the National Natural Science Foundation of China under Grant No. 61972415, and by the Science Foundation of China University of Petroleum, Beijing under Grant Nos. 2462019YJRC004 and 2462020XKJS03.

Author information



Corresponding author

Correspondence to Weifeng Liu.


About this article


Cite this article

Lu, Y., Luo, Y., Lian, H. et al. Implementing LU and Cholesky factorizations on artificial intelligence accelerators. CCF Trans. HPC (2021).



  • LU factorization
  • Cholesky factorization
  • AI accelerator