Parallel Computation of Echelon Forms

  • Jean-Guillaume Dumas
  • Thierry Gautier
  • Clément Pernet
  • Ziad Sultan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8632)


We propose efficient parallel algorithms and implementations, on shared-memory architectures, of LU factorization over a finite field. Compared to the corresponding numerical routines, we identify three main specificities of linear algebra over finite fields. First, the arithmetic complexity can be dominated by modular reductions; it is therefore mandatory to delay these reductions as much as possible, while mixing fine-grain parallelizations of tiled iterative and recursive algorithms. Second, fast linear algebra variants, e.g., using the Strassen-Winograd algorithm, never suffer from instability and can thus be widely used in cascade with the classical algorithms; there, trade-offs must be made between block sizes well suited to those fast variants and block sizes suited to load and communication balancing. Third, many applications over finite fields require the rank profile of the matrix (which is quite often rank deficient) rather than the solution to a linear system. It is thus important to design parallel algorithms that preserve and compute this rank profile. Moreover, as the rank profile is only discovered during the algorithm, the block size has to be dynamic. We propose and compare several block decompositions: tile iterative with left-looking, right-looking and Crout variants, slab recursive, and tile recursive. Experiments demonstrate that the tile recursive variant performs better and matches the performance of reference numerical software when no rank deficiency occurs. Furthermore, even in the most heterogeneous case, namely when all pivot blocks are rank deficient, we show that it is possible to maintain high efficiency.
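To illustrate the first point above, delaying modular reductions, the following is a minimal Python sketch of a dot product over Z/pZ in which the reduction is applied only when a signed 64-bit accumulator could otherwise overflow, instead of after every multiply-add. This is a sketch of the principle only, not the authors' implementation: the function name and the simple periodic-reduction schedule are illustrative assumptions, and `p` is assumed to be a word-size prime.

```python
def dot_mod_delayed(u, v, p):
    """Dot product of u and v modulo p (entries in [0, p-1]),
    with modular reductions delayed as long as a signed 64-bit
    accumulator would remain exact.

    Illustrative sketch only; assumes p is a word-size prime.
    """
    assert len(u) == len(v)
    # Each product is at most (p-1)^2, so k of them fit in a signed
    # 64-bit accumulator as long as k * (p-1)^2 < 2^63.
    k = (2**63 - 1) // ((p - 1) ** 2)
    acc = 0
    for i, (a, b) in enumerate(zip(u, v), start=1):
        acc += a * b      # accumulate without reducing
        if i % k == 0:
            acc %= p      # reduce only every k terms
    return acc % p


p = 131071  # word-size prime (2^17 - 1)
u = [i % p for i in range(1000)]
v = [(3 * i + 7) % p for i in range(1000)]
assert dot_mod_delayed(u, v, p) == sum(a * b for a, b in zip(u, v)) % p
```

The same accumulate-then-reduce schedule, applied to whole matrix-multiplication blocks rather than single dot products, is what makes block size a performance-critical parameter in the algorithms compared here.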


Keywords: Recursive algorithm, modular reduction, block algorithm, echelon form, block splitting





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Jean-Guillaume Dumas (1)
  • Thierry Gautier (2)
  • Clément Pernet (3)
  • Ziad Sultan (1, 2)
  1. LJK-CASYS, UJF, CNRS, Inria, G’INP, UPMF, Grenoble, France
  2. LIG-MOAIS, UJF, CNRS, Inria, G’INP, UPMF, Grenoble, France
  3. LIP-AriC, UJF, CNRS, Inria, UCBL, ÉNS de Lyon, Lyon, France
