## Introduction

The solution of linear systems is still one of the most common operations in scientific computing. The efficient solution of these systems is a key ingredient for many higher level algorithms. In our contribution we focus on the solution of

\begin{aligned} AX=B, \end{aligned}
(1)

where $$A\in \mathbb {R}^{m\times m}$$ is a non-symmetric matrix, $$B\in \mathbb {R}^{m\times n}$$, and $$X\in \mathbb {R}^{m\times n}$$ with $$n \gg 1$$ without any special structure. Especially, the case $$m=n$$, which appears in the computation of the generalized matrix sign function [5], is of higher interest. The standard implementation for this problem is using the LU decomposition with an additional forward/backward substitution. Regarding current GPU accelerated implementations, where we consider the MAGMA [2, 9, 10] version the state-of-the-art, only the LU decomposition works on more than one accelerator device. This is a crucial point to reduce the runtime and as a consequence also the energy consumption.

A closer look to the LU decomposition based solution process shows an additional problem: The three step scheme factorizing the matrix $$PA=LU$$, followed by the forward solve $$LY=PB$$, and the backward solve $$UX=Y$$, leads to the matrix being transferred between the main memory and the computational unit for three times. Since memory accesses are a key player in the energy balance, this also results in a notable additional power consumption.

In order to be able to not only factorize the matrix on multiple accelerators but also solve the linear system without the communication intensive distributed forward–backward substitution we recall the Gauss–Jordan-elimination algorithm. In contrast to the LU decomposition, the integration of the right hand side in the computation scheme removes the data dependencies which appear in the forward/backward substitution scheme. This enables us to derive an easy and efficient multi-GPU implementation of the solution process.

Beside focusing on the classical objective of minimizing the time-to-solution, we also take the energy-to-solution into account. Regarding the hardware developments during the recent years, an increasing number of the scientific computing hardware has been set up as heterogeneous architectures. Most commonly the combination of general purpose CPUs together with GPU-based accelerator devices is used. These architectures allow to choose the computation device that is best suited for a given task with respect to either the time-to-solution, or in the green HPC context, combinations with the energy-to-solution. The aforementioned differing properties of the LU decomposition and the Gauss–Jordan-elimination, with respect to saving memory transfers and the usage of multiple accelerators for the overall process, define the objective to look for obtaining the method that performs best with respect to both the time and the energy metric when we want to solve linear systems with many right hand sides.

In the following sections we recall the Gauss–Jordan-elimination and reformulate it to work on heterogeneous system with multiple accelerator devices. Furthermore, we identify and avoid bottlenecks of the GPU based implementation, which slow down the computation on the one hand, but are keeping the accelerator devices busy and consuming energy on the other hand. Finally, the numerical experiments compare our Gauss–Jordan based solver with the LU decomposition from MAGMA library [2, 9, 10], which can be seen as the GPU accelerated implementation of LAPACK. These experiments focus on the runtime as well as on the energy consumption of both algorithms and the multi-accelerator scalability of the Gauss–Jordan approach and the LU decomposition based solution process.

## Gauss–Jordan-elimination

The Gauss–Jordan-elimination (GJE) process is mostly known as an algorithm to invert matrices [1, 7]. Obviously, inverting the system matrix A first and multiplying the right hand side B by $$A^{-1}$$ afterwards is a quite expensive way to solve a linear system. It requires more than $$2m^3+2m^2n$$ flops. However, the involved matrix-matrix products can be easily performed in parallel on multiple accelerator devices. On the other hand, the typical way of calculation using the LU decomposition followed by a forward/backward substitution only costs $$\frac{2}{3}m^3+2m^2n$$ flops. The data dependency caused by the L and the U factors increase the communication necessary during the solution phase. The Gauss–Jordan-elimination scheme derived in the next paragraphs will reduce (when comparing the inversion and the matrix multiplication) the number of flops to only $$m^3+2m^2n$$, for the solution of one linear system, without introducing data dependencies that break the easy use of multiple accelerator devices.

### Basic scheme

We consider the augmented matrix

(2)

of size $$m\times (m+n)$$, which concatenates the system matrix A and the right hand side B. Now, the goal of the classic Gauss–Jordan-elimination is to transform the first m columns of the matrix D to the identity using elementary operations from the left, while the last n columns, initially containing the right hand side, are overwritten by the solution X, simultaneously. We use the formulation with column pivoting. Therefore, for column $$i\in \{1,\dots , m\}$$ of A, every elementary operation consists of a row permutation $$P_i$$ which exchanges the i-th and the k-th row ($$k\ge i$$ such that $$a_{ki}$$ has the largest absolute value) and a Gauss transformation

\begin{aligned} G_i = \begin{bmatrix} 1&&-\frac{a_{1i}}{a_{ii}}&\\&\ddots&\vdots&\\&1&-\frac{a_{(i-1)i}}{a_{ii}}&&\\&&\frac{1}{a_{ii}}&&\\&&-\frac{a_{(i+1)i}}{a_{ii}}&1&\\&&\vdots&\ddots&\\&&-\frac{a_{mi}}{a_{ii}}&&1 \\ \end{bmatrix}. \end{aligned}
(3)

creating a one at the new (ii) position in A, or D. Thus, we end up with

\begin{aligned} \underbrace{G_nP_n\cdots G_2P_2 G_1P_1}_{A^{-1}} D = \left[ \begin{array}{lll|lll} 1 &{} &{} &{} x_{11} &{} \cdots &{} x_{1n} \\ &{}\ddots &{} &{}\vdots &{}\vdots &{}\vdots \\ &{} &{} 1 &{} x_{m1} &{} \cdots &{} x_{mn} \end{array}\right] , \end{aligned}
(4)

which gives us $$A^{-1}$$ and X with a cost of $$2m^3+2m^2n$$ flops.

### Block reformulation

Since it is well known that only BLAS level-3 enabled algorithms are able to get near the maximum theoretic performance of the computation device we reformulate the above idea into a blocked algorithm. To this end, we first rewrite the update (3) into a rank-1 update [7]:

\begin{aligned}&D := D -\frac{1}{a_{ii}}{\left( a_{1i}, \ldots ,a_{(i-1)i},0,a_{(i+1)i},\ldots ,a_{mi} \right) }^T D_{i,\cdot }\\&D_{i,\cdot }:=\frac{1}{a_{ii}}D_{i,\cdot }. \end{aligned}

Afterwards, we repartition the augmented matrix D into

\begin{aligned} D:=\left[ \begin{array}{l|l|l|l} A_{11}\; &{} \; A_{12} \; &{} \; A_{13} \; &{} \; b_{1} \\ \hline A_{21}\; &{} \; A_{22} \; &{} \; A_{23} \; &{} \; b_{2} \\ \hline A_{31}\; &{} \; A_{32} \; &{} \; A_{33} \; &{} \; b_{3} \\ \end{array}\right] , \end{aligned}
(5)

where $$A_{22}$$ is of dimension $$N_B\times N_B$$. In this way the rank-1 update is transformed into a rank-$$N_B$$ update by replacing the scalar operations by their corresponding matrix valued counterparts:

(6)

In order to preserve numerical stability, we have to lift the pivoting strategy to the block reformulation, as well [7]. This can be done either by implementing an unblocked Gauss–Jordan-elimination as shown in (4) for the panel $$\begin{bmatrix} A_{12}^T&A_{22}^T&A_{32}^T \end{bmatrix}^T$$, or by computing the pivoted LU decomposition of $$\begin{bmatrix}A_{22}^T&A_{32}^T\end{bmatrix}^T$$. The later leads to

\begin{aligned} P\begin{bmatrix}A_{22}\\ A_{32}\end{bmatrix} = \begin{bmatrix}L_{1}\\ L_{2}\end{bmatrix}U_{1}. \end{aligned}
(7)

The obtained permutation is applied to the augmented matrix D by

\begin{aligned} D:= \begin{bmatrix} I_{11} \quad&\\&\quad P \end{bmatrix}D, \end{aligned}

where $$I_{11}$$ is the identity matrix of the same size as $$A_{11}$$ in the partitioning (5). By introduction of the panel matrix H

\begin{aligned} H = \begin{bmatrix} -A_{12}U_{1}^{-1}L_1^{-1} \\ U_1^{-1}L_1^{-1} \\ -L_{2}L_1^{-1} \end{bmatrix} \end{aligned}

the Update (6) transforms into

\begin{aligned} D:= \left[ \begin{array}{l|l|l|l} A_{11} &{} 0 &{} A_{13} &{} b_{1} \\ \hline 0 &{} 0 &{} 0 &{} 0 \\ \hline A_{31} &{} 0 &{} A_{33} &{} b_{3} \\ \end{array}\right] + H \begin{bmatrix} A_{21}&I_{N_B}&A_{23}&b_2 \end{bmatrix}. \end{aligned}
(8)

This procedure will compute the solution of the Linear System (1) as well as the inverse of A. Avoiding the left side of the update corresponding to the block $$A_{21}$$, we only compute the solution of the linear system. The resulting overall procedure using this second alternative, is shown in Algorithm 1. By only computing the solution of the linear system, we need $$m^3+2m^2n$$ flops to solve a linear system. The negligible influence of the block size $$N_B$$ and its determination is shown in [3]. As already pointed out in the introduction, we observe that the Gauss–Jordan-elimination scheme only needs to sweep over the matrix A once instead of three times of the LU decomposition based solution.

### Distributed algorithm

The previous section showed that beside calculating the current panel matrix H, we only need the GEMM operation to solve the linear system. This induces that inside a column or a block column the only data dependency is the knowledge of the current panel H. Once this is known all columns can be updated independent from each other. This motivates to distribute in a cyclic block-column way across all participating computational devices. Algorithm 1 only needs to distribute the augmented matrix D over all computational devices and to broadcast the current panel matrix H in every iteration. The GEMM calls for the permutation updating the remaining part of the matrix and the right hand side can be performed in parallel on all computational devices. At this point we have to remark that the cyclic block-column distribution is only efficient if the number of computational device is relatively small. Normally this should not affect our idea because the maximum number of accelerator devices inside one compute server is relatively small ($$\le$$8) by nature.

### Memory access analysis

In this subsection we count the number of memory accesses required by the LU decomposition and the GJE under some simplifications due to the complexity of modern hardware. We assume that only scalar values stay inside the cache of the CPU or GPU. By increasing the dimension of the problems matrices and vectors one can make them large enough to no longer fit into caches. Finally, we assume both algorithms, the LU decomposition and the GJE, use only rank-1 updates without pivoting, i.e., we regard their BLAS level-2 formulation.

Each step k in the LU decomposition consists of a vector scaling and a rank-1 update. The vector scaling needs $$1+k$$ memory reads and k writes. Processing the rank-1 update row-by-row, each row needs $$2k+1$$ reads and k writes. For $$k=m-1,\ldots ,1$$ the overall LU decomposition needs

\begin{aligned} \sum \limits _{k=1}^{m} 3k^2+2k+1 = {m}^{3}-\frac{1}{2}{m}^{2}+\frac{1}{2}m-1 \end{aligned}

memory accesses. A forward (or backward) substitution for one column of the right hand side needs

\begin{aligned} \sum \limits _{i=1}^m\left[ 3+\sum \limits _{j=1}^{i-1}4\right] = 2 m^2 + m \end{aligned}

memory accesses. Together with the assumption from the introduction ($$n = m$$) this yields

\begin{aligned} 5\,{m}^{3}+\frac{3}{2}{m}^{2}+\frac{1}{2}\,m-1 \end{aligned}
(9)

memory accesses to solve a linear system with m right hand sides.

The whole Gauss–Jordan-elimination scheme consists only of $$m-1$$ vector scales of length m and $$m-1$$ rank-1 updates of size $$m\times (k+n)$$ where $$k=m-1,\ldots ,1$$. Each update costs $$(3(k+n)+1)m$$ memory accesses. Again, together with the assumption $$n=m$$, this gives

\begin{aligned} \sum \limits _{k=1}^{m-1}m+(3(k+m)+1)m = \frac{9}{2} {m}^{3} - \frac{5}{2}{m}^{2}-2m \end{aligned}
(10)

memory accesses. Compared to the LU decomposition this saves the transfer of $$\frac{1}{2}m^3$$ elements.

## GPU implementation

The GPU implementation splits into two parts. First, we introduce a look-ahead scheme in order to introduce parallelism between host CPU and the accelerator device. Second, we discuss reasons for changing from the classical Fortran column major matrix storage to the row major storage known from C.

Beside the high computational power of the accelerator devices, a work sharing between CPU and GPU is a key ingredient for reducing the runtime. Therefore, we divide the operation performed by Algorithm 1 into two classes, those well suited for the GPU and those to be run on the host CPU(s). The operations well suited for the GPU are the GEMM operations, benefiting from the high computational power of the GPU, and the row swap of the pivoting, because of the high memory throughput of the GPU. The preparation of the panel matrix H is better suited for the CPU, due to the lower number of required operations in general and large sequential parts. Copying the current panel $$\begin{bmatrix}A_{12}^T&A_{22}^T&A_{23}^T\end{bmatrix}^T$$ to the CPU, computing H moving it back to the GPU, results in a simple GPU accelerated implementation of Algortihm 1.

### Look-ahead and asynchronous operation

In the basic GPU accelerated algorithm only the CPU or the GPU will work at a time. Since the GPUs are able to transfer data between the host and their memory while performing computation, we introduce a classic look-ahead strategy with the aim to prepare the next panel $$\tilde{H}$$ on the CPU, while the GPU still updates the remaining parts of the augmented matrix D. Therefore, we split the third column of the block partitioning (5) into

\begin{aligned} \begin{bmatrix} A_{13} \\ A_{23} \\ A_{33} \end{bmatrix} := \begin{bmatrix} \hat{A}_{13}&\quad \bar{A}_{13}\\ \hat{A}_{23}&\quad \bar{A}_{23} \\ \hat{A}_{33}&\quad \bar{A}_{33} \end{bmatrix}, \end{aligned}

where $$\hat{A}_{13}$$, $$\hat{A}_{23}$$, and $$\hat{A}_{33}$$ have $$N_B$$ columns. This small additional repartitioning allows us to update $$\hat{A}_{13}$$, $$\hat{A}_{23}$$, and $$\hat{A}_{33}$$ first and copy them back to the host while the GPU performs the remaining updates on $$\bar{A}_{13}$$, $$\bar{A}_{23}$$, and $$\bar{A}_{33}$$. This way, the host prepares the next panel matrix $$\tilde{H}$$, while the GPU still works on the previous update. The new panel matrix $$\tilde{H}$$ is potentially copied back to the device in the same asynchronous while the updates are still ongoing. After the updates on $$\bar{A}_{13}$$, $$\bar{A}_{23}$$, and $$\bar{A}_{33}$$ we synchronize the computations done on the device to ensure that the updated $$\tilde{H}$$ has been tranferred completely before it is used.

### Data layout

Due to the fact that the BLAS libraries available for GPUs, namely Nvidia ® cuBLAS for CUDA enabled devices and clBLAS for OpenCL based devices, use the same matrix storage scheme as the CPU based libraries BLAS and LAPACK we implement Algorithm 1 using column major storage. Thereby, preliminary experiments showed that 20–37.5 % of the computation time on the GPU (Nvidia ® Tesla K20) is spent in applying the permutation P to the remaining parts of the matrix (Step 2.2 of Algorithm 1). Previous work on the Gauss–Jordan-elimination based solvers [3] is affected by this issue. From this implementation we obtained Table 1 showing the percentage of time used for applying the permutation P.

A thorough analysis of the operation shows that permuting rows, despite being an easy operation in terms of the column major storage scheme, disturbs the coalesced memory access scheme of accelerator device. The reason is that for each row swap at most two elements are used out of each cache line. Assuming a cache line length of 128 bytes, which is typically used on a Nvidia ® CUDA device, and double precision arithmetic, we only use 8 or 16 bytes out of a cache line, which means that 93.5 or 87.5 % of the data transferred from the memory is not used, while its transfer is wasting time and energy, and slows down the overall process. Regarding the length of a cache line, it would be better if we could store 16 elements of a row in one cache line such that exchanging 16 elements of two rows only requires a transfer of 256 bytes from the memory, instead of 4kB. The direct consequence of this fact is that on the accelerator device the row major storage scheme is better suited for row swaps. This requires that at least the part of the algorithm working on the device needs to be adjusted to row major storage. Beside some copy operations this only affects the GEMM operation.

We assume that a call to GEMM( $$\alpha$$, A, B, $$\beta$$, C) computes $$C:=\alpha AB + \beta C$$, where $$A\in \mathbb {R}^{m\times k}$$, $$B\in \mathbb {R}^{k\times n}$$, and $$C\in \mathbb {R}^{m\times n}$$ stored in (Fortran) column major order. Furthermore, it is obvious that $$A^T$$ in column major storage is the same as A in row major storage. It follows that the column major oriented routine GEMM( $$\alpha$$, B, A, $$\beta$$, C) then computes

\begin{aligned} C^T := \alpha B^T A^T + \beta C^T, \end{aligned}

if the matrices are stored in row major, which gives our original GEMM operation in column major storage. Furthermore, that means that we only have to switch the dimension arguments and the roles of A and B for using the classical GEMM operations with the row major scheme. The numerical experiments show that this reduces the influence of row swap operation. The transpose operation which is additionally required is only performed after the initial transfer to the device and before the final results are copied back to the host. Therefore, they can be neglected in the performance analysis as the numerical results show.

## Experimental results

In the experimental results we will compare our Gauss–Jordan-elimination based approach with the well established GPU accelerated implementation of the LU decomposition from the runtime as well as from the energy point of view. Therefore, we choose random matrices $$A\in \mathbb {R}^{m\times m}$$ with $$m = 1024k$$, $$k\in \mathbb {N}$$, and a predefined solution $$X\in \mathbb {R}^{m\times n}=1$$. From this we determine the right hand side as $$B = AX$$. We set n equal to m in order to cover the case we explained in the introduction. All experiments are done in double precision arithmetic on a dual-socket 16 core Intel® Xeon® E5-2640v3 with 64GB RAM and two Nvidia ® Tesla K20 accelerators. The code is compiled using the Intel® C/Fortran compiler 15 with MKL 11.2 as host BLAS library. The device code uses Nvidia ® CUDA 7.5. The power consumption is measured using a ZES Zimmer LMG450 power meter with a sampling rate of 20 Hz.

Because of the fact, that reducing the power consumption results in a slower execution, i.e. larger runtime, in many cases, we consider the energy-delay-product (EDP) [4, 6] as a combined economical and ecological measure to compare both solution techniques. The energy-delay-product is defined as

\begin{aligned} {\text {EDP}}(w) = E \cdot T^w, \end{aligned}
(11)

where E is the energy-to-solution, T is the time-to-solution and w a weight factor to penalize the time. Usually, the EDP(1), EDP(2), and EDP(3) values are used for the comparison of algorithms depending on the requirements of the computing center.

Due to the fact that MAGMA [2, 9, 10] supports the LU decomposition on multiple devices, but the forward/backward substitution on a single device only, we use a workaround here. In the case of one GPU we use the MAGMA LU decomposition and the corresponding forward/backward solve on this device. In the case where we use both accelerator devices we only use the multi-GPU LU decomposition and perform the forward/backward substitution on the host CPU.

Before we compare the Gauss–Jordan-elimination approach with the LU decomposition based solvers, we have to determine the optimal block size $$N_B$$ with respect to the solution time and with respect to the energy, which are shown in Table 2. For the decision which optimal value to choose for the experiments, we consider the energy-delay-product shown in Table 3. Except of one case the EDP(1) suggest to chose the time optimal block size, which is therefore used for all remaining experiments.

Figure 1 shows the time-to-solution for all methods in the test. Obviously the usage of MAGMA with more than one GPU and the additional triangular solves on the CPU yield the worst result. Neither from the run-time nor energy points of view can it gain any advantage. Restricting to the one GPU case the Gauss–Jordan-elimination approach gains a speed up against MAGMA although we need 12.5 % more flops to solve. For small problems a speed up of more than 1.5, as shown in Table 4, is possible but even for the large problems our approach is 8 % faster. Regarding the energy-to-solution in Table 5 our approach saves between 8 and 60 % energy for the small and moderate size problems. In the case of the large problems, we see in Fig. 2 that we need approximately the same amount of energy to compute the solution. Because of minimizing the runtime with keeping the energy constant the energy-delay-product in Table 6 suggest to choose the Gauss–Jordan-elimination. Even increasing the weight w will not change this picture due to the fact that the runtime speed up gets more influence in the assessment.

Employing two GPUs we see in Table 4 that our distributed Gauss–Jordan-elimination implementation is the fastest approach for problems beginning at medium size ($$m = n \ge 3\,072$$). The overall speed up with respect to one GPU MAGMA solution is between 2.09 and 2.5. Furthermore, comparing with our one GPU implementation, we see a nearly perfect scaling. This nearly perfect scaling yield a large runtime reduction which easily compensates the additional power necessary for the second GPU. Even in the cases where no difference in runtime exists between the one and the two GPU execution the two GPU version requires less energy than MAGMA on one GPU. The direct comparison between the one and the two GPU implementation shows that using a second device accelerates the computation by a factor of up to 2 and reduces the energy consumption by 22 %. Regarding the energy-delay-product again in Table 6 it suggests to use the two GPU variant if two GPUs are available.

Reasons for the Gauss–Jordan-elimination scheme being so much more efficient than the LU decomposition with forward/backward substitution are the following:

• Gauss–Jordan is an all-at-once approach where one algorithm and one sweep over the augmented matrix D yield the solution of the problem. In comparison to the LU decomposition this reduces the memory transfers, since no additional triangular solves are required.

• The GEMM operation is the only operation necessary on the device. It does not involve data dependencies which yield partly sequential computation schemes, in contrast to the forward/backward substitution. Furthermore, the GEMM operation is known to be the best optimized operation on CPUs, as well as, on GPUs.

• In contrast to the LU decomposition, the GEMM operation inside the Gauss–Jordan-elimination scheme is constant in the number of affected rows and by choosing a proper block size the GEMM operation makes use of the whole device.

• By removing the data dependencies introduced by the forward/backward substitution scheme and only relying on the GEMM operation one can easily obtain and scalable distributed scheme to employ more than one GPU.

Finally, we compare two typical measures used for comparison of HPC systems, namely the flop rate and the energy efficiency in terms of GFlops/s/W [8]. Regarding Fig. 3 we observe that none of the MAGMA based approaches can compete at least with our single GPU code. Even if we have in mind that our approach needs 12.5 % more flops than the LU decomposition we still achieve a higher peak performance. Furthermore, it is easy to see that already small problems lead to a good utilization of the GPU in terms of the achieved flop rate. The two GPU implementations even obtains a good performance with moderate size problems and reaches a peak performance of 1.8 TFlops/s. Again one can see that the missing distributed triangular solve in MAGMA results in a stagnation of the performance. In contrast to the case where we only have one right hand side, in the linear system the forward/backward substitution needs more flops than the LU decomposition. Therefore, only having a distributed factorization is not enough to obtain a fast, and in this way energy efficient, algorithm. Figure 4 shows that we get a value of 3.3 GFlops/s/W for our dual GPU implementation. Compared to the HPC systems in the current Green500 list (November 2015, [8]) we can compete with the Top-30 systems. Using this metric the maximum gain between the Gauss–Jordan-elimination method and the LU decomposition from MAGMA is between 1.5 and 2.5.

## Conclusions

We showed that using the Gauss–Jordan-elimination scheme one can implement a fast and energy efficient solver for dense linear systems with many right hand sides. Furthermore, we showed that reducing memory transfers and data dependencies results in a scalable low-energy algorithm. By only requiring a high performance GEMM operation and minimal data dependencies on the accelerator devices this becomes a portable scheme for future architectures. Comparing the performance and the energy requirements to the MAGMA solution, we see that our implementation needs at most the same energy as the LU decomposition based solution process, but reduces the time to solution. Further optimizations like the influence of dynamic frequency scaling on the GPU and on the CPU are part of future research.