A low-rank update for relaxed Schur complement preconditioners in fluid flow problems

The simulation of fluid dynamic problems often involves solving large-scale saddle-point systems. Their numerical solution with iterative solvers requires efficient preconditioners. Low-rank updates can adapt standard preconditioners to accelerate their convergence. We consider a multiplicative low-rank correction for pressure Schur complement preconditioners that is based on a (randomized) low-rank approximation of the error between the identity and the preconditioned Schur complement. We further introduce a relaxation parameter that scales the initial preconditioner. This parameter can improve the initial preconditioner as well as the update scheme. We provide an error analysis for the described update method. Numerical results for the linearized Navier–Stokes equations in a model for atmospheric dynamics on two different geometries illustrate the action of the update scheme. We numerically analyze various parameters of the low-rank update with respect to their influence on convergence and computational time.


Introduction
Large-scale linear saddle-point systems arise in computational fluid dynamics. A stable discretization of the incompressible time-dependent linearized Navier-Stokes equations leads to systems of the form

( A  B^T ) ( u )   ( f )
( B  0   ) ( p ) = ( 0 ),        (1)

where A ∈ R^{n_u × n_u}, B ∈ R^{n_p × n_u}, u, f ∈ R^{n_u}, and p ∈ R^{n_p}. The matrix block A is dominated by the velocity mass matrix and includes diffusion and advection, both scaled by the time step length. The matrix blocks B and B^T correspond to the divergence and gradient operators, respectively. We solve for the fluid velocity u and the pressure p.
The right-hand side f describes the external forcing. The obtained high-dimensional systems are solved with iterative solvers, which require efficient preconditioners. An overview of the numerical solution of saddle-point problems arising in fluid flow applications and of suitable preconditioners is given in [1] or [2]. A typical choice of preconditioners for 2 × 2-block systems as given in (1) is based on (approximations of) their block LU factorizations. Such block preconditioners typically require approximations to the inverse of the upper left block A of the system matrix in (1) as well as to the (negative) pressure Schur complement S := B A^{-1} B^T. Well-known preconditioning techniques for the Schur complement are SIMPLE-type preconditioners [3,4], the least-squares commutator (LSC) [2,5], the pressure convection-diffusion commutator [6], grad-div preconditioners [7-9], augmented Lagrangian preconditioners [4,10,11], as well as hierarchical matrix preconditioners [12]. An alternative approach that is not based on the block LU factorization is the so-called nullspace method [13].
Any given preconditioner can be modified by low-rank corrections in the hope of improving its performance in an iterative solver. The typical objective is to modify certain spectral properties of the preconditioned matrix: large eigenvalues, but also small eigenvalues close to the origin, may decelerate the convergence of iterative methods. Deflation strategies aim to shift small eigenvalues to zero to make them "invisible" to the iterative solver. Spectral preconditioners aim to shift small eigenvalues by one. Tuning strategies aim to shift large eigenvalues to one. An overview of popular low-rank techniques is given in the survey [14]. In [15], a low-rank update is applied to the velocity-related block A of the system matrix but not to the Schur complement. In [16-18], low-rank techniques are applied to precondition the Schur complement that arises in domain decomposition methods.
The construction of low-rank corrections may pertain to matrices that are not given in explicit form but are only defined by their action on a vector. In [15,16], Arnoldi iterations are used to compute the low-rank approximations. Another approach is to use methods from randomized numerical linear algebra. In [17], the Nyström method is used, which is suitable for symmetric positive definite matrices. In [19,20], a randomized low-rank approximation is obtained by approximating the range of a matrix through sampling with a few random vectors.
The motivation for this work is to analyze whether a certain type of multiplicative low-rank update, recently introduced in [16] in the context of domain decomposition methods, can also be used to improve existing Schur complement preconditioners in block preconditioners for the saddle-point system in (1). The update in [16] is based on a low-rank approximation of the error between the identity matrix and the preconditioned pressure Schur complement. Our conclusion will be that the impact of such a low-rank update on the convergence of the iterative method differs not only between model problems but also between the parameters that are relevant for the construction of the low-rank updates.
In particular, we will consider two different initial Schur complement preconditioners, two different schemes to compute low-rank approximations, and two different model problems all of which we will briefly outline in the following.
The (negative) Schur complement S = B A^{-1} B^T is not available explicitly since A^{-1} is typically only available in the form of (approximate) matrix-vector multiplications, and an explicit S would result in a dense matrix. We consider two different preconditioning approaches as initial preconditioners: (i) the SIMPLE preconditioner [3], which uses an approximation of the Schur complement that replaces A^{-1} by a diagonal matrix, and (ii) the least-squares commutator (LSC) preconditioner [5], which provides an approximation to the Schur complement inverse based on approximate algebraic commuting (i.e., without first computing the explicit Schur complement).
In order to compute low-rank corrections for these initial Schur complement preconditioners, we use the following two methods: (i) Arnoldi iteration, and (ii) a randomized power range finder [19].
Our two model problems both stem from a model for buoyancy-driven atmospheric dynamics and differ in the computational domain, which is either a three-dimensional cube or a thin shell.
A further novel component of our work is the introduction of a relaxation parameter that scales the initial preconditioner before the computation of the update. It turns out that such a relaxation parameter can have a substantial impact on the convergence of the iterative solver.
We will provide an error analysis that connects the error of the low-rank update to the error between the identity matrix and the preconditioned Schur complement where the multiplicative low-rank update has been incorporated into the preconditioner.
In our numerical tests, we investigate the influence of the choice of initial preconditioner, relaxation parameter, low-rank update scheme and the update rank on the convergence behavior and computational times of the iterative solver.
The remainder of this paper is organized as follows: We start in Section 2 with an introduction of the initial preconditioners and the low-rank approximation methods. In Section 3, we review the update scheme introduced in [16] and derive left and right update formulas for a given initial preconditioner. This is followed by an analysis of the influence of the update on the error between the identity and the preconditioned Schur complement in Section 4. Numerical results illustrate the update method in Section 5. We conclude in Section 6 and give an outlook on future research.

Initial preconditioners and low-rank approximation
We start with a common block preconditioner and two well-known Schur complement preconditioners that we will use as initial preconditioners. Then, we discuss two algorithms for the computation of a low-rank approximation that will be needed for the update derived in Section 3.

Initial preconditioners
Block preconditioners for saddle-point systems of the form in (1) can be obtained through block factorizations of the system matrix. We will use the block triangular preconditioner that is based on the block LU factorization

( A  B^T )   ( I         0 ) ( A  B^T )
( B  0   ) = ( B A^{-1}  I ) ( 0  -S  ),

where S := B A^{-1} B^T is the negative pressure Schur complement. The block triangular preconditioner is given by

P = ( A  B^T )
    ( 0  -S  ).

It is considered an ideal preconditioner since it can be shown that GMRES preconditioned with the block triangular preconditioner converges within at most two iterations [21,22]. Since the inverses A^{-1}, S^{-1} are not available, we replace them by preconditioners Â^{-1}, Ŝ^{-1}, which leads to the block triangular preconditioner

P̂ = ( Â  B^T )
    ( 0  -Ŝ  ).        (2)

The velocity block preconditioner Â^{-1} is often implemented by a multigrid method [23]. However, due to the advection in our system, algebraic multigrid methods may fail. Furthermore, the update technique presented in this paper leads to an algorithm that requires the transposed application of the preconditioners, i.e., Â^{-T} v and Ŝ^{-T} v, which is not available for multigrid techniques for unsymmetric systems. Hence, we choose an incomplete LU factorization, ilu(0), to define Â^{-1}.
For the Schur complement, we use two well-known preconditioning techniques, the LSC method [5] and a SIMPLE-type method [3]. They are given by

S_LSC^{-1} := (B D_u^{-1} B^T)^{-1} (B D_u^{-1} A D_u^{-1} B^T) (B D_u^{-1} B^T)^{-1},        (3)
S_SIMPLE^{-1} := (B D_u^{-1} B^T)^{-1},        (4)

where D_u is a diagonal approximation of the velocity mass matrix; we use the diagonal of the velocity mass matrix. We replace the solutions of the Poisson-type problems with the matrix B D_u^{-1} B^T by approximations obtained through incomplete Cholesky factorizations ic(0), which yields the preconditioners Ŝ_LSC^{-1} and Ŝ_SIMPLE^{-1}. Our choice of (the components of) the initial preconditioners has been guided by the desire to use rather straightforward (standard) algorithms. In our section on numerical results, we include tests for some alternative choices, such as multigrid methods for Â^{-1} and inner iterations for Ŝ^{-1}. These tests have led us to believe that our choices of initial preconditioners are reasonable.
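As an illustration, the SIMPLE-type Schur complement approximation in (4) can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: a sparse direct factorization stands in for the ic(0) approximation used in the paper, and the names of the function and its arguments are illustrative.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def simple_schur_preconditioner(B, M_u):
    """SIMPLE-type Schur complement preconditioner, cf. (4).

    A^{-1} in S = B A^{-1} B^T is replaced by D_u^{-1}, where D_u is the
    diagonal of the velocity mass matrix M_u.  The resulting Poisson-type
    matrix B D_u^{-1} B^T is factorized once; each application then costs
    only sparse triangular solves.  (The paper uses ic(0) here instead of
    an exact factorization.)
    """
    D_inv = sp.diags(1.0 / M_u.diagonal())
    S_hat = (B @ D_inv @ B.T).tocsc()
    solve = spla.factorized(S_hat)  # sparse LU, reused for every application
    return solve
```

The returned callable applies (B D_u^{-1} B^T)^{-1} to a pressure-sized vector.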

Low-rank approximation
For the update method, we will need a low-rank approximation of a matrix that is only defined by its action on a vector. We consider two methods: a randomized algorithm that is based on approximating the range of a given matrix M ∈ R^{n×n}, and the Arnoldi iteration.

We start with a description of the randomized algorithm. The goal is to compute a matrix Q ∈ R^{n×l}, l ≪ n, with as few orthonormal columns as possible that satisfies M ≈ Q Q^T M and hence yields a low-rank approximation of M in the form M ≈ Q N^T with N := M^T Q. We compute Q with the randomized power range finder from [19], which captures the dominant action of the given matrix M by sampling with a few random test vectors that form the columns of a test matrix G ∈ R^{n×l}. To improve accuracy and numerical robustness, we apply q power iterations, i.e., we form Y = (M M^T)^q M G, and obtain Q by computing a QR factorization of Y. The quality of Q can be improved by orthonormalizing the result after each multiplication with the given matrix M or its transpose. The needed steps for the randomized low-rank approximation are summarized in Algorithm 1. It can be beneficial to use more test vectors than the desired rank r of the low-rank approximation. We call the number of additional vectors the oversampling parameter k ≥ 0. We create the l = r + k sample vectors with normally distributed, independent entries with mean zero and variance one; these form the columns of the sample matrix G ∈ R^{n×l}.
According to [19], a small oversampling parameter of k = 5 or k = 10 is usually sufficient for Gaussian test matrices; choosing a larger k usually does not improve the approximation further. As stated in [20], a small number of power iterations (q ≤ 3) usually leads to satisfactory accuracy.
Algorithm 1 Randomized low-rank approximation with the power range finder from [19].
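A minimal NumPy sketch of Algorithm 1 might look as follows. The names apply_M and apply_Mt (for the action of M and M^T on a block of vectors) are placeholders for the matrix-free operators used in the paper.

```python
import numpy as np

def randomized_range_finder(apply_M, apply_Mt, n, rank, oversampling=5, q=3, rng=None):
    """Randomized power range finder (a sketch of Algorithm 1).

    Returns Q with `rank` orthonormal columns and N := M^T Q, so that
    M is approximated by Q @ N.T; M is accessed only through apply_M
    and apply_Mt.
    """
    rng = np.random.default_rng(rng)
    l = rank + oversampling                   # number of sample vectors
    Y = apply_M(rng.standard_normal((n, l)))  # Y = M G with Gaussian test matrix G
    for _ in range(q):                        # q power iterations; re-orthonormalize
        Q, _ = np.linalg.qr(Y)                # after every product for stability
        Q, _ = np.linalg.qr(apply_Mt(Q))
        Y = apply_M(Q)
    Q, _ = np.linalg.qr(Y)
    Q = Q[:, :rank]                           # keep the leading `rank` columns
    return Q, apply_Mt(Q)                     # N := M^T Q
```

For a matrix of exact rank at most `rank`, the returned factors reproduce it up to rounding.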
Next, we describe the computation of a low-rank approximation of a matrix M with the Arnoldi iteration. In each iteration, the matrix M is applied to a vector and the result is orthogonalized against the vectors obtained in previous iterations. After r such steps, we find a low-rank approximation of the given matrix M of the form M ≈ V_r H_r V_r^T, where V_r ∈ R^{n×r} has orthonormal columns and H_r ∈ R^{r×r} is an upper Hessenberg matrix. We use the Arnoldi iteration with modified Gram-Schmidt from [24] and reorthogonalize the vectors in the Gram-Schmidt procedure if needed. Algorithm 2 summarizes the low-rank approximation with the Arnoldi method.
Algorithm 2 Low-rank approximation with the Arnoldi method.
Input: action of M ∈ R^{n×n} on a vector, rank r. Output: V_r ∈ R^{n×r} with orthonormal columns and H_r ∈ R^{r×r}.
1 Draw a random initial vector v_1 of size n with norm 1.
2 for j = 1, ..., r do
3   Compute y_j = M v_j.
4   for i = 1, ..., j do
5     Compute H_{i,j} = v_i^T y_j.
6     Compute y_j = y_j − H_{i,j} v_i.
7   end
8   Compute H_{j+1,j} = ||y_j||_2.
9   Compute v_{j+1} = y_j / H_{j+1,j}.
10 end

We briefly discuss the differences between the randomized low-rank approximation and the Arnoldi method. In both methods, the computational costs are dominated by the repeated application of the given matrix to a vector. The randomized low-rank approximation requires (q+1)l applications and (q+1)l transposed applications. The Arnoldi method requires r applications and no transposed applications. Hence, the Arnoldi method is cheaper concerning setup times, but it applies the given matrix to the r vectors sequentially, whereas the randomized low-rank approximation can apply the given matrix to all l vectors in parallel.
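A compact NumPy sketch of Algorithm 2 (modified Gram-Schmidt with one reorthogonalization pass) could read as follows; apply_M is again a placeholder for the matrix-free operator.

```python
import numpy as np

def arnoldi_lowrank(apply_M, n, r, rng=None):
    """Arnoldi low-rank approximation M ≈ V_r H_r V_r^T (a sketch of Algorithm 2)."""
    rng = np.random.default_rng(rng)
    V = np.zeros((n, r + 1))
    H = np.zeros((r + 1, r))
    v1 = rng.standard_normal(n)
    V[:, 0] = v1 / np.linalg.norm(v1)
    for j in range(r):
        y = apply_M(V[:, j])
        for i in range(j + 1):          # modified Gram-Schmidt
            H[i, j] = V[:, i] @ y
            y = y - H[i, j] * V[:, i]
        for i in range(j + 1):          # one reorthogonalization pass
            c = V[:, i] @ y
            H[i, j] += c
            y = y - c * V[:, i]
        H[j + 1, j] = np.linalg.norm(y)
        if H[j + 1, j] < 1e-14:         # lucky breakdown: invariant subspace found
            break
        V[:, j + 1] = y / H[j + 1, j]
    return V[:, :r], H[:r, :r]
```

By construction, V_r has orthonormal columns and H_r = V_r^T M V_r, which the test below checks.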

Error-based update with relaxed initial preconditioner
In this section, we derive a left and a right preconditioner update following the ideas from [16]. We modify the updates by scaling the initial Schur complement preconditioner. Our goal is to adapt the given initial preconditioner Ŝ^{-1} with a low-rank update such that the new preconditioner is a more accurate approximation to the inverse Schur complement S^{-1}. To find a suitable update, we start from the error between the inverse Schur complement and the initial preconditioner,

E := S^{-1} − Ŝ^{-1}.        (6)

We derive left and right update schemes by multiplying (6) by S from the right and from the left, respectively,

E_L := E S = I − Ŝ^{-1} S,    E_R := S E = I − S Ŝ^{-1}.

Solving for S and inverting, we obtain

S^{-1} = (I − E_L)^{-1} Ŝ^{-1},    S^{-1} = Ŝ^{-1} (I − E_R)^{-1}.        (7)

These alternative formulas for the inverse Schur complement S^{-1} are not practical since the matrices E_L, E_R are not available and solving an additional system of size n_p × n_p is problematic concerning computational complexity. We note, however, that we are able to apply the error matrices E_L, E_R to vectors, which allows us to compute low-rank approximations E_L ≈ Q_L N_L^T, E_R ≈ Q_R N_R^T with Q_L, N_L, Q_R, N_R ∈ R^{n_p × r}, r ≪ n_p, and to insert them into (7).
By applying the Sherman-Morrison-Woodbury formula, we find

S_L^{-1} := (I + Q_L (I_r − N_L^T Q_L)^{-1} N_L^T) Ŝ^{-1},    S_R^{-1} := Ŝ^{-1} (I + Q_R (I_r − N_R^T Q_R)^{-1} N_R^T).

These approximations S_L^{-1}, S_R^{-1} to the inverse Schur complement S^{-1} only require the additional solution of an r × r system, r ≪ n_p, instead of the solution of an n_p × n_p system. For symmetric saddle-point systems with a symmetric Schur complement preconditioner Ŝ^{-1}, the matrix E_L is the transpose of the matrix E_R, i.e., computing a low-rank approximation of E_L also yields a low-rank approximation of E_R and vice versa.
To improve the spectral properties of the preconditioner, we introduce a relaxation parameter α that scales the initial preconditioner in the error matrices,

E_{L,α} := I − α Ŝ^{-1} S,        (8)
E_{R,α} := I − α S Ŝ^{-1}.        (9)

We find the relaxed updates by scaling Ŝ^{-1} with α and replacing E_L, E_R with E_{L,α}, E_{R,α}, respectively, in (7). Replacing E_{L,α}, E_{R,α} with low-rank approximations Q_{L,α,r} N_{L,α,r}^T, Q_{R,α,r} N_{R,α,r}^T, with Q, N ∈ R^{n_p × r}, and applying the Sherman-Morrison-Woodbury formula leads to the update formulas

Ŝ_{L,α,r}^{-1} = α (I + Q_{L,α,r} (I_r − N_{L,α,r}^T Q_{L,α,r})^{-1} N_{L,α,r}^T) Ŝ^{-1},        (10)
Ŝ_{R,α,r}^{-1} = α Ŝ^{-1} (I + Q_{R,α,r} (I_r − N_{R,α,r}^T Q_{R,α,r})^{-1} N_{R,α,r}^T).        (11)

We compute the low-rank approximations of E_{L,α}, E_{R,α} with one of the two low-rank factorization algorithms described in Section 2.2. Both matrices depend on the Schur complement S = B A^{-1} B^T, which we cannot evaluate exactly due to the large-scale inverse A^{-1}. We approximate S by replacing the inverse A^{-1} with the preconditioner Â^{-1}, which in our case is given by ilu(0); note that we also need to be able to evaluate Â^{-T} v for the low-rank approximation by the power range finder of Section 2.2. This gives us the approximate error matrices Ẽ_{L,α} := I − α Ŝ^{-1} S̃ and Ẽ_{R,α} := I − α S̃ Ŝ^{-1} with S̃ := B Â^{-1} B^T. We approximate these with Algorithm 1 or Algorithm 2 and obtain the preconditioners Ŝ_{L,α,r}^{-1}, Ŝ_{R,α,r}^{-1}. Applying these updated preconditioners to a vector requires multiplications with thin rectangular matrices, solving an r × r system, and scaling a vector of size n_p in addition to the application of the initial preconditioner Ŝ^{-1}. Numerical tests, as illustrated in Section 5.4, show that the relaxation parameter can have a significant impact on the convergence obtained with the updated preconditioner.

Construction and application of the update
We discuss the practical construction and application of the update scheme along with the respective computational complexities. In the following, we only consider the right update since the derivations for the left update are analogous. To construct the update, we compute the low-rank approximation Ẽ_{R,α} ≈ Q_{R,α,r} N_{R,α,r}^T with Algorithm 1 or 2. Furthermore, we (explicitly) precompute the small matrix H_r := I_r − N_{R,α,r}^T Q_{R,α,r} ∈ R^{r×r} and its exact LU factorization H_r = L_r U_r. Algorithm 3 summarizes these steps for the setup of the preconditioner update.
Algorithm 3 Construction of the update.
We now discuss the setup costs of the update scheme. The randomized low-rank approximation with Algorithm 1 requires creating the n_p × l random test matrix, i.e., generating n_p l random numbers at a cost of O(n_p l). The most expensive part are the (q+1)l matrix-vector products with Ẽ_{R,α} and its transpose: one application of Ẽ_{R,α} or its transpose to a vector requires the application of S̃ = B Â^{-1} B^T and of Ŝ^{-1} to a vector plus a vector addition. Furthermore, we orthogonalize 2(q+1) times with modified Gram-Schmidt with reorthogonalization if needed. The setup costs hence add up to O((q+1)l) applications of S̃ and Ŝ^{-1} plus the orthogonalization costs. The required steps for the application of the preconditioner Ŝ_{R,α,r}^{-1} are summarized in Algorithm 4.
Algorithm 4 Application of the updated preconditioner.
Input: vector x ∈ R^{n_p}, factors Q_{R,α,r}, N_{R,α,r}, L_r, U_r, relaxation parameter α. Output: y = Ŝ_{R,α,r}^{-1} x.
1 Compute z = N_{R,α,r}^T x.
2 Solve L_r w̃ = z.
3 Solve U_r w = w̃.
4 Update x = x + Q_{R,α,r} w.
5 Apply the initial preconditioner y = Ŝ^{-1} x.
6 Scale with the relaxation parameter y *= α.

To apply the update to a vector, we thus only need matrix-vector products with n_p × r- and r × n_p-matrices, the solution of two triangular systems of size r × r, and the addition and scaling of vectors of size n_p. This sums up to application costs of O(n_p r + r^2). If r ≤ √n_p (typically we have r ≪ n_p), the additional cost introduced by the application of the update is linear in the number of pressure degrees of freedom n_p.
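Assuming the low-rank factors Q, N and the action of the initial preconditioner Ŝ^{-1} are given, the setup and application of the right update can be sketched as follows. This is an illustrative sketch; the function and variable names are not from the paper's implementation.

```python
import numpy as np
import scipy.linalg as sla

def make_updated_preconditioner(apply_Shat_inv, Q, N, alpha):
    """Setup and application of the right update (sketch of Algorithms 3 and 4).

    Returns a function applying
        y = alpha * Shat^{-1} (x + Q H_r^{-1} N^T x),
    where H_r = I_r - N^T Q is formed and LU-factorized once in the setup.
    """
    r = Q.shape[1]
    lu, piv = sla.lu_factor(np.eye(r) - N.T @ Q)  # H_r = L_r U_r (setup)

    def apply(x):
        w = sla.lu_solve((lu, piv), N.T @ x)      # small r x r triangular solves
        y = apply_Shat_inv(x + Q @ w)             # thin products + initial preconditioner
        return alpha * y                          # relaxation
    return apply
```

By the Sherman-Morrison-Woodbury formula this is identical to applying α Ŝ^{-1} (I − Q N^T)^{-1}, which the test below checks against a dense solve.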

Error analysis
In this section, we analyze how the right update with relaxation acts on the error between the identity and the preconditioned Schur complement. With the Sherman-Morrison-Woodbury formula and the precomputed LU factorization H_r = I_r − N_{R,α,r}^T Q_{R,α,r} = L_r U_r, we have for the multiplicative update that is used in (11)

(I − Q_{R,α,r} N_{R,α,r}^T)^{-1} = I + X        (12)

with X := Q_{R,α,r} (L_r U_r)^{-1} N_{R,α,r}^T. We rearrange (12) and obtain

Q_{R,α,r} N_{R,α,r}^T (I + X) = X.        (13)

Now, we analyze the error between the identity and the preconditioned Schur complement after applying the update. We obtain

I − S Ŝ_{R,α,r}^{-1} = I − α S Ŝ^{-1} (I + X) = I − (I − E_{R,α})(I + X) = E_{R,α} − X + E_{R,α} X.        (14)

The term

R_α := E_{R,α} − Q_{R,α,r} N_{R,α,r}^T

describes the quality of the low-rank approximation Q_{R,α,r} N_{R,α,r}^T of E_{R,α}, which not only depends on the chosen rank r but also on the quality of the approximation S̃ = B Â^{-1} B^T to the Schur complement S = B A^{-1} B^T used in its construction. We substitute R_α into (14) and obtain, using (13),

I − S Ŝ_{R,α,r}^{-1} = R_α + Q_{R,α,r} N_{R,α,r}^T − X + (R_α + Q_{R,α,r} N_{R,α,r}^T) X = R_α (I + X).        (15)

Hence, the error after applying the update does not only depend on the low-rank approximation error R_α but also on the product R_α X. In terms of the ranks of the involved matrices and assuming best low-rank approximations, the rank of R_α (I + X) is in general not smaller than the rank of R_α, i.e., the low-rank update typically does not reduce the rank of the error between the identity matrix and the (updated) preconditioned Schur complement.
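The final identity of the error analysis above, I − S Ŝ_{R,α,r}^{-1} = R_α (I + X), can be verified numerically on a small random example. The matrices below are generic stand-ins for S and Ŝ^{-1}, not the model problem from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, alpha = 40, 6, 1.3
S = np.eye(n) + 0.2 * rng.standard_normal((n, n))                 # stand-in Schur complement
Shat_inv = np.linalg.inv(S) + 0.05 * rng.standard_normal((n, n))  # inexact initial preconditioner
E = np.eye(n) - alpha * S @ Shat_inv                              # error matrix E_{R,alpha}, cf. (9)
U, s, Vt = np.linalg.svd(E)
Q, Nt = U[:, :r], s[:r, None] * Vt[:r]                            # best rank-r approximation Q N^T
R = E - Q @ Nt                                                    # low-rank approximation error R_alpha
X = Q @ np.linalg.solve(np.eye(r) - Nt @ Q, Nt)                   # multiplicative update, cf. (12)
P = alpha * Shat_inv @ (np.eye(n) + X)                            # updated preconditioner, cf. (11)
err_after = np.eye(n) - S @ P                                     # error after applying the update
factored = R @ (np.eye(n) + X)                                    # the claimed factorization R_alpha (I + X)
```

The two matrices err_after and factored agree up to rounding, independently of how inexact Ŝ^{-1} is, since the identity is purely algebraic.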

Numerical results
In this section, we present numerical results obtained for a model problem describing buoyancy-driven atmospheric dynamics involving the incompressible Navier-Stokes equations. We start with a description of this model problem in Section 5.1. Then, we analyze the efficiency of the update and the influence of the update parameters on convergence and computational times. In Section 5.2, we compare different initial preconditioners to justify our particular choice. In Section 5.3, we discuss the influence of the parameters in the randomized low-rank approximation, i.e., the number of power iterations and the oversampling parameter. Finally, in Section 5.4, we investigate the quality of the low-rank update in dependence of the relaxation parameter and the update rank.

Model problem
This subsection introduces the considered test model that describes atmospheric dynamics. It is based on the Navier-Stokes equations, which describe the dynamics in terms of the pressure p and the velocity u:

∂_t u + (u · ∇)u − (2/Re) ∇ · ε(u) + ∇p = f(u),    ∇ · u = 0.

Here, ε(u) = (1/2)(∇u + (∇u)^T) denotes the strain rate tensor and ∂_t the partial time derivative. The non-dimensional number Re denotes the Reynolds number. It is given as Re = ρ U L / μ, where U is the velocity scale, L is the length scale, ρ is the density of the fluid, and μ is the dynamic viscosity of the fluid. The forcing f(u) includes the buoyancy and the Coriolis force. We consider two three-dimensional domains, a cube and a thin shell. Figure 1 shows example meshes for both geometries. We choose periodic boundary conditions at the sides of the cube that mimic the shell geometry of the atmosphere. Furthermore, we impose no-slip conditions at the lower boundary ∂_lower and no-flux conditions at the upper boundary ∂_upper. For the shell geometry, we use no-slip conditions at the inner boundary ∂_inner and no-flux conditions at the outer boundary ∂_outer.

The Navier-Stokes equations are temporally discretized with the semi-implicit Euler method; all terms are discretized implicitly except for the forcing. For the spatial discretization, we use Taylor-Hood elements of the lowest order on hexahedral cells. Linearization with the Picard iteration leads to (a sequence of) saddle-point systems of the form (1) for each Picard iteration and each time step. The matrix blocks B and B^T correspond to the divergence and the gradient. The matrix block A is dominated by the velocity mass matrix; furthermore, it includes diffusion and advection, both scaled with the time step length, where the diffusion is additionally scaled with the inverse Reynolds number. Our choice of the model parameters leads to a Reynolds number of about 708.8. We test our preconditioners for the system that we obtain during the first time step after five Picard iterations. We solve the block system with FGMRes with a restart length of 40 iterations and stop when the relative residual drops below 10^{-8}. In the following, we show results for the right update scheme since we are using a right preconditioner.
The model and solvers are implemented in C++ using deal.II 9.4.0 [25, 26] and Trilinos 13.1 [27].The numerical results are obtained on an Intel Xeon Gold 6240 processor with 2.60 GHz and 18 cores and 32 GB RAM.

Initial preconditioners
We start with a comparison of different preconditioners Â^{-1} and Ŝ^{-1} for the block triangular preconditioner in (2) without any low-rank updates. A standard choice for the preconditioner Â^{-1} are multigrid methods. However, due to the advection in our system, algebraic multigrid methods may fail. Furthermore, we require the transposed application of the preconditioner for the setup of the update scheme, which is not defined for multigrid techniques for unsymmetric systems. Hence, we have chosen an incomplete factorization (ilu(0)) from Ifpack, included in Trilinos. We compare this choice with an algebraic multigrid from the ML package in Trilinos, where we use one V-cycle with a Gauss-Seidel smoother and a direct solver on the coarsest level.
Both the SIMPLE and the LSC Schur complement preconditioners in (3) and (4) require the (approximate) solution of Poisson-type systems B D^{-1} B^T with a diagonal matrix D. We compare an inner iterative solver, the conjugate gradient method with a relative (stopping) tolerance of 0.1, with an incomplete Cholesky factorization (ic(0)).

Table 1
Setup and solver times (in seconds) as well as iteration counts for different initial preconditioners Â^{-1} and Ŝ^{-1}.

For the shell geometry, already a few power iterations for the updated LSC preconditioner seem to be sufficient concerning the reduction of iteration counts; a further increase of q has hardly any impact on the iteration counts. The observations for the cube geometry are very different: the number of iterations with the updated LSC preconditioner is significantly reduced for each of the first two power iterations, whereas the updated SIMPLE preconditioner even becomes worse by applying updates. Here, from q = 0 to q = 3, increasing the number of power iterations does not decrease the iteration counts. A larger number of power iterations might be needed in this case but would lead to even higher setup times. It is also interesting to note that the standard deviations of the iteration counts are larger for the cube geometry.
In an attempt to explain the somewhat disappointing results for updated preconditioners on the cube, we take a look at the singular values of the error matrix. Figure 2 compares the decay of the approximate singular values for the shell and the cube. Figure 2a shows the 100 largest approximate singular values of E_{R,α} with α = 1.9 from (9) for the cube using the SIMPLE preconditioner. The singular values decay rather slowly for the cube, which makes it difficult to resolve the dominant subspace of the range. Figure 2b shows that the singular values of the shell system decay faster compared to those of the cube.

Rank and relaxation parameter
Now, we test the update scheme for varying rank r and relaxation parameter α in (11). Figure 3 shows the number of iterations obtained with the updated LSC and SIMPLE preconditioners on the shell and the cube. We illustrate results for low-rank updates obtained both with the randomized method and with the Arnoldi iteration. The rank r is varied from 0 to 60 with step size 5, and the relaxation parameter α is varied from 0.6 to 2.9 with step size 0.1. The white color refers to the number of iterations obtained with a preconditioner without relaxation and update. Green areas denote a reduction of iteration counts and red areas denote an increase in iterations.
Fig. 3 Iteration counts for the Oseen systems on the cube (n_u = 107811, n_p = 4913) and the thin shell (n_u = 78438, n_p = 3474) for varying parameters α, r in (11). For each subfigure, the left image shows results obtained with the randomized low-rank approximation (with q = 3 power iterations and oversampling parameter k = 0) and the right image shows results obtained with the Arnoldi iteration. The white color refers to results obtained without relaxation and update.

We observe that the relaxation parameter has a significant impact on the needed number of iterations. The interval of "optimal" α seems to depend mainly on the initial preconditioner but not on the system (at least for the considered test systems). Furthermore, the range of (nearly) optimal relaxation parameters is relatively large. For the LSC preconditioner, the optimal relaxation parameter is close to 1.0. For the SIMPLE preconditioner, the optimal relaxation parameter lies in the interval [1.6, 1.8] for the shell and in the interval [1.6, 2.3] for the cube for ranks r ≥ 45. In these settings, the low-rank update computed with the randomized method leads to lower iteration counts compared to the Arnoldi method. However, the setup costs of the randomized method with q = 3 power iterations are about six times higher than the setup costs of the update with the Arnoldi method.
Table 3 shows the needed number of iterations as well as setup, solver, and total times for different relaxation parameters and ranks for updates computed with the randomized method and with the Arnoldi iteration. A hyphen denotes that the solver did not converge within 1000 iterations. Here, we do not use power iterations for the randomized method (i.e., q = 0). We observe that the relaxation parameter accelerates convergence. For the coarse shell, the low-rank update decreases iteration counts further but increases the total computational time. For the other systems, the update does not improve the preconditioners further. The updates obtained with the Arnoldi iteration lead to lower iteration counts in most cases and to about 35% lower setup costs.

Outlook
We have derived a right and a left low-rank update for a relaxed Schur complement preconditioner. The update vectors are computed with a (randomized) low-rank approximation of the difference between the identity and the scaled preconditioned approximate Schur complement. (In Table 3, the "setup" columns include only the costs for the update, not those for the initial preconditioner, and k_n denotes the time step length.)
Relaxing the initial preconditioner has a significant impact on the convergence behavior of the iterative solver.The range of (nearly) optimal relaxation parameters seems to depend mainly on the initial preconditioner and is relatively broad.
We have observed that the update can decrease iteration counts, but this is not guaranteed. The efficiency of the update depends on the parameters of the underlying low-rank approximation, the relaxation parameter, and the update rank. Choosing the parameters for the low-rank approximation and the update rank is a trade-off between setup costs and solver times. Low-rank updates computed with the randomized method using a few power iterations can decrease the iteration counts and solver times significantly, but they lead to higher setup costs and also higher total costs than low-rank updates computed with the Arnoldi iteration or with the randomized method without power iterations. The drawback of high setup costs may be somewhat reduced by parallel computing.
Future research concerns the adaptation of the initial preconditioners within the Picard iteration and across time steps. The setup time needed to compute the low-rank update is lower than the setup time required for the initial preconditioners, so it may be beneficial to use a low-rank update instead of setting up a new preconditioner from scratch.

Table 1 (continued) Setup and solver times (in seconds) as well as iteration counts for different initial preconditioners Â^{-1}, Ŝ_LSC^{-1} (3), Ŝ_SIMPLE^{-1} (4) for the block preconditioner (2) for the shell system (n_u = 78438, n_p = 3474) and the cube system (n_u = 107811, n_p = 4913)

Table 2 Results for varying the number of power iterations q with rank r = 20 and oversampling parameter k = 0 for the systems on the shell (n_u = 78438, n_p = 3474) and the cube (n_u = 107811, n_p = 4913)

Fig. 2 The 100 largest singular values of E_{R,α} from (9) for the two domains using the SIMPLE preconditioner

Table 3
Number of needed FGMRes iterations and setup, solver and total times (seconds) for the Oseen systems obtained with q = 0 power iterations and k = 0 oversampling vectors