SIMD vectorization for simultaneous solution of locally varying linear systems with multiple right-hand sides

Developments in numerical simulation of flows and high-performance computing influence one another. More detailed simulation methods create a permanent need for more computational power, while new hardware developments often require changes to the software to exploit new hardware features. This dependency is very pronounced in the case of vector-units which are featured by all modern processors to increase their numerical throughput but require vectorization of the software to be used efficiently. We study the vectorization of a simulation method that exhibits an inherent level of vector-parallelism. This is of particular interest as SIMD operations will hopefully be available with std::simd in a future C++ standard. The simulation method considered here results in the simultaneous solution of multiple sparse linear systems of equations which only differ by their main diagonal and right-hand sides. Such structure arises in the simulation of unsteady flow in turbomachinery by means of a frequency domain approach called harmonic balance.


INTRODUCTION
torization of a simulation method that exhibits an inherent level of vectorparallelism.This is of particular interest as SIMD operations will hopefully be available with std::simd in a future C++ standard.
The simulation method considered here results in the simultaneous solution of multiple sparse linear systems of equations which only differ by their main diagonal and right hand sides.Such structure arises in the simulation of unsteady flow in turbomachinery by means of a frequency domain approach called harmonic balance.

Introduction
For many applications, the simulation of turbomachinery requires the resolution of instationary flow phenomena to adequately predict, e.g., aeroelastic or aeroacustic behavior of a component.A good overview of modeling approaches for turbomachinery can be found in [13].Assuming that the phenomena of interest are periodic in time, the equations can be expressed in the frequency domain using a Fourier series.One approach, called harmonic balance [5] is increasingly adopted in industrial turbomachinery design, as it leads to a reduction in computing times compared to the established Unstead Reynolds Averaged Navier-Stokes (URANS) method by up to two orders of magnitude [3].
In this paper we study the vectorization of this simulation method that exhibits an inherent level of vector-parallelism which results in the simultaneous solution of multiple sparse linear systems of equations which only differ by their main diagonal and right hand sides.
We implement and benchmark these techniques into the Sparse linear system solver library Spliss, a modern HPC library for solving block sparse linear algebra problems that is developed at the German Aerospace Center (DLR); see, e.g., [8].Spliss is especially well-suited to solve the problem at hand, since its modular and templatized structure allows us to build onto its existing performant solvers.
We introduce a new data-type in Spliss, the MultiScalar, which contains a compile-time constant number of aligned scalars.Using this MultiScalar as the scalar type for the diagonal of a matrix and the right-hand side vector of a linear system we can apply Spliss' native linear system solvers to simultaneously solve the multiple equations in a single step.Since iterative solvers like GMRES or SoR repeatedly apply the system matrix to a vector many load operations from different memory levels may be necessary.With our approach, we only need to load the off-diagonal values of the matrix once instead of multiple times.In combination with suited SIMD vectorization of the required operations our method achieves a significant speed-up.
We compare two approaches to implement MultiScalars, one that relies on the compiler to insert the vector instructions and one which leverages the Vc [6] library.This is of particular interest as SIMD operations will hopefully be available with std::simd in a future C++ standard.We carry out thorough benchmark scenarios comparing the different approaches for various block-sizes and provide a recommendation on when to use which of the two approaches.
The paper is structured as follows.In Section 2, we first introduce the considered model problem.In Section 3, we give a brief introduction into current HPC architectures and present our particular implementations.In Section 4, we present numerical results considering vectorization ratio, timings, and Flops.Eventually, we draw a conclusion in Section 5.

Model problem
Computational fluid dynamics is an essential tool in the design of all types of fluid machinery.Of particular interest are the flows through turbomachinery, which convert fluid energy into rotation of a shaft and vice versa.This class of machines contains pumps, compressors, turbines, wind turbines, jet engines and gas turbines.The flow through these machines with all unsteady phenomena is fully described by the Navier-Stokes equations.However, numerically resolving all turbulent scales for the flows under consideration, direct numerical simulation (DNS), requires far too much computational effort to be affordable inside a design loop.
A common remedy is the temporal averaging of the equations and modeling the effect of all unsteady phenomena on the time-averaged flow.This approach is called Reynolds-averaged Navier-Stokes (RANS) and allows to exploit rotational symmetry and to reduce mesh resolutions to manageable magnitudes.The RANS equations have been the workhorse behind the design of most of today's turbomachinery.However their inherent negligence of the unsteady interaction of rotating and stationary turbomachinery components renders it infeasible for many types of aeroelasticity considerations.
This can be overcome by including larger unsteady flow phenomena into the computation by means of the unsteady RANS (URANS) method.While URANS captures unsteady interactions between components, it comes at the cost of no longer allowing to assume rotational symmetry of the flow field.Furthermore URANS computations exhibit a transient phase, where the initially assumed flow develops into a flow that satisfies the URANS equations.Having to simulate the full annulus until the flow converges to the final unsteady flow makes the URANS method about two orders of magnitude more expensive than RANS computations.
The fact that the interactions between rotating and stationary parts are periodic in time gives rise to methods that only model the time-periodic be-havior.Consequently, an approach in between RANS and URANS is to solve the equations in the frequency domain and to restrict the unsteadiness to a selected number of base frequencies and harmonics thereof.
One approach for frequency-domain simulation in turbomachinery which allows for the nonlinear interaction between the mean flow and the harmonics is the harmonic balance approach [5].
After spatial discretization, e.g. through a finite-volume discretization, the semi-discrete Navier-Stokes equations take the form where R denotes the balance of fluxes and sources for the conservative flow state q = (ρ, ρu, ρv, ρw, ρE) comprising density ρ, u, v, w, the momentum in the three spatial directions, and internal energy density E. The periodic behavior of this flow state can be approximated by taking the real part of a finite Fourier series of a base frequency ω and multiple non-negative harmonics qk (2. 2) The residual R(q) can be transformed analogously.The time derivative transforms to a multiplication with ikω in the frequency domain which yields a system of equations in the frequency domain This system can be solved by pseudo-time stepping, similar to the steady RANS problem, albeit in the frequency domain [10] with a pseudo-time τ , i.e., This set of equations is solved by an Euler backward approach for stability reasons.For the determination of the solution update ∆q = qm+1 − qm , implicit pseudo-time stepping requires the solution of a linear system of equations A∆q = − R(q). (2.5) Note, that due to the nonlinearity of R, the pseudo-time operator would couple all harmonics and therefore would contain K 2 coupled linear systems of equations.Each system is of size where N c denotes the number of grid cells and N d the degrees of freedom per grid cell.In the case of finite-volume discretizations and when turbulence models are solved in a loosely coupled manner N d = 5, where N d is the number of physical variables describing the flow state.Larger values of N d would arise in a Discontinuous Galerkin (DG) discretization.For details on DG, we refer to, e.g., [2,12].
To avoid the quadratic growth of the linear system with K, the linearization may be based on the zeroth harmonic, yielding where I is the identity matrix.This is a sequence of one real-valued and K complex-valued linear systems of size In practical implementations the systems are represented as a sparse N c × N c matrix of dense blocks each being N d × N d in size.Note that only the entries on the main diagonal itself are complex-valued.However the whole dense N d ×N d blocks on the diagonal are stored as complex.This is done for the ease of implementing block diagonal preconditioners such as occuring in Jacobi iterations or successive overrelaxation (SSOR).
These systems share their off-diagonal entries and differ only in the main diagonal and right hand sides where D designates the complex-valued diagonal matrix and J the linearization of the residual.This structure lends itself to a vectorized solution method, since the main diagonal, forming the majority of data, would only have to be transported once over the memory bus instead of K + 1 times in the sequential approach.
In many CFD-related applications, we find typical block or stencil sizes of, e.g., 5 or 7 which do not lend themselves to SIMD operations naturally; see also the general introduction into computer architecture in the next section.The vectorization over multiple linear systems is thus an attractive alternative.
3 Computer architecture and SIMD vectorization

General introduction
Today's supercomputers feature multiple levels of parallelism; see, e.g., [4] for more details.On the highest level, multiple compute nodes communicate through a network.Each node consists of one or multiple CPU sockets, each with multiple cores.A node may also contain further accelerators such as GPUs or vector processors.Typically, the hardware in one node shares a memory address space even though there are multiple physical memories with different access speeds when accessed from different parts of the node (NUMA).A hierarchy of caches helps to bridge the gap between relatively slow main memory compared to the high performance of current processing units.On the lowest level, each CPU core (similar also for accelerators) consists of a pipeline of units that execute the desired instructions.One unit usually completes one instruction every cycle but needs multiple cycles to process it (latency).Similar to GPUs and vector processors, the CPU units allow to perform the same operation with multiple elements of data (Single Instruction Multiple Data: SIMD).Most Current CPU architectures have floating-point units with a SIMD width of 256 or 512 bit and allow to calculate one fused-multiply-add (FMA) instruction with vectors of 4 or 8 double-precision numbers respectively 8 or 16 single-precision numbers.CPUs for servers / HPC systems usually have 2 FMA units per core (superscalarity).To fully leverage the performance of one CPU core, one thus needs about hundred to several hundreds of independent floating-point operations to fill the pipeline.
There are some additional constraints for using SIMD instructions: ideally, the data should be stored in a contiguous array that starts at an aligned memory address (an address that is a multiple of the SIMD width).In addition, the code should access independent, consecutive chunks of data of size of the SIMD width.If the array length is not a multiple of the SIMD width, a remainder loop (without SIMD operations) is needed or a special masked SIMD operation must be added for the last few elements.Therefore, it is common practice to insert some zeros to obtain a data layout that allows better SIMD processing (padding) as the compiler can usually not adjust the data layout (which is used across multiple files and in external interfaces).It is then the task of the compiler optimization to rearrange the floating-point operations in such a way that they can be transformed into independent SIMD instructions.However, due to the inherent complexity of the required code transformations and due to possible pointer aliasing, compilers may not generate optimal SIMD instructions for a given algorithm.Therefore, SIMD-libraries like Vc [6] provide a suitable abstraction level to use SIMD features in a portable way across different CPU instruction sets.
This work focuses on the node-level performance.To ease the performance analysis, we consider the Roofline performance model [16] which states that the performance is either limited by the maximal in-core performance P max or by data transfers.The maximal in-core performance depends on the mix of operations (and possible dependencies between them) and assumes that all required data is readily available in the nearest cache level.The data transfers are characterized through the main memory bandwidth b m , respectively the bandwidth of the slowest data path used.Depending on the algorithm, the computational intensity I c may change.The computational intensity specifies the number of (floating-point) operations per transferred byte.Combining these definitions, we obtain the Roofline performance: If the data transfers are the limiting factor, the algorithm is called memorybound.If, in contrast, the (floating-point) operations are the limiting factor, the algorithm is called compute-bound (or core-bound).Characteristic values for the peak memory bandwidth and the peak performance are shown in Table 3 Gold 6123 that is used for the numerical experiments measured using likwid-bench [15].We use an axpy memory benchmark with an array size of 1 GB instead of the usual STREAM [16] benchmark as it better reflects the memory access pattern of our implementation.The hardware also supports AVX512 instructions (512 bit SIMD width instead of 256 bit with AVX2) but we do not use them in this paper.

Realization in Sparse Linear System Solver library
The Sparse Linear System Solver library Spliss is a novel block sparse linear library that is developed for large scale CFD simulations; see [8].Spliss is currently used in modern HPC CFD solver frameworks in aerospace and engineering, such as CODA [7] and TRACE [3].Spliss is designed as a modern C++ library and relies heavily on templatization.This allows for a decoupling of abstract linear solvers and matrix format implementations on the one hand and concrete data types on the other hand.Spliss employs distributed and shared memory parallelization.The former is realized via an internal abstraction layer which allows the usage of either two-sided MPI [11] or one-sided GASPI [1] communication as backend.
For shared memory parallelization Spliss uses the Alpaka (abstraction library for parallel kernels) framework [9], a performance portable, platform independent abstraction layer that allows using multiple (possibly different) accelerators concurrently.This enables Spliss to use, for example, OpenMP threads or CUDA for NVidia GPUs with the same high-level implementation.
Spliss implements a collection of common sparse linear system solvers such as CG and GMRES and allows for the application of preconditioners such as (Block)-Gauss-Seidel or SOR.It supports several sparse (block) matrix formats with either fixed or varying block size.Due to templatization these matrix formats and algorithms can operate on any arithmetic data type, such as double, float, complex or a user defined data type.

Realization of SIMD operations on multiple scalars
To efficiently execute SIMD operations on multiple scalars, we introduce the MultiScalar object as a custom data type which contains a compile-time constant number of aligned scalars.We have implemented two different realizations of MultiScalars, our naive implementation uses a member which is an array of scalars while the Vc-based implementation derives from Vc::SimdArray.
Together with the MultiScalar object itself, we need a mask object which is of the size of the MultiScalar and allows the comparison of MultiScalars.
The concept of the naive MultiScalar implementation is given in Fig. 3.1 (left).The Vc-based MultiScalar is only slightly more complex.In particular, it needs some additional lines of code for the definition of a corresponding  MaskType.In Fig. 3.1 (right), we present our VcAuto implementation which lets Vc decide on the used vector length and which does not automatically introduce padding.For general padding, we would need to introduce another constant value validEntries.This number can then differ from Size which would be the size of the MultiScalar with padding.We present the realizations of simple MultiScalar comparison or addition operators in the appendix.
For the case of complex diagonals, we extend the concept of MultiScalar to ComplexMultiScalar.Since many operations on complex numbers (e.g., comparisons or additions) are based on a separate handling of real and imaginary parts, we use a Complex object which then holds two MultiScalars; one for the real and one for the imaginary part, each purple box representing one scalar value; see Fig.

Mixed scalar type matrices
Owing to the application of harmonic balance problems, where, e.g., the diagonal blocks of a matrix can be complex-valued while the off-diagonal blocks only contain real-valued entries, Spliss offers the so called CompositeMatrix.
For such a matrix A, we define A C as the set of (diagonal) matrix blocks that are complex-valued and A R as the set of (off-diagonal) blocks that are realvalued.Then Ax = A C x+A R x and the matrix is split up into multiple separate parts, each comprising a homogeneous datatype.This allows for a very flexible and memory efficient assembly of mixed data type matrices, or even matrices featuring additional matrix-free operators.For the sake of simplicity, we assume in our description that A C and A R are stored in a meaningful way such that the above matrix-vector-products can be executed.
A very simple CompositeMatrix with only one diagonal (in purple) that can have a different data type than the green off-diagonal blocks is shown in Fig. 3.3 (left).The more relevant use case with multiple diagonals (i.e., MultiScalars on the diagonal) that have a different data type than the offdiagonal blocks can be found in Fig. 3

.3 (right).
The downside of the current implementation is that the vectors have to be loaded from memory once for the diagonal and once for the off-diagonal part.Nevertheless, due to the focus on block matrices, this effect becomes smaller with larger block sizes, since the matrix (which is loaded once) requires O(N c N 2 d ) data transfers whereas the vector only requires O(N c N d ) transfers.In this work we use the CompositeMatrix to define (Complex)MultiScalar entries on the diagonal and real-valued entries on the off-diagonal but other choices are also possible.The application of real-valued off-diagonal blocks on either complex-or real-valued right hand sides then saves memory transfers and operations since it avoids storing (and calculating with) zeros for the imaginary part of the matrix entries.

Numerical results
We consider a variety of different systems to test the performance of our implementations.All test cases are based on CompositeMatrices with a maximum of seven nonzero blocks per row, i.e., up to six off-diagonal blocks (three left and three right of the diagonal with a distance of 1, 10, and 100 blocks).All these blocks (diagonal and off-diagonal) are fully dense and the whole system matrix is (block) sparse.Off-diagonal blocks are always real-valued.For the diagonal blocks, we consider either real or complex entries.We consider examples with single diagonals as well as with MultiScalar or ComplexMultiScalar diagonals of different size.
We consider two edge cases with either small 5×5 or large 240×240 blocks.In case of small blocks, we consider a matrix with 5 * 1.5 million rows.This leads to 37.5 million nonzeros for the diagonal blocks and 225 million nonzeros in the off-diagonal blocks of the matrix.For the larger block size, we consider a matrix with 240 * 700 rows leading to 40.3 million nonzeros for the diagonal blocks and 229 million nonzeros in the off-diagonal blocks.
The size of the matrices in the memory of the different tests differs by the data type of the diagonal as well as the precision used.We consider test cases with single precision (denoted FP32) or double precision (denoted FP64).
Let us provide an example on the data size for the case of 5 × 5 blocks, four complex diagonals and double precision.Using a ComplexMultiScalar of size 4, the required storage for the diagonal blocks is then given by where N N Z denotes the number of nonzero entries stored in the diagonal, N bd is the size in bytes of one double, N M S specifies the number of entries in the Multiscalar and N RC is a multiplier to distinguish between real-and complex-valued entries.Accordingly the off-diagonal needs bytes.Four complex double-precision vectors result in We use gcc 10.2 and Vc 1.4.1.We perform numerical tests on a single socket with the hardware characteristics shown in Table 3.1.We do not use AVX512 SIMD instructions as AVX-512 will only be available with Vc 2 1 .In addition, the CPU reduces the frequency when AVX512 instructions are executed.This makes AVX512 instructions not always beneficial and complicates the performance analysis.We divide the numerical results section into four different sections: Bandwidth saturation: In Section 4.1, we briefly discuss the compute intensity of our application and show that we obtain saturating behavior.Vectorization behavior: In Section 4.2, we show the vectorization behavior of the different implementations.Timings and Flops/s: In Section 4.3, we consider timings and Flops per second for different numbers of diagonals and benchmarks for the different implementations.Realistic Example: In Section 4.4, we validate the findings using a linear matrix obtained from the CFD solver TRACE within a harmonic balance context.

Bandwidth saturation
For the matrix with 5 × 5 blocks and 1.5 • 5 million rows in complex arithmetic, we need Therefore, we obtain the compute intensities for the block-diagonal part and the off-diagonal parts of the computation.For bigger blocks and more vectors, the compute intensity increases (e.g. to ∼ 5.7 [Flop/Byte] for the off-diagonal part of the matrix with 240 × 240 blocks and 8 vectors).From Table 3.1, the machine intensity is Thus, for the Roofline performance model all variants are memory-bound.However, for the cases with multiple diagonals the compute intensity is close enough to the machine balance that the SIMD vectorization affects the performance.This is the regime (compute intensity close to the machine intensity) where the Roofline model is too optimistic in the sense that it assumes a perfect overlap of data transfers and computations; more sophisticated performance models like the Execution-Cache-Memory (ECM) model [14] could predict the performance more accurately but we will focus on the generic implementation and SIMD vectorization here.
In the following, we show that our model problems still feature bandwidthsaturating behavior.We compute the MatVec benchmark for the model problems with 5x5 and 240x240 blocks on the diagonal.The MatVec benchmark conducts one matrix-vector-product with both the block-diagonal and the offdiagonal parts.We run this with MultiScalars of either 1, 4, or 8 scalars on these diagonals on 1 to 14 cores; cf.Fig. 4.1.All benchmarks are executed on the machine depicted in Table 3.1.We compare the bandwidth to the bandwidth obtained with the AXPY benchmark in LIKWID [15].We chose the AXPY benchmark as reference as it has a similar load-to-store ratio.All computations update the vector (no nontemporal stores).
We can see from Figure 4.1 that all cases with VcAuto achieve a high fraction (> 75%) of the peak bandwidth and that they feature saturating behavior.Nevertheless, most cases are not completely saturated (more cores could further increase the performance), especially the case with 240 × 240 blocks and 1 diagonal scales almost linearly with the number of cores.As the cases with 240 × 240 blocks and more diagonals achieve a higher bandwidth even though they have a higher compute intensity, this indicates sub-optimal compiler optimization.The less degressive scaling of the implementation without Vc in comparison already indicates a better utilization of the compute performance when using Vc.This observation will be further investigated in the following sections.

Vectorization behavior
In this section, we compare the naive Multiscalar implementation, denoted noVc, with the padded implementation, denoted VcPad, as well as with the implementation, where potential padding is decided by Vc, denoted VcAuto.Finally, we also provide compiler-obtained vectorization for a sequential solution of the systems with different diagonals.Note that padding of, e.g., a MultiScalar of size 3 to size 4 can help to make the code more suitable for SIMD vectorization.However, it comes at the cost of storing and transferring additional zeros (4/3 of the original date transfers required for the vectors).
As expected, we see from Fig. 4.2 that vectorization of VcPad is always at 256 bit since MultiScalars are padded to this size.For VcAuto, we observe that practically no padding is used.For instance, in all cases, we see that vectorization for VcAuto MultiScalar of size 7 is done with one third to lengths 64, 128, and 256 bits.For Vc-based implementations, we see that vectorization does not depend on optimization flags (see left column for O2 and right column for O3 in Fig. 4.2).We explicitly consider both O2 and O3 optimization, as it is common for scientific and engineering codes, to enable only the O2 optimization level.
For sequential solutions as well as the naive noVc implementation, we do not see any vectorization by the compiler with O2 flag; see left column in Fig. 4.2.We see modest vectorization with O3 flag but this is far below the vectorization as achieved by Vc-based implementations; see right column in Fig. 4.2.

Timings and Flops
The results on timings and Flops can be divided into two different categories.In Fig. 4.3 and Fig. 4.4, we consider the MatVec benchmark for one up to 16 diagonals and 5x5 or 240x240 sized blocks, respectively.Results are presented for O2 and O3 optimization, single and double precision as well as realand complex-valued diagonals.In Fig. 4.5-Fig.4.8, we consider five different benchmarks for different block sizes and different precision.We consider O3 optimization and real-as well as complex-valued diagonals.
In Fig. 4.3 and Fig. 4.4, we see that different optimization flags make a huge difference for the naive MultiScalar implementation while for single systems as well as Vc-based implementations the optimization gain is smaller.For Vc, O2 optimization already yields best performance in some cases.
For Vc-based implementations, we see that the best performance is obtained for the number of diagonals that fits a multiple 256-byte width, i.e., eight or 16 for single precision and four or eight (in some cases also 12 or 16) for double precision.
We generally see that the padded Vc-MultiScalar behaves suboptimally for systems with a small number of diagonals.This is due to the relatively large padding to four (double precision) or eight (single precision) diagonals.Here, the Vc-MultiScalar, where vector widths are derived automatically (VcAuto), yields much better results.However, for a badly chosen number of diagonals (i.e., seven for double precision), VcAuto conducts three SIMD operation of length 256, 128, and 64 bytes instead of two 256-bytes operations for VcPad; see also Fig. 4.2).
For a small number of diagonals, VcAuto and the naive noVc approach perform similarly well.In case of small block sizes (5x5), the noVc performs best for double precision and more than four diagonals (except eight).On the other hand, for eight double precision diagonals VcAuto still performs best on small block sizes (5x5).For single precision, independent of the block size, or double precision and large block sizes (240x240) the picture is clear.Here, VcAuto performs best for all multiples of 256-bytes widths.
In Fig. 4.5-Fig.4.8, we consider five different benchmarks.Besides the previous MatVec benchmark, we test a colored matrix vector product ColMatVec with three different colors.Furthermore, we test a block Jacobi BlockJac and block Gauss-Seidel BlockGS iteration scheme, where the preconditioner is computed as the LU decomposition of the block diagonal matrix.In the resulting timings and Flops, we only consider the iteration scheme, not the set up of the preconditioner.Finally, we also test a GMRES iteration scheme.For all three iterative schemes, a modest number of iterations smaller 10 is conducted.As we do not want to compare the different benchmarks (e.g., BlockJac vs. GMRES), the number of iterations is not important.
The Flop count that we present only includes intended Flops that we need for the corresponding result.That means, that an addition of two MultiScalars with one value each padded up to four values will only result in one Flop, not  in four.Consequently, the GFlops/s obtained with VcPad are low for small numbers of diagonals.
Except for the case of 12 complex diagonals in double precision and small 5x5 blocks, the VcAuto MultiScalar implementation always achieve the most Flops per second (or are within a range of some percent of the best performance).

Realistic example from the CFD solver TRACE
To validate the benchmark results, we tested the MultiScalar implementation in a realistic use case.Fig. 4.9 shows the sparse matrix structure extracted from the CFD solver TRACE [3] for the simulation of a transsonic compressor fan.The discretization results from a structured mesh consisting of about one million finite volume cells.Each entry of the sparse matrix consists of a dense 5x5 matrix block.A harmonic balance solution which features N higher harmonics will require N additional equally structured linear matrices with a complex diagonal to be solved.As before, we solve these systems using different approaches.First, we use the classical approach without MultiScalars and solve  the systems sequentially.Second and third, we compare the performance of our own naive MultiScalar implementation against one implementation using the Vc library.In TRACE, typically, a colored block Gauss-Seidel approach is used with a fixed number of iterations.The library Spliss inverts the 5x5 blocks on the diagonal using an LU decomposition.To exclude effects of distributed MPI parallelization, the system is solved on a single node comprising 32 cores (System specification: Sky Lake Xeon(R) Silver 4216).The results are shown in Fig. 4.9 for single precision, double precision and different numbers of higher harmonics considered, i.e., N ∈ {4, 5, 7, 8}.To allow for comparison, we used a fixed number of 200 iterations.Let us note that all systems represented by one MultiScalar are always computed together.The iteration process can then be stopped if either one or all approximations have reached the convergence criterion.This means that, under certain conditions, some systems are solved unnecessarily precise.Such kind of algorithmic losses are not considered here.
For the case of sequentially solved systems a proportional increase in computing time can be observed as expected.For double precision the computing time is about twice the amount as for single precision.For either four or eight entries of a MultiScalar, the Vc-based implementation yields an improvement of about a factor of two.For odd numbers which do not fully fill a SIMD register, the own (naive) implementation relying on the compiler is faster in some cases.For numbers filling a SIMD register or being a multiple of it, the naive implementation is slower.Compared to sequential solution of the systems, the use of Vc-based implementation of the MultiScalar always reduces the computational time significantly.

Conclusion
In this paper we presented three implementations of MultiScalars.The first implementation is a naive C++ implementation while the second and third one make use of the Vc library [6].Hereby, we conduct the parallel solution of linear systems, which only differ for a limited number of matrix entries.These systems may naturally arise from computational fluid dynamics problems as described in Section 2. The parallel solution of these systems is conducted using SIMD operations allowing the concurrent processing of, e.g., four double precision numbers.
Since our model problems are memory-bound, we see that we benefit from lower memory transfer using our CompositeMatrices.However, we also benefit from SIMD parallelism realized on the diagonal blocks of the matrices.We observe limits on default compiler vectorization for naive MultiScalar implementations (denoted noVc); on the other hand, we see that Vc-based implementations (denoted VcPad and VcAuto) vectorize well; see Fig. 4.2.Insights into the different implementations is given in Fig. 3.1.However, our Vc-based implementations (VcPad and VcAuto) do not perform best for all use cases.As vectorization is well achieved with these implementations, the number of systems to be solved in parallel should be chosen carefully.For instance, for 256 bit sized SIMD registers and double precision systems, our padded Vcimplementation always solves multiples of four systems.This means that for just two systems to be solved, two systems are solved in padding.The VcAuto implementation would then only vectorize with length 128 bit and solve both systems without overhead.However, this implementation is disadvantageous if seven double precision systems need to be solved.Then vectorization is done in 256, 128, and 64 bit size; see, e.g., Fig. 4.2.
Chosing the number of systems to be solved in parallel in accordance with the SIMD width can lead to a substantial reduction of computation time.As expected, the solution of systems in parallel is substantially faster than sequentially solving these systems.Additionally, the Vc-based implementations also outperform the naive MultiScalar implementation considerably; see Fig- For a realistic, memory-bound example with small-sized 5x5 diagonal blocks, we finally achieve a speedup of factor two compared to a sequential solution of four or eight systems.We also obtain a significant reduction in computation time by using the Vc-based MultiScalar implementations against the presented naive implementation.Further advantages of SIMD execution could result from lower total energy consumption.However, this was not measured and is subject to future research.

Fig. 3 . 2
Fig. 3.2 ComplexMultiScalar composed out of two MultiScalar s, one for the real-and one for the complex-valued entries.

Fig. 3 . 3
Fig. 3.3 Matrix-vector-multiplication of a CompositeMatrix with a single diagonal (left) and a MultiScalar with four entries on the diagonal (right).

Fig. 4 . 1
Fig. 4.1 Memory bandwidth of the MatVec-benchmark measured with LIKWID for 1, 4, and 8 diagonals with dense 5x5 or 240x240 blocks of complex entries in double precision.VcAuto uses the Vc-based MultiScalar where Vc decides on the used vector lengths.noVc also solves all systems simultaneously but only uses a naive array-based MultiScalar implementation.

Fig. 4 . 2
Fig. 4.2 Vectorization ratios for -O2 (left column) and -O3 optimization (right column) for 5x5 (top row) and 240x240 (bottom row) blocks of complex entries on the diagonal.Seq.systems corresponds to a sequential solve of the systems.VcPad uses a padded Vcbased MultiScalar on the diagonals to solve all systems simultaneously.Similarly, VcAuto uses a Vc-based MultiScalar where Vc decides on the used vector lengths.noVc also solves all systems simultaneously but only uses a naive array-based MultiScalar implementation.Dotted lines correspond to scalar execution, dashed lines represent the share of FP 128 vectorization and solid lines FP 256 vectorization.FP 512 is not used.

Fig. 4 . 3
Fig. 4.3 MatVec Benchmark timings for diagonals with dense 5x5 blocks of real (top row) and complex (bottom row) entries and single (left) and double (right) precision.Dashed lines represent execution with -O2 optimization, solid lines represent execution with -O3 optimization.Other notation as in Fig. 4.2.

Fig. 4 . 4
Fig. 4.4 MatVec Benchmark timings for diagonals with dense 240x240 blocks of real (top row) and complex (bottom row) entries and single (left) and double (right) precision.Dashed lines represent execution with -O2 optimization, solid lines represent execution with -O3 optimization.Other notation as in Fig. 4.2.

Fig. 4 . 5
Fig. 4.5 Obtained performance with different benchmarks for 4, 8, 12, and 16 diagonals with dense 5x5 blocks of real entries in single precision.Other notation as in Fig. 4.2.

Fig. 4 . 6
Fig. 4.6 Obtained performance with different benchmarks for 1, 4, 8, and 12 diagonals with dense 5x5 blocks of complex entries in double precision.Other notation as in Fig. 4.2.

Fig. 4 . 7
Fig. 4.7 Obtained performance with different benchmarks for 4, 8, 12, and 16 diagonals with dense 240x240 blocks of real entries in single precision.Other notation as in Fig. 4.2.

Fig. 4 . 8
Fig. 4.8 Obtained performance with different benchmarks for 1, 4, 8, and 12 diagonals with dense 240x240 blocks of complex entries in double precision.Other notation as in Fig. 4.2.

Fig. 4 . 9
Fig. 4.9 Left: The structure of the linear system Matrix.Right: Results for four different numbers of higher harmonics and double/single precision.