A methodology for speeding up matrix vector multiplication for single/multi-core architectures
Abstract
In this paper, a new methodology for computing the Dense Matrix Vector Multiplication, for both embedded (processors without SIMD unit) and general purpose processors (single and multi-core processors, with SIMD unit), is presented. This methodology achieves higher execution speed than the ATLAS state-of-the-art library (speedups from 1.2 up to 1.45). This is achieved by fully exploiting the combination of the software (e.g., data reuse) and hardware parameters (e.g., data cache associativity), which are considered simultaneously as one problem and not separately, giving a smaller search space and high-quality solutions. The proposed methodology produces a different schedule for different values of the (i) number of the levels of data cache; (ii) data cache sizes; (iii) data cache associativities; (iv) data cache and main memory latencies; (v) data array layout of the matrix and (vi) number of cores.
Keywords
Matrix vector multiplication · Data cache · Cache associativity · Multi-core · SIMD · Memory management · Data reuse

1 Introduction
The Dense Matrix Vector Multiplication (MVM) is an important kernel in the most varied domains and application areas. It constitutes the kernel operation for computations in signal and image processing, computer science, natural science, engineering, economics and everyday graphical visualizations. The MVM software kernel runs every day on a huge number of processors all over the world, from small embedded microcontrollers to big supercomputers. Hence, the need for a fast MVM software implementation arises. Moreover, by optimizing MVM software, we can substitute a big and expensive CPU with a smaller and cheaper one, achieving the same performance. The complexity of MVM is \(O(M \times N)\) and its execution time highly depends on the memory hierarchy utilization.
The separate optimization of the following sub-problems, i.e., the sub-problems of finding the schedules with the minimum numbers of (i) load/store instructions; (ii) addressing instructions; (iii) L1 data cache accesses; (iv) L2 data cache accesses and (v) main memory data accesses, cannot give an optimum solution in most cases; this is because the separate sub-problems optimization gives a different schedule for each sub-problem and these schedules cannot coexist (refining one degrades the other). In this paper, these sub-problems are optimized as one problem and not separately and the final schedule is found theoretically according to the memory hierarchy parameters. Furthermore, by utilizing the software (e.g., data reuse, data dependences) and hardware parameters (e.g., data cache size and associativity), the exploration space is decreased by orders of magnitude and therefore instead of searching the whole space, only a small number of solutions is tested.
Regarding multi-core architectures, we theoretically and experimentally show that MVM performance is restricted by main memory latency and bandwidth values and thus a high core utilization factor cannot be achieved.
The state-of-the-art (SOA) self/hand-tuning libraries for linear algebra and Fast Fourier Transform (FFT) algorithm, such as ATLAS [1], OpenBLAS [2], GotoBLAS2 [3], Eigen [4], Intel_MKL [5], PHiPAC [6], FFTW [7], and SPIRAL [8], manage to find a near-optimum binary code for a specific application using a large exploration space (many different executables are tested and the fastest is picked). This is because the development of a self-tuning library is a difficult and time-consuming task, since (i) many parameters have to be taken into account, such as the number of the levels of tiling, tile sizes, loop unroll depth, software pipelining strategies, register allocation, code generation, data reuse, loop transformations, and (ii) the optimum parameters for two slightly different architectures normally are different. Such a case is MVM, which is a major kernel in linear algebra and also the topic of this paper.
The aforementioned SOA libraries optimize all the above parameters separately by using heuristics and empirical techniques. The independent optimization of the back end compiler phases (e.g., transformations, register allocation) leads to inefficient binary code due to dependencies among them. These dependencies require that all phases be optimized together as one problem and not separately. Toward this, much research has been done, especially in the past, either to simultaneously optimize only two phases, e.g., register allocation and instruction scheduling (the problem is known to be NP-complete) [9, 10] or to apply predictive heuristics [11, 12]; nowadays, compilers and related works apply (i) iterative compilation techniques [13, 14, 15, 16], (ii) machine learning compilation techniques to restrict the configurations’ search space [17, 18, 19, 20], or (iii) both iterative compilation and machine learning compilation techniques [21, 22]. A predictive heuristic tries to determine a priori whether or not applying a particular optimization will be beneficial, while in iterative compilation, a large number of different versions of the program are generated and executed by applying transformations, and the fastest version is selected; iterative compilation provides good results, but requires extremely long compilation times and an initial training phase lasting for days or weeks. This fundamental problem, i.e., optimization of all compiler phases together, has not been solved, resulting in low-quality solutions. As is well known, splitting a problem into parts and optimizing them separately does not lead to the optimum solution; in fact, it is far from the optimum. However, for this algorithm, the proposed methodology optimizes the major hardware (e.g., the size of each level of data cache) and software (e.g., data reuse) parameters simultaneously as one problem and not separately, leading to high-quality solutions.
The major contributions of this paper are: (i) the introduction of a new MVM methodology which is faster than the ATLAS SOA library; (ii) the optimization is done by fully exploiting the major software and hardware parameters as one problem and not separately, giving a smaller search space and high-quality solutions; (iii) for the first time the memory hierarchy architecture details (e.g., data cache associativity and latency) are fully exploited for this algorithm; (iv) the proposed methodology, due to the major contribution of number (ii) above gives a smaller code size and a smaller compilation time, as it does not test a large number of alternative schedules, as the ATLAS library does.
The proposed methodology is evaluated on desktop personal computers, i.e., Intel Core 2 Duo and Core i7, by using the SIMD (Single Instruction Multiple Data) unit; the proposed methodology is compared with the \(cblas\_sgemv\) routine of ATLAS, which runs on general purpose processors only. Also, the Valgrind tool [23] and the SimpleScalar simulator [24] are used to measure the total number of instructions executed and the number of L1 and L2 data cache accesses and misses.
The remainder of this paper is organized as follows. In Sect. 2, the related work is given. The proposed methodology is presented in Sect. 3. In Sect. 4, experimental results are presented, and Sect. 5 is dedicated to conclusions.
2 Related work
One of the state-of-the-art libraries for linear algebra is ATLAS [1, 25, 26, 27, 28, 29]; ATLAS is an implementation of a new style of high performance software production/maintenance called Automated Empirical Optimization of Software (AEOS). In an AEOS enabled library, many different ways of performing a given kernel operation are supplied, and timers are used to empirically determine which implementation is best for a given architectural platform. ATLAS uses two techniques for supplying different implementations of kernel operations: multiple implementation and code generation.
During the installation of ATLAS, on the one hand an extremely complex empirical tuning step is required, and on the other hand a large number of compiler options are used, both of which are not included in the scope of this paper. Although ATLAS is one of the SOA libraries for the MVM algorithm, its techniques for supplying different implementations of kernel operations concerning memory management are empirical and hence it does not provide a methodology for it. Moreover, for the ATLAS implementation and tuning, there was access to a wide range of hardware platforms, such as the G4, G5, CoreDuo, and Core2Duo by Apple and the UltraSPARC III platform, which ATLAS exploited. Also, the proposed methodology lies at a higher level of abstraction than ATLAS, because the main features of ATLAS are on the one hand the extremely complex empirical tuning step that is required and on the other hand the large number of compiler options that are used. These two features are beyond the scope of the proposed methodology, which is mainly focused on memory utilization. Although the proposed methodology is written in C language using SSE intrinsics, it achieves a speedup of 1.3 over ATLAS for one core; if the proposed methodology were implemented in assembly language, a higher speedup would be achieved (the proposed methodology is at a high level and the use of assembly is beyond the scope of this paper).
To our knowledge, there are only a few research works on optimizing dense MVM software: [30, 31, 32, 33]. [30, 31, 33] are MVM implementations on GPU architectures, while [32] describes a parallel MVM algorithm. However, many works exist on optimizing sparse MVM on cluster architectures and GPUs (sparse MVM achieves lower complexity), such as [34, 35, 36]. In contrast to the proposed methodology, the above works find the final MVM scheduling and the tile sizes by searching, since they do not exploit all the hardware and software constraints. However, if these constraints are fully exploited, the optimum solution can be found by testing only a small number of solutions; in this paper, the tile sizes are given by inequalities which contain the cache sizes and cache associativities.
Regarding multi-core architectures, the OpenMP programming model is used, e.g., in [37], several data decomposition schemes using the OpenMP programming model are examined showing that both row-wise and block-wise decomposition achieve the best overall performance.
Regarding cluster architectures, the k-NUMA methodology [38] optimizes the dense MVM problem in clusters of symmetric multiprocessor (SMP) nodes, based on the partial development of optimal algorithms. Furthermore, [39] shows that the optimal performance can be achieved by combining distributed-memory and shared-memory programming models (MPI and OpenMP approaches, respectively) instead of MPI only. The importance of combining OpenMP with MPI is also indicated in [40], which presents the MultiCore SUPort (MCSup) OpenMP add-on library to co-locate threads with the memory they are using so as to avoid remote memory accesses.
The proposed methodology is compared only with the ATLAS library because (i) it is well known that ATLAS is one of the state-of-the-art libraries; (ii) the above related works are oriented toward specific architectures only; (iii) they do not propose any method that can be used to make our methodology more efficient; (iv) they exploit only a small number of hardware and software parameters; (v) they exploit the hardware and software parameters separately, and (vi) the number of main memory and data cache accesses of the proposed methodology is smaller; given that the memory management problem is the performance-critical parameter here, we can deduce that the proposed methodology achieves higher performance.
Finally, [41] gives some basic background knowledge on optimizing MVM, while [42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52] present how hardware and software can work on scalable multi-processor systems for matrix algorithms.
3 Proposed methodology
This paper presents a new methodology for computing the matrix vector multiplication (MVM) in one and more cores using a shared cache. This methodology achieves high execution speed by fully and simultaneously exploiting the combination of production–consumption, data reuse and MVM parallelism, under the hardware constraints of (i) the number of cores; (ii) the number of levels of data cache hierarchy; (iii) the size of each level of data cache hierarchy; (iv) the number of CPU registers and the size and the number of SIMD registers (XMM/YMM); (v) associativities of the data caches; (vi) latencies of the data cache memories, and (vii) SSE instruction latencies. For different hardware parameters, different schedules for MVM are produced.
We define the MVM problem as \(Y = Y + A \times X\), where \(A\) is of size \(M \times N\).
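For reference, the problem definition above corresponds to the following unoptimized kernel; this is only an illustrative baseline (the function name and row-wise storage are our assumptions, not taken from the paper):

```c
#include <stddef.h>

/* Reference (unoptimized) kernel for Y = Y + A * X, where A is M x N and
   stored row-wise. Illustrative baseline only; all names are hypothetical. */
void mvm_naive(size_t M, size_t N, const float *A, const float *X, float *Y)
{
    for (size_t i = 0; i < M; i++) {
        float acc = Y[i];
        for (size_t j = 0; j < N; j++)
            acc += A[i * N + j] * X[j];   /* each element of A is used once */
        Y[i] = acc;
    }
}
```

All later schedules in this section compute the same result; they differ only in the order in which the elements of A, X and Y are visited.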
MVM performance depends on the time needed for the (i) data to be loaded/stored; (ii) matrix operations to be executed; (iii) addressing instructions to be executed, and (iv) instructions to be loaded from the L1 instruction cache. The time needed for the instructions to be fetched from the L1 instruction cache is lower than the above values, since the MVM code size is small and always fits in the L1 instruction cache. Furthermore, most of today’s general purpose processors contain separate L1 data and instruction caches, and thus we can assume that shared/unified L2/L3 caches contain only data.
The \(T_\mathrm{addressing}\) value is not a constant number; its value increases when (i) the number of levels of tiling increases, (ii) the tile sizes decrease, (iii) fewer elements are assigned into registers, and (iv) the loop unroll factor decreases.
Furthermore, \(T_\mathrm{data} \succ T_\mathrm{matrix-operations}\) in most cases, i.e., if the data do not fit in L1 data cache, since at minimum (\(N \times M + M + N\)) elements are accessed from the slower upper memory (arrays A, Y and X, respectively). Also, in most cases, \(T_\mathrm{data} \succ T_\mathrm{addressing}\), i.e., if the data do not fit in L1 data cache. Moreover, the \(T_\mathrm{matrix-operations}\) value is a constant number, but \(T_\mathrm{data}\) and \(T_\mathrm{addressing}\) are not; by decreasing the \(T_\mathrm{data}\) value, e.g., by applying loop tiling, the \(T_\mathrm{addressing}\) value is increased, while the number of matrix operations remains constant.
To summarize, MVM performance depends on the \(T_\mathrm{data}\) and \(T_\mathrm{addressing}\) values; given that (i) \(T_\mathrm{data}\) and \(T_\mathrm{addressing}\) are dependent; (ii) the load/store unit and the execution unit, work in parallel, high performance is achieved, only for both low \(T_\mathrm{data}\) and \(T_\mathrm{addressing}\) values. The separate optimization of the \(T_\mathrm{data}\) and \(T_\mathrm{addressing}\) values, gives different schedules which cannot coexist (refining one degrades the other).
The \(T_\mathrm{data}\) value is found theoretically according to the memory hierarchy parameters, but \(T_\mathrm{addressing}\) is not. This is because the number of addressing instructions highly depends on the target compiler and on the unroll factor values; thus, although it is well known that the number of addressing instructions is increased by (a) increasing the number of levels of tiling, (b) decreasing the tile sizes, (c) assigning fewer elements into registers and (d) decreasing the loop unroll factor, we cannot make a good approximation of the number of addressing instructions. This is why a small number of possible solutions has to be tested. However, we do not test all possible solutions to find the most efficient one, but a number that is orders of magnitude smaller. To find the minimum \(T_\mathrm{total}\) value according to Eqs. 1 and 2, only the schedules giving both low \(T_\mathrm{data}\) and \(T_\mathrm{addressing}\) values are selected (these two values are dependent); since we cannot make a good approximation of \(T_\mathrm{addressing}\), the proposed methodology tests all the solutions giving a low \(T_\mathrm{data}\) value.
To sum up, MVM performance depends on the \(\text {max}(T_\mathrm{data}, T_\mathrm{addressing})\) value. The proposed methodology minimizes this value by proposing a different schedule for different types of cores, numbers of cores, memory architecture parameters and data array layouts. The remainder of this section presents all these schedules. The proposed methodology using one CPU core and all CPU cores is given in Sects. 3.1 and 3.2, respectively.
3.1 Single-core CPUs
MVM performance highly depends on the data array layout of A in main memory, as matrix A is by far larger than X and Y arrays (\(N \times M\) elements are accessed for matrix A while \(N \times M + N + M\) elements are accessed in total). The default data array layout depends on the target compiler, e.g., Fortran stores the array data column-wise in main memory, while C row-wise.
However, if the elements of A are written in main memory neither row-wise nor column-wise, but tile-wise, i.e., the elements are written with the exact order they are fetched from main memory according to the corresponding memory access pattern, then the performance is highly increased.
On the other hand, by changing the data array layout of A, an extra overhead is introduced and thus it is not performance efficient in most cases; the overhead is large, as the data of A are reloaded and rewritten from/to main memory. However, there are several cases where changing the data array layout of A is efficient: (i) if the MVM input data are produced by the current application at run time; in this case, the initialization of A and the change of its layout are made together, decreasing the overhead; (ii) if the data array layout is precomputed; (iii) if the matrix is multiplied many times by different vectors, and (iv) if the data array layout is computed by a custom hardware module.
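A minimal sketch of such a layout change follows; it assumes the tile-wise order of Sect. 3.1.1 (strips of \(k_0\) rows whose \(k_0\)-element columns are stored consecutively, in the order the schedule visits them) and that M is a multiple of \(k_0\). The function name is illustrative and not from the paper:

```c
#include <stddef.h>

/* Rewrite A (M x N, row-wise) into a tile-wise layout: within each strip of
   k0 rows, the k0-element columns are stored consecutively, left to right,
   matching the access order of the schedule in Sect. 3.1.1. Assumes M is a
   multiple of k0; handling of remainder rows is omitted for brevity. */
void to_tilewise(size_t M, size_t N, size_t k0, const float *A, float *At)
{
    size_t out = 0;
    for (size_t ib = 0; ib < M; ib += k0)      /* one strip of k0 rows      */
        for (size_t j = 0; j < N; j++)         /* sweep columns left->right */
            for (size_t i = ib; i < ib + k0; i++)
                At[out++] = A[i * N + j];
}
```

After this rewrite, the schedule reads At strictly sequentially, which is what makes the tile-wise layout cache- and prefetcher-friendly.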
The proposed methodology gives different schedules when the data array layout of A is changed, i.e., cases (i)–(iv) above, and when it is not. Furthermore, it gives different schedules depending on whether the processor contains an SIMD unit (general purpose processors) or not (embedded processors, microcontrollers). The proposed methodology when array A is written in main memory tile-wise and no SIMD unit exists is given in Sect. 3.1.1. The proposed methodology when array A is written in main memory row-wise and no SIMD unit exists is given in Sect. 3.1.2. The proposed methodology for SIMD processors is given in Sect. 3.1.3.
3.1.1 Tile-wise data array layout of A: CPUs without SIMD
\(k_0\), \(p\) and \(1\) registers are allocated for the Y, A and X arrays, respectively. The elements of matrix A have no data reuse and each element of A is fetched just once; thus, just one register is needed to store A’s elements; however, if \(p\) instructions can be executed in parallel, \(p\) registers are needed for matrix A to have all functional units working in parallel (\(p \ge 1\)).
The scheduling follows. The first \(k_0\) elements of the first column of A (Fig. 1) make a tile (Tile0). All the tile’s elements are multiplied by the first-top element of X (data reuse of X) and their results are stored into the first-top \(k_0\) elements of Y (Tile0). After the multiplication of all the tile’s elements by X(0), the tile is moved by one position to the right and all its elements are multiplied by X(1), etc. The results are always stored in-place, namely in the same registers, exploiting production–consumption of intermediate results of Y. According to the above schedule, the \((Y, X, A)\) arrays are fetched \((M, M^2/k_0, M^2)\) instead of \((M, M^2, M^2)\) times (when the row–column schedule is used), respectively.
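A scalar sketch of this register-blocked schedule follows. It keeps K0 partial sums of Y in local variables (intended to live in registers) while the tile of A slides right; for readability the sketch indexes A row-wise rather than tile-wise, and the fixed block height K0 and function name are our assumptions:

```c
#include <stddef.h>

#define K0 4  /* illustrative register-block height; the paper derives k0 from Eq. 5 */

/* Register-blocked schedule of Sect. 3.1.1 (scalar sketch): K0 partial sums
   of Y stay in registers while the K0-element tile of A slides right, so
   each X element is loaded once per strip of K0 rows instead of once per
   row. Assumes M is a multiple of K0; A is indexed row-wise for clarity. */
void mvm_regblock(size_t M, size_t N, const float *A, const float *X, float *Y)
{
    for (size_t ib = 0; ib < M; ib += K0) {
        float acc[K0];
        for (size_t t = 0; t < K0; t++)
            acc[t] = Y[ib + t];
        for (size_t j = 0; j < N; j++) {
            float x = X[j];                 /* data reuse of X across the tile */
            for (size_t t = 0; t < K0; t++)
                acc[t] += A[(ib + t) * N + j] * x;
        }
        for (size_t t = 0; t < K0; t++)     /* production-consumption: Y is written once */
            Y[ib + t] = acc[t];
    }
}
```

With this schedule, X is loaded \(M^2/K0\) times instead of \(M^2\), matching the counts in the text above.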
The \(\text {L1}_\mathrm{loads}\), \(\text {L1}_\mathrm{stores}\), \(\text {MM}_\mathrm{loads}\) and \(\text {MM}_\mathrm{stores}\) values of Eq. 6 depend on the arrays size and on the data cache size. If the \(k_0\) rows of A and X remain in L1 data cache, the number of L1 misses is low; on the other hand, if they are larger than the L1 data cache size, the number of the L1 misses increases, and therefore the number of L2/main memory accesses too.
Table 1 Number of data accesses when A is of size \(M \times M\)

**Case 1: X does not fit in L1** \(\left( \frac{\text {L1} \times \left( \text {assoc}-k\right) }{\text {assoc}} \prec k_0 \times M \times dtype\right) \), 1 level of data cache:

| No optimization | Y | X | A |
|---|---|---|---|
| L1 | \(M\) | \(M^2\) | \(M^2\) |
| DDR | \(M\) | \(M^2\) | \(M^2\) |

| Tiling for RF | Y | X | A |
|---|---|---|---|
| L1 | \(M\) | \(M^2/k_0\) | \(M^2\) |
| DDR | \(M\) | \(M^2/k_0\) | \(M^2\) |

| Tiling for L1 and RF | Y | X | A |
|---|---|---|---|
| L1 | \(M^2/k_1\) | \(M^2/k_0\) | \(M^2\) |
| DDR | \(M^2/k_1\) | \(M\) | \(M^2\) |

**Case 2: X does not fit in L2** \(\left( \frac{\text {L2} \times \left( \text {assoc}-k\right) }{\text {assoc}} \prec k_0 \times M \times dtype\right) \), 2 levels of data cache:

| No optimization | Y | X | A |
|---|---|---|---|
| L1 | \(M\) | \(M^2\) | \(M^2\) |
| L2 | \(M\) | \(M^2\) | \(M^2\) |
| DDR | \(M\) | \(M^2\) | \(M^2\) |

| Tiling for RF | Y | X | A |
|---|---|---|---|
| L1 | \(M\) | \(M^2/k_0\) | \(M^2\) |
| L2 | \(M\) | \(M^2/k_0\) | \(M^2\) |
| DDR | \(M\) | \(M^2/k_0\) | \(M^2\) |

| Tiling for L1 and RF | Y | X | A |
|---|---|---|---|
| L1 | \(M^2/k_1\) | \(M^2/k_0\) | \(M^2\) |
| L2 | \(M^2/k_1\) | \(M\) | \(M^2\) |
| DDR | \(M^2/k_1\) | \(M\) | \(M^2\) |

| Tiling for L2, L1 and RF | Y | X | A |
|---|---|---|---|
| L1 | \(M^2/k_1\) | \(M^2/k_0\) | \(M^2\) |
| L2 | \(M^2/k_{2x}\) | \(M^2/k_{2y}\) | \(M^2\) |
| DDR | \(M^2/k_{2x}\) | \(M\) | \(M^2\) |

**Case 3: X fits in L2** \(\left( \frac{\text {L2} \times \left( \text {assoc}-k\right) }{\text {assoc}} \succ k_0 \times M \times dtype\right) \) **but does not fit in L1** \(\left( \frac{\text {L1} \times \left( \text {assoc}-k\right) }{\text {assoc}} \prec k_0 \times M \times dtype\right) \), 2 levels of data cache:

| Tiling for RF | Y | X | A |
|---|---|---|---|
| L1 | \(M\) | \(M^2/k_0\) | \(M^2\) |
| L2 | \(M\) | \(M^2/k_0\) | \(M^2\) |
| DDR | \(M\) | \(M\) | \(M^2\) |

| Tiling for L2, L1 and RF | Y | X | A |
|---|---|---|---|
| L1 | \(M^2/k_1\) | \(M^2/k_0\) | \(M^2\) |
| L2 | \(M^2/k_1\) | \(M^2/k_{2y}\) | \(M^2\) |
| DDR | \(M\) | \(M\) | \(M^2\) |
In the case that the sum of the sizes of the \(k_0\) rows of A and the X array is larger than the L1 data cache size, i.e., \((k_0+1) \times N \succ L1\), the X array does not fit in L1 data cache and thus it is fetched not once but \(N/k_0\) times from the slower memory above L1. To decrease the number of accesses to the upper memory, loop tiling is applied by fully utilizing the L1 data cache size and associativity. Tiling is applied so that the X array remains in L1 data cache, since it is reused \(N/k_0\) times.
Let us explain Eq. 9 in more detail; Eq. 9 is introduced to minimize the number of L1 data cache misses and therefore the number of upper memory accesses. Given that each row of A is multiplied by X, X is reused \(M\) times and thus it has to remain in L1 data cache. To achieve this, the cache lines of A are written in L1 without conflicting with those of X, by loading \(k_0\) sub-rows of A (Tile1 of A) and one sub-row of X (Tile1 of X) into separate L1 ways; this is possible because the A array is stored tile-wise in main memory (consecutive main memory locations) and because (\(k \times \frac{\text {L1}}{\text {assoc}_\mathrm{L1}}\)) of the L1 size is used for X and (\((\text {assoc}_\mathrm{L1}-k) \times \frac{\text {L1}}{\text {assoc}_\mathrm{L1}}\)) for A (in Eq. 9). Loosely speaking, this is equivalent to using \(k\) cache ways for X and (\(\text {assoc}_\mathrm{L1}-k\)) cache ways for A. An empty cache line is always granted (with respect to the size of the cache) for each different modulo of the A and X memory addresses. It is important to note that if associativity is not taken into account and \(\text {L1} \ge (k_0 \times k_1 + k_1)\) is used instead of Eq. 9, the number of L1 misses will be much larger because A and X will conflict with each other. To our knowledge, this is the first time for an MVM algorithm that the cache associativity is utilized. Furthermore, the Y array is stored into main memory infrequently (usually 1–2 cache lines are written to memory when \(k_0\) sub-rows of A have been multiplied by X); thus, the number of L1 conflicts due to Y can be neglected (if a victim cache exists, these conflicts never occur).
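Since Eq. 9 itself is not reproduced in this excerpt, the sketch below is only an interpretation of the way-partitioning rule described in the text: \(k\) of the L1 ways hold the X tile and the remaining \((\text{assoc}-k)\) ways hold the \(k_0\) sub-rows of A; the function returns the largest tile width \(k_1\) satisfying both capacity constraints. All names are illustrative:

```c
#include <stddef.h>

/* Interpretation of the L1 way-partitioning rule around Eq. 9 (the equation
   is not shown in this excerpt): k of the assoc L1 ways are reserved for the
   X tile and the remaining (assoc - k) ways for the k0 sub-rows of A.
   Returns the largest tile width k1, in elements, fitting both partitions. */
size_t tile_width_k1(size_t L1_bytes, size_t assoc, size_t k,
                     size_t k0, size_t dtype_bytes)
{
    size_t way_bytes = L1_bytes / assoc;
    size_t a_bytes   = (assoc - k) * way_bytes;   /* capacity reserved for A  */
    size_t x_bytes   = k * way_bytes;             /* capacity reserved for X  */
    size_t k1_a = a_bytes / (k0 * dtype_bytes);   /* k0 sub-rows of length k1 */
    size_t k1_x = x_bytes / dtype_bytes;          /* one sub-row of X         */
    return k1_a < k1_x ? k1_a : k1_x;             /* both must fit            */
}
```

For a typical 32 KB, 8-way L1 with 4-byte floats, \(k_0 = 4\) and one way reserved for X, this yields a tile width of 1024 elements, bounded by the single X way.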
In the case that the data cache is direct mapped (\(\text {assoc}_\mathrm{L1} = 1\)), loop tiling is not efficient since no tile can remain in data cache due to the cache modulo effect.
The \(k_0\) value produced by Eq. 5 is the value minimizing the number of load/store instructions and the number of register spills. Furthermore, the \(k_1\) value produced by Eq. 9 minimizes the number of L1 data cache misses (or, equivalently, the number of main memory accesses here), given the \(k_0\) value. However, these two sub-problems depend on each other: by decreasing the number of main memory data accesses, the number of L1 data accesses is increased. The number of main memory data accesses is minimized when the \(k_1\) value is maximized, according to Eq. 9; likewise, the number of L1 data cache accesses is minimized when the \(k_0\) value is maximized, according to Eq. 5; however, according to Eq. 9, when \(k_1\) increases, \(k_0\) decreases (the maximum \(k_1\) value is given for \(k_0=1\)). There is a trade-off in the (\(k_0,k_1\)) selection and thus the minimum \(T_\mathrm{data}\) value depends on the target architecture parameters. The proposed methodology optimizes these two sub-problems together as one problem and not separately; the (\(k_0,k_1\)) values are found theoretically according to the memory hierarchy architecture parameters (Eq. 6).
Furthermore, the Tile1 tiles of A are chosen to be as large as possible in the x-axis only. This is because (i) the larger the tile of A, the larger the production and consumption of Y (as the Tile0 of A is moved by one position to the right in Fig. 1, the elements of Y are produced-consumed); (ii) by tiling only the x-dimension, a lower number of addressing instructions is achieved (there is one loop less), and (iii) the larger the Tile1 of A, the larger the data reuse of X, as the tiles are accessed in a column-wise way; each Tile1 of X is fetched just once. Although the results are stored into memory \(N/k_1\) times, the \(k_1\) value is maximum, since Tile1 of A is as large as possible in the x-axis dimension.
Regarding 2-level data cache architectures, if the following inequality holds \(M \times k_1 + M + k_1 \succ L2\) (the Y array, all Tile1 tiles of one block column of A and one Tile1 of X, do not fit in L2 data cache), the \(Y\) array is loaded/stored not once but \(M/k_1\) times, from main memory. To decrease the number of main memory data accesses, loop tiling is applied by fully utilizing the L2 data cache size and associativity. Tiling is applied so that the Y array remains in L2 data cache, as it is loaded-reused \(N/k_1\) times.
Having these two tiles in L2, X is fetched just once while Y is fetched \(M/k_{2x}\) times from main memory. The number of main memory accesses is minimized when \(k_{2x}=M\), but in this case the \(k_{2y}\) value is decreased and the number of L2 data cache accesses is increased (Table 1); there is a trade-off and thus the \(k_{2x}\) and \(k_{2y}\) values are found theoretically according to the memory hit latency values. For most general purpose processors, where the L2 data cache size is large, \(k_{2x}=M\). In most cases, we select \(k_{2x} \gg k_{2y}\), since the main memory is many times slower than the data cache.
The (\(k_{2y}\), \(k_{2x}\), \(k_{1}\), \(k_0\)) values are computed by optimizing the following sub-problems, as one problem and not separately, as they depend on each other; these are the sub-problems of minimizing the number of L1, L2 and main memory accesses.
Regarding full memory hierarchy utilization, one level of tiling for each level of data cache and one more for the register file is used; loop tiling decreases the number of data cache misses but increases the number of load/store and addressing instructions (typically one more level of tiling in MVM algorithm adds two additional loops). However, tiling for each level of data cache is not performance efficient in several cases, as the additional number of instructions (both load/store and arithmetic) may degrade performance.
To summarize, the \(T_\mathrm{data}\) value for 1 level of tiling is given by Eqs. 7 and 8, while the \(T_\mathrm{data}\) value for 2 levels of tiling is given by Eq. 12. Also, Eqs. 9 and 10 give the tile sizes.
3.1.2 Row-wise data array layout of A: CPUs without SIMD
Regarding register file utilization (if the \(k_0\) rows of A and X fit in L1 data cache), the scheduling explained in the previous subsection is used. However, in the row-wise case, if the register file size is large and thus the \(k_0\) value is large, the main memory bandwidth usage is increased, degrading performance; this is because the elements of each sub-row of A are written in different main memory locations (and also in different main memory pages), and reading from a large number of different main memory locations needs more time. Therefore, the \(k_0\) value has an upper threshold which is bounded by the main memory parameters (this is further explained in Sect. 3.2).
Regarding the L2 data cache, tiling is applied as explained in the previous subsection, but with \(k_{2x}=N\). If \(k_{2x} \prec N\), the sub-rows of each Tile2 of A are written in non-consecutive main memory locations and, thus, in non-consecutive data cache locations; therefore, the number of L2 data cache misses is increased. By selecting \(k_{2x}=N\), all the sub-rows of Tile2 of A are of size \(N\) and they are written in consecutive main memory locations, minimizing the number of L2 data cache misses.
3.1.3 CPUs with SIMD
Most of today’s general purpose processors have a Single Instruction Multiple Data (SIMD) unit in order to further increase performance by using vector operations. However, to efficiently use the SIMD unit, (i) the schedule decision must take into account the instruction selection problem, since there are instructions with varying latency/throughput values (e.g., ‘hadd’ instructions have very large latency and throughput values), and (ii) the data have to be aligned in main memory according to the SIMD register size.
Regarding SIMD architectures, the schedule explained in Sects. 3.1.1 and 3.1.2 is not performance efficient. To implement the above schedule with AVX/SSE technology, several copies of each element of X have to be stored into one YMM/XMM register, depending on the element size and register size; loading just one element needs more time than loading a whole L1 cache line, and this is why this schedule is not performance efficient. The SIMD unit introduces additional constraints, which are fully exploited here.
The scheduling follows. Consider that there are 8 XMM registers (XMM0:XMM7) of size 128 bits and the arrays contain floating point data (4 bytes each). The first 4 elements of A and X are fetched into XMM0 and XMM1, respectively. They are multiplied by each other and the result is stored into XMM2. Then, the first 4 elements of the second row of A are loaded into XMM0 again; XMM0 is multiplied by XMM1 and the result is stored into XMM3. The procedure is repeated for the 3rd, 4th, 5th and 6th rows of A (XMM2:XMM7 contain the values of Y[0]:Y[5], respectively). Afterwards, the second quartet of X is loaded into XMM1 and the second quartet of the first row of A is loaded into XMM0; they are multiplied by each other and the result is accumulated into XMM2. Then the second quartet of the second row of A is loaded into XMM0 again, etc. When the first six rows of A have been fully multiplied by X, the four values of each one of the XMM2:XMM7 registers are summed and the results are stored into main memory (Y array), e.g., the sum of the four XMM2 values is Y(0). The above procedure continues until all rows of A have been multiplied by X. The Y array is written in memory once, while array X is loaded \(N/(\text {Regs}-2)\) times.
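A sketch of this schedule with SSE intrinsics follows; it is not the paper's implementation, and it assumes an x86 target, \(M\) a multiple of 6, \(N\) a multiple of 4, and unaligned loads for simplicity. The horizontal sum avoids the slow 'hadd' instruction mentioned above:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE; assumes an x86 target */

/* Horizontal sum of the four lanes of v, without using 'hadd'. */
static float hsum(__m128 v)
{
    __m128 s = _mm_add_ps(v, _mm_movehl_ps(v, v));   /* (a+c, b+d, ., .) */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));      /* a+c+b+d in lane 0 */
    return _mm_cvtss_f32(s);
}

/* SSE sketch of the schedule of Sect. 3.1.3: six XMM accumulators hold the
   partial sums of six consecutive rows while one XMM register holds the
   current quartet of X, reused by all six rows. Assumes M % 6 == 0 and
   N % 4 == 0; A is stored row-wise. */
void mvm_sse(size_t M, size_t N, const float *A, const float *X, float *Y)
{
    for (size_t i = 0; i < M; i += 6) {
        __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps(),
               acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps(),
               acc4 = _mm_setzero_ps(), acc5 = _mm_setzero_ps();
        for (size_t j = 0; j < N; j += 4) {
            __m128 x = _mm_loadu_ps(&X[j]);          /* reused six times */
            acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(&A[(i+0)*N+j]), x));
            acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(&A[(i+1)*N+j]), x));
            acc2 = _mm_add_ps(acc2, _mm_mul_ps(_mm_loadu_ps(&A[(i+2)*N+j]), x));
            acc3 = _mm_add_ps(acc3, _mm_mul_ps(_mm_loadu_ps(&A[(i+3)*N+j]), x));
            acc4 = _mm_add_ps(acc4, _mm_mul_ps(_mm_loadu_ps(&A[(i+4)*N+j]), x));
            acc5 = _mm_add_ps(acc5, _mm_mul_ps(_mm_loadu_ps(&A[(i+5)*N+j]), x));
        }
        /* sum each accumulator and write the six Y values once */
        Y[i+0] += hsum(acc0); Y[i+1] += hsum(acc1); Y[i+2] += hsum(acc2);
        Y[i+3] += hsum(acc3); Y[i+4] += hsum(acc4); Y[i+5] += hsum(acc5);
    }
}
```

Note that the six accumulators plus the X and A registers account for the 8 XMM registers of the description above.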
Regarding L1 data cache utilization, the schedule used is the same as in the previous subsections. However, loop tiling is less effective here, because most of today’s general purpose processors have hardware prefetchers for both the L1 and L2 caches, decreasing the number of data cache misses.
3.2 Multi-core CPUs
With more than one core, the MVM problem is partitioned into smaller MVM sub-problems. Each sub-problem corresponds to a thread and each thread is executed on one core only. Each thread must contain at least a certain number of instructions to minimize core ‘idle time’; otherwise, partitioning MVM into smaller sub-problems makes the thread initialization and synchronization time comparable to the thread execution time, leading to low performance. This is an additional constraint.
Most of today’s multi-core processors contain 2 or 3 levels of cache: either (i) separate L1 data and instruction caches and a shared L2 cache, or (ii) separate L1 data and instruction caches, separate unified L2 caches and a shared L3 cache. All caches use an LRU replacement policy.
The MVM problem is partitioned into threads according to the number of cores. All threads run in parallel, without communicating with each other. This is achieved by partitioning the matrix A horizontally into \(q\) equal parts and by partitioning the array Y into \(q\) equal parts, where \(q\) is the number of cores. The multiplication of each part of A by X constitutes a different thread; thus the MVM problem is partitioned into \(q\) different threads and each thread writes its own part of Y to main memory. Each thread is executed on only one core and the schedule is almost the same as in Sect. 3.1 (the differences are explained below).
By partitioning the MVM into threads and running them in parallel, the performance-critical parameters become the main memory latency and bandwidth values. Although data reuse and the data cache hierarchy have been fully utilized, minimizing the number of main memory accesses to X and Y, the elements of A (which are far more numerous than those of X and Y) are used only once (no data reuse); thus, the number of main memory accesses to A cannot be reduced (\(N \times M\) loads). Each multiply–accumulate needs a different element of A, which is why the main memory latency cannot be hidden by the cache (the CPU remains idle for several cycles).
Thus, the MVM execution time depends on the time needed to load matrix A from main memory. This is why a high core utilization factor cannot be achieved: MVM performance is bounded by the main memory latency and bandwidth values, and cannot be increased beyond this bound.
Furthermore, the main memory bandwidth would be low if we adopted the schedule explained in Sect. 3.1, because each thread would access data from many different main memory locations. To be more precise, the elements of each sub-row of A reside in different main memory locations, and reading from different main memory locations takes more time. Therefore, as the \(k_0\) value decreases, the time needed to fetch the elements of A decreases. On the other hand, as the \(k_0\) value increases according to Eq. 15, the number of load/store instructions and the number of data cache accesses decrease (however, these are not the critical parameters here). Therefore, the proposed methodology gives a different \(k_0\) value according to the number of cores used and the main memory parameters; as the number of cores increases, the \(k_0\) value decreases.
In the case that the data array layout is tile-wise, MVM performance is greatly increased. This is because (i) array A is accessed from consecutive main memory locations only, increasing the main memory bandwidth; (ii) the hardware prefetchers can prefetch the data of A, because only one page is accessed at a time; and (iii) the X array fits in the private caches in most cases.
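One way to obtain such a tile-wise layout is to repack A once, before the multiplication, in the order the kernel visits its tiles (an illustrative sketch; the tile dimensions R and C, the function name, and the assumption that R divides M and C divides N are ours):

```c
#include <stddef.h>

/* Repack row-major A (M x N) into Ap so that R x C tiles are stored
   contiguously, in the order the kernel visits them.  The kernel can
   then read Ap strictly sequentially. */
static void pack_tilewise(const float *A, float *Ap,
                          size_t M, size_t N, size_t R, size_t C)
{
    size_t idx = 0;
    for (size_t rt = 0; rt < M; rt += R)          /* tile row    */
        for (size_t ct = 0; ct < N; ct += C)      /* tile column */
            for (size_t r = rt; r < rt + R; r++)  /* inside tile */
                for (size_t c = ct; c < ct + C; c++)
                    Ap[idx++] = A[r * N + c];
}
```

The packing cost is a single pass over A; it pays off because every later access to Ap is a unit-stride stream that the prefetchers can follow.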
For commercial general purpose processors, the minimum number of main memory accesses is achieved even if no tiling for the data cache is applied, since shared caches are very large. This is because (i) Y and A are accessed only once and (ii) the elements of X, which are loaded many times, remain in the shared cache (data reuse of X). The X array remains in the shared cache because (i) it is fetched many times by all the cores (LRU cache replacement policy) and (ii) its size is always smaller than the size of one shared cache way.
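This condition can be checked with simple arithmetic: under LRU, X can stay resident when its footprint does not exceed one cache way, i.e., the cache size divided by the associativity (an illustrative helper; the function name is ours):

```c
#include <stddef.h>

/* One way of an a-way set-associative cache holds cache_bytes / assoc
   bytes; X stays resident under LRU whenever its footprint fits there. */
static int fits_in_one_way(size_t n_elems, size_t elem_bytes,
                           size_t cache_bytes, size_t assoc)
{
    return n_elems * elem_bytes <= cache_bytes / assoc;
}
```

For example, a 4 Mbyte 16-way shared cache has 256 kbyte ways, so an X array of 4-byte floats fits in one way as long as \(N \le 65536\).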
According to Eq. 17, one L2 cache way is needed for each row of A. This is because the elements of the \(k_0\) different rows of A are not in consecutive main memory locations and, thus, they are not fetched into consecutive L2 cache locations.
Regarding 3-level data cache architectures, tiling for the data cache can be applied in several ways, i.e., tiling for (i) L1, (ii) L2, or (iii) both. However, applying tiling does not further decrease the number of main memory accesses, as both Y and X remain in L3 even without tiling, because the L3 shared cache is very large in modern architectures. In general, performance is only slightly affected by applying tiling on general purpose processors, because the critical parameter is the number of main memory accesses.
4 Experimental results
The experimental results for the proposed methodology, presented in this section, were obtained on an Intel Core 2 Duo E6550 [53], an Intel Core 2 Duo T6600 at 2.20 GHz [54] and an Intel Core i7-3930K (6 cores) at 3.2 GHz [55], using SSE/AVX instructions and ATLAS 3.8.4. Also, the Valgrind tool [23] and the SimpleScalar simulator [24] were used to measure the total number of L1 and L2 data cache accesses and misses. The first processor contains 8 128-bit XMM registers, L1 data and instruction caches of size 32 kbytes and a shared L2 cache of size 4 Mbytes. The second processor contains 16 XMM registers, L1 data and instruction caches of size 32 kbytes and a shared L2 cache of size 3 Mbytes. The third processor contains 16 256-bit YMM registers, L1 data and instruction caches of size 32 kbytes, a unified L2 cache of size 256 kbytes and a shared L3 cache of size 12 Mbytes. All processors run the Ubuntu operating system and use the gcc-4.4.3 compiler. Furthermore, the SimpleScalar simulator [24] was used to measure the number of data cache misses when no SIMD unit is used. In the experimental procedure, square matrices (\(N \times N\)) of floating point numbers (4 bytes each) were used. Note that each MVM thread has to be manually assigned to just one core; the programmer has to set the CPU thread affinity. Otherwise, the operating system (OS) makes the core assignment and may toggle the thread among the cores, degrading performance because of the data movement involved.
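Setting the thread affinity can be done on Linux with the pthread_setaffinity_np GNU extension (a platform-specific sketch; the helper name is ours):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin a thread to a single core so the OS scheduler cannot migrate it
   between cores (Linux/GNU-specific).  Returns 0 on success. */
static int pin_to_core(pthread_t thread, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(cpu_set_t), &set);
}
```

A typical use is to call pin_to_core right after pthread_create, passing thread index t as the core number, so thread t always runs on core t.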
The proposed methodology is on average \(1.23\) and \(1.33\) times faster than the ATLAS library on the E6550 (arch1) and the T6600 (arch2), respectively; this is because it achieves a lower number of load/store instructions or, equivalently, L1 data cache accesses. For arch1, we achieve \(1.38\) times fewer load/store instructions than ATLAS. The number of load/store instructions is much lower because the register file is fully utilized (Eq. 15); first, we use all the available XMM registers and, second, we use them efficiently according to the data reuse. Regarding the number of main memory data accesses, the proposed methodology achieves approximately the same number as ATLAS (about 1 % fewer); the gain is small because the L2 caches are large and the whole Y and X arrays and several rows of A fit in L2. As explained in the previous subsections, the three arrays are accessed only once from main memory (Table 1) and thus the number of main memory data accesses is the minimum. Although we achieve only 1 % fewer main memory accesses, this gain is significant, since only the number of Y and X accesses can be decreased and not those of A (matrix A is many times larger than Y and X); the \(N \times M\) elements of A (out of \(N\times M + N + M\) in total) are accessed only once (no data reuse). Regarding the number of L2 cache accesses, the proposed methodology achieves approximately the same number as ATLAS (about 1 % more L2 data accesses). This is because ATLAS applies loop tiling for L1 here; note that we do not select the schedule achieving the lowest number of data cache accesses, but the schedule achieving the best performance. Furthermore, the execution time of the proposed methodology increases proportionally to the matrix size, while that of ATLAS does not; this is because ATLAS applies different schedules.
In the case that the data array layout of A is tile-wise, the speedup is further increased (to about 1.42) for two reasons. First, the elements of A are fetched from consecutive main memory locations (if the data array layout is row-wise, the elements of each sub-row of A reside in different main memory locations, and reading from different memory addresses takes more time); in this way the main memory bandwidth is increased. Second, the hardware prefetchers can prefetch the data of A.
An evaluation for many different \(k_0\) values was also performed on the i7-3930K, for the row-wise case (Fig. 7). The Intel i7-3930K contains 16 256-bit YMM (AVX) registers; when only one core is used (\(q=1\)), only \(k_0 \ge 8\) values achieve high performance. When more than one core is used, the main memory bandwidth increases as the \(k_0\) value decreases; this is because the elements of each sub-row of A reside in different main memory locations, and reading from different addresses takes more time. On the other hand, as the \(k_0\) value increases according to Eq. 15, the number of load/store instructions and the number of data cache accesses decrease. However, main memory is by far the most critical parameter here and, thus, a small \(k_0\) value is selected according to the main memory parameters. When 5 or 6 cores are used, the main memory load is very high, and thus \(k_0=1\) and \(k_0=2\) are the best values; in this case, register file and data cache utilization do not increase performance, as main memory is the critical parameter. For the same reason, \(k_0=1\) achieves the best performance for \(q=6\) and the worst for \(q=1\).
5 Conclusions
In this paper, a new methodology for computing the Dense Matrix Vector Multiplication for both embedded (single core processors without SIMD unit) and general purpose processors (single and multi-core processors, with SIMD unit) was presented. This methodology achieves higher execution speed than the state-of-the-art ATLAS library, and the final schedule is found theoretically according to the memory hierarchy parameters. Furthermore, by utilizing the software and hardware parameters, the search space is decreased by orders of magnitude; therefore, instead of searching the whole space, only a small number of solutions is tested.
References
- 1.Whaley RC, Petitet A (2005) Minimizing development and maintenance costs in supporting persistently optimized BLAS. Softw Pract Exp 35(2):101–121
- 2.OpenBLAS (2012). http://xianyi.github.com/OpenBLAS
- 3.Krivutsenko A (2008) GotoBLAS—anatomy of a fast matrix multiplication. Technical report
- 4.Guennebaud G, Jacob B et al (2010) Eigen v3. http://eigen.tuxfamily.org
- 5.Intel: Intel MKL (2012). http://software.intel.com/en-us/intel-mkl
- 6.Bilmes J, Asanović K, Chin C, Demmel J (1997) Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In: Proceedings of the international conference on supercomputing. ACM SIGARC, Vienna, Austria
- 7.Frigo M, Johnson SG (1997) The fastest Fourier transform in the west. Technical report. Cambridge, MA, USA
- 8.Milder P, Franchetti F, Hoe JC, Püschel M (2012) Computer generation of hardware for linear digital signal processing transforms. ACM Trans Des Autom Electron Syst 17(2):15:1–15:33. doi:10.1145/2159542.2159547
- 9.Pinter SS (1996) Register allocation with instruction scheduling: a new approach. J Prog Lang 4(1):21–38
- 10.Shobaki G, Shawabkeh M, Rmaileh NEA (2008) Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach. ACM Trans Archit Code Optim 10(3):14:1–14:31. doi:10.1145/2512432
- 11.Bacon DF, Graham SL, Sharp OJ (1994) Compiler transformations for high-performance computing. ACM Comput Surv 26(4):345–420. doi:10.1145/197405.197406
- 12.Granston E, Holler A (2001) Automatic recommendation of compiler options. In: Proceedings of the workshop on feedback-directed and dynamic optimization (FDDO)
- 13.Triantafyllis S, Vachharajani M, Vachharajani N, August DI (2003) Compiler optimization-space exploration. In: Proceedings of the international symposium on code generation and optimization: feedback-directed and runtime optimization, CGO ’03, pp 204–215. IEEE Computer Society, Washington, DC, USA. http://dl.acm.org/citation.cfm?id=776261.776284
- 14.Cooper KD, Subramanian D, Torczon L (2002) Adaptive optimizing compilers for the 21st century. J Supercomput 23(1):7–22. doi:10.1023/A:1015729001611
- 15.Kisuki T, Knijnenburg PMW, O’Boyle MFP, Bodin F, Wijshoff HAG (1999) A feasibility study in iterative compilation. In: Proceedings of the 2nd international symposium on high performance computing, ISHPC ’99, pp 121–132. Springer-Verlag, London, UK. http://dl.acm.org/citation.cfm?id=646347.690219
- 16.Kulkarni PA, Whalley DB, Tyson GS, Davidson JW (2009) Practical exhaustive optimization phase order exploration and evaluation. ACM Trans Archit Code Optim 6(1):1:1–1:36. doi:10.1145/1509864.1509865
- 17.Kulkarni P, Hines S, Hiser J, Whalley D, Davidson J, Jones D (2004) Fast searches for effective optimization phase sequences. SIGPLAN Not 39(6):171–182. doi:10.1145/996893.996863
- 18.Park E, Kulkarni S, Cavazos J (2011) An evaluation of different modeling techniques for iterative compilation. In: Proceedings of the 14th international conference on compilers, architectures and synthesis for embedded systems, CASES ’11, pp 65–74. ACM, New York, NY, USA. doi:10.1145/2038698.2038711
- 19.Monsifrot A, Bodin F, Quiniou R (2002) A machine learning approach to automatic production of compiler heuristics. In: Proceedings of the 10th international conference on artificial intelligence: methodology, systems, and applications, AIMSA ’02, pp 41–50. Springer-Verlag, London, UK. http://dl.acm.org/citation.cfm?id=646053.677574
- 20.Stephenson M, Amarasinghe S, Martin M, O’Reilly UM (2003) Meta optimization: improving compiler heuristics with machine learning. SIGPLAN Not 38(5):77–90. doi:10.1145/780822.781141
- 21.Tartara M, Crespi Reghizzi S (2013) Continuous learning of compiler heuristics. ACM Trans Archit Code Optim 9(4):46:1–46:25. doi:10.1145/2400682.2400705
- 22.Agakov F, Bonilla E, Cavazos J, Franke B, Fursin G, O’Boyle MFP, Thomson J, Toussaint M, Williams CKI (2006) Using machine learning to focus iterative optimization. In: Proceedings of the international symposium on code generation and optimization, CGO ’06, pp 295–305. IEEE Computer Society, Washington, DC, USA. doi:10.1109/CGO.2006.37
- 23.Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. SIGPLAN Not 42(6):89–100. doi:10.1145/1273442.1250746
- 24.Burger D, Austin TM (1997) The SimpleScalar tool set, version 2.0. Technical report
- 25.Whaley RC, Petitet A, Dongarra JJ (2001) Automated empirical optimization of software and the ATLAS project. Parallel Comput 27(1–2):3–35
- 26.Whaley RC, Dongarra J (1999) Automatically tuned linear algebra software. In: 9th SIAM conference on parallel processing for scientific computing. CD-ROM Proceedings
- 27.Whaley RC, Dongarra J (1998) Automatically tuned linear algebra software. In: SuperComputing 1998: high performance networking and computing
- 28.Whaley RC, Dongarra J (1997) Automatically tuned linear algebra software. Technical report UT-CS-97-366, University of Tennessee
- 29.ATLAS homepage (2012). http://math-atlas.sourceforge.net/
- 30.Fujimoto N (2008) Dense matrix–vector multiplication on the CUDA architecture. Parallel Process Lett 18(4):511–530
- 31.Fujimoto N (2008) Faster matrix–vector multiplication on GeForce 8800GTX. In: IPDPS, pp 1–8. IEEE. http://dblp.uni-trier.de/db/conf/ipps/ipdps2008.html
- 32.Hendrickson B, Leland R, Plimpton S (1995) An efficient parallel algorithm for matrix–vector multiplication. Int J High Speed Comput 7:73–88
- 33.Sørensen HHB (2012) High-performance matrix–vector multiplication on the GPU. In: Proceedings of the 2011 international conference on parallel processing, Euro-Par’11, pp 377–386. Springer-Verlag, Berlin, Heidelberg
- 34.Zhang N (2012) A novel parallel scan for multicore processors and its application in sparse matrix–vector multiplication. IEEE Trans Parallel Distrib Syst 23(3):397–404. doi:10.1109/TPDS.2011.174
- 35.Williams S, Oliker L, Vuduc R, Shalf J, Yelick K, Demmel J (2007) Optimization of sparse matrix–vector multiplication on emerging multicore platforms. In: Proceedings of the 2007 ACM/IEEE conference on supercomputing, SC ’07, pp 38:1–38:12. ACM, New York, NY, USA. doi:10.1145/1362622.1362674
- 36.Goumas G, Kourtis K, Anastopoulos N, Karakasis V, Koziris N (2009) Performance evaluation of the sparse matrix–vector multiplication on modern architectures. J Supercomput 50(1):36–77. doi:10.1007/s11227-008-0251-8
- 37.Michailidis PD, Margaritis KG (2010) Performance models for matrix computations on multicore processors using OpenMP. In: Proceedings of the 2010 international conference on parallel and distributed computing. Applications and Technologies, PDCAT ’10, pp 375–380. IEEE Computer Society, Washington, DC, USA. doi:10.1109/PDCAT.2010.52
- 38.Schmollinger M, Kaufmann M (2002) Algorithms for SMP-clusters dense matrix–vector multiplication. In: Proceedings of the 16th international parallel and distributed processing symposium, IPDPS ’02, pp 57–. IEEE Computer Society, Washington, DC, USA. http://dl.acm.org/citation.cfm?id=645610.661893
- 39.Waghmare VN, Kendre SV, Chordiya SG (2011) Performance analysis of matrix–vector multiplication in hybrid (MPI + OpenMP). Int J Comput Appl 22(5):22–25. Published by Foundation of Computer Science
- 40.Baker AH, Schulz M, Yang UM (2011) On the performance of an algebraic multigrid solver on multicore clusters. In: Proceedings of the 9th international conference on high performance computing for computational science, VECPAR’10, pp 102–115. Springer-Verlag, Berlin, Heidelberg. http://dl.acm.org/citation.cfm?id=1964238.1964252
- 41.Parallel methods for matrix–vector multiplication. http://www.hpcc.unn.ru/mskurs/ENG/DOC/pp07.pdf
- 42.Bhandarkar SM, Arabnia HR (1995) The REFINE multiprocessor—theoretical properties and algorithms. Parallel Comput 21(11):1783–1805
- 43.Arabnia HR, Smith JW (1993) A reconfigurable interconnection network for imaging operations and its implementation using a multi-stage switching box. pp 349–357
- 44.Wani MA, Arabnia HR (2003) Parallel edge-region-based segmentation algorithm targeted at reconfigurable MultiRing network. J Supercomput 25(1):43–62
- 45.Arabnia HR (1990) A parallel algorithm for the arbitrary rotation of digitized images using process-and-data-decomposition approach. J Parallel Distrib Comput 10(2):188–192
- 46.Arabnia HR, Oliver MA (1989) A transputer network for fast operations on digitised images. Comput Graph Forum 8(1):3–11
- 47.Bhandarkar SM, Arabnia HR (1995) The Hough transform on a reconfigurable multi-ring network. J Parallel Distrib Comput 24(1):107–114
- 48.Arabnia HR, Oliver MA (1987) A transputer network for the arbitrary rotation of digitised images. Comput J 30(5):425–432
- 49.Arabnia HR, Bhandarkar SM (1996) Parallel stereocorrelation on a reconfigurable multi-ring network. J Supercomput 10(3):243–269
- 50.Arabnia HR, Oliver MA (1987) Arbitrary rotation of raster images with SIMD machine architectures. Comput Graph Forum 6(1):3–11
- 51.Bhandarkar SM, Arabnia HR, Smith JW (1995) A reconfigurable architecture for image processing and computer vision. Int J Pattern Recognit Artif Intell 9(2):201–229
- 52.Arabnia H (1995) A distributed stereocorrelation algorithm. In: Proceedings of the 4th international conference on computer communications and networks, pp 479–482. IEEE
- 53.Intel core 2 duo processor E6550. http://ark.intel.com/Product.aspx?id=30783
- 54.Intel core 2 duo processor T6600. http://ark.intel.com/products/37255/Intel-Core2-Duo-Processor-T6600
- 55.Intel i7-2600K Processor. http://ark.intel.com/products/52214