Hybrid 2D/1D Blocking as Optimal Matrix-Matrix Multiplication
Multiplying huge matrices generates more cache misses than multiplying smaller ones. Decomposing the matrices into 2D blocks that fit in the L1 CPU cache decreases cache misses, since the operations then access only data stored in the L1 cache. However, it also requires additional reads, writes, and operations compared to 1D partitioning, since the blocks are read multiple times.
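As an illustration of the 2D block decomposition described above, the following is a minimal sketch in Python, not the implementation evaluated in the paper. The function name `blocked_matmul` and the parameter `block` are hypothetical, and plain Python lists do not model real cache behavior; the sketch shows only the loop structure.

```python
def blocked_matmul(a, b, block):
    """Multiply two square matrices (lists of lists) using square 2D blocks.

    `block` is the block edge length; in a real implementation it would be
    chosen so that the working blocks fit in the L1 cache (assumption).
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):              # block row of C
        for jj in range(0, n, block):          # block column of C
            for kk in range(0, n, block):      # matching blocks of A and B
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        # c[i][j] is re-read and re-written once per kk block,
                        # which is the extra traffic that 2D blocking introduces.
                        acc = c[i][j]
                        for k in range(kk, min(kk + block, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

Each element of the result is updated once per block along the inner dimension, so smaller blocks mean more repeated reads and writes of the same data.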
In this paper we propose a new hybrid 2D/1D partitioning that exploits the advantages of both approaches. The idea is first to partition the matrices into 2D blocks and then to multiply each block using 1D partitioning to minimize cache misses. As in 2D block decomposition, we select a block size that fits in the L1 cache, but we use rectangular instead of square blocks in order to reduce the number of operations and the impact of cache associativity. The experiments show that the proposed algorithm outperforms the 2D blocking algorithm for huge matrices on an AMD Phenom CPU.
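One possible reading of the hybrid scheme can be sketched as follows. This is an assumption-laden illustration in Python, not the authors' algorithm: the names `hybrid_matmul` and `multiply_block_1d` and the block dimensions `bm`, `bk`, `bn` are hypothetical, and "1D partitioning" is interpreted here as processing each block one row strip at a time.

```python
def multiply_block_1d(a, b, c, i0, i1, k0, k1, j0, j1):
    """Accumulate A[i0:i1, k0:k1] * B[k0:k1, j0:j1] into C, row strip by row strip."""
    for i in range(i0, i1):              # each row of the A block is one 1D strip
        a_row, c_row = a[i], c[i]
        for k in range(k0, k1):
            a_ik = a_row[k]
            b_row = b[k]
            for j in range(j0, j1):      # traverse B and C in row order
                c_row[j] += a_ik * b_row[j]


def hybrid_matmul(a, b, bm, bk, bn):
    """Multiply an n x m matrix by an m x p matrix using rectangular blocks.

    Blocks of A are bm x bk and blocks of B are bk x bn; the three sizes are
    independent, so the blocks need not be square (hypothetical parameters).
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, bm):           # outer level: 2D partitioning
        for k0 in range(0, m, bk):
            for j0 in range(0, p, bn):
                multiply_block_1d(       # inner level: 1D partitioning
                    a, b, c,
                    i0, min(i0 + bm, n),
                    k0, min(k0 + bk, m),
                    j0, min(j0 + bn, p),
                )
    return c
```

For example, `hybrid_matmul(a, b, 2, 64, 32)` would use 2 x 64 blocks of `a` and 64 x 32 blocks of `b`; suitable values depend on the L1 cache size and associativity of the target CPU, which this sketch does not attempt to determine.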
Keywords: CPU Cache Multiprocessor Matrix Partitioning