# Optimal Data Partitioning Shape for Matrix Multiplication on Three Fully Connected Heterogeneous Processors

## Abstract

Parallel Matrix Matrix Multiplication (MMM) is used in scientific codes across many disciplines. While it has been widely studied how to optimally divide MMM among homogeneous compute nodes, the optimal solution for heterogeneous systems remains an open problem. Dividing MMM across multiple processors or clusters requires consideration of the performance characteristics of both the computation and the communication subsystems. The degree to which each of these affects execution time depends on the system and the algorithm used to divide, communicate, and compute the MMM data. Our previous work has determined that the optimum shape must be, for all ratios of processing power, communication bandwidth and matrix size, one of six well-defined shapes for each of the five MMM algorithms studied. This paper further reduces the number of potentially optimal candidate shapes to three defined shapes known as Square Corner, Square Rectangle, and Block Rectangle. We then find the optimum shape for each algorithm, for all ratios of computational power among processors, all ratios of overall computational power to communication speed, and all problem sizes. The Block Rectangle, a traditional 2D rectangular partition shape, is predictably optimal when using relatively homogeneous processors, and is also optimal for heterogeneous systems with a fast, a medium and a slow processor. However, the Square Corner shape is optimal for heterogeneous environments with one powerful processor and two slower processors, and the Square Rectangle is optimal for heterogeneous environments composed of two fast processors and a single less powerful processor. These theoretical results are confirmed using a series of experiments conducted on Grid’5000, which show both that the predicted optimum shape is indeed optimal, and that the remaining two partition shapes perform in their predicted order.

