Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors

Abstract

Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications on clusters of commodity systems-on-chip. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric-static and dynamic scheduling strategies that carefully tune and distribute the operation’s micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain significant performance gains over their architecture-oblivious counterparts, while exploiting all the resources of the AMP to deliver considerable energy efficiency.
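
The dynamic scheduling strategy mentioned above can be pictured with a short C sketch. The fragment below is a minimal illustration, not the paper’s code: all names (MR, NR, micro_kernel, macro_kernel_dynamic), the packed-buffer layout, and the column-major storage of C are assumptions, and edge cases are omitted (mc and nc are taken to be exact multiples of MR and NR).

    /* Hypothetical sketch (C11): dynamic self-scheduling of BLIS-style
       micro-kernel iterations across big and LITTLE worker threads. */
    #include <stdatomic.h>
    #include <stddef.h>

    #define MR 4   /* micro-tile rows    (illustrative value) */
    #define NR 4   /* micro-tile columns (illustrative value) */

    /* Plain-C stand-in for an architecture-specific micro-kernel:
       C[0:MR, 0:NR] += A-micro-panel * B-micro-panel, as a sequence
       of kc rank-1 updates on packed operands. */
    static void micro_kernel(size_t kc, const float *a, const float *b,
                             float *c, size_t ldc)
    {
        for (size_t p = 0; p < kc; ++p)
            for (size_t j = 0; j < NR; ++j)
                for (size_t i = 0; i < MR; ++i)
                    c[j * ldc + i] += a[p * MR + i] * b[p * NR + j];
    }

    /* Every worker thread, whether pinned to a big or a LITTLE core,
       runs this macro-kernel: threads claim micro-tile columns from a
       shared atomic counter (reset to 0 before each call), so faster
       big cores naturally process more iterations than LITTLE ones. */
    void macro_kernel_dynamic(size_t mc, size_t nc, size_t kc,
                              const float *A, const float *B,
                              float *C, size_t ldc,
                              atomic_size_t *next_jr)
    {
        const size_t n_jr = nc / NR;  /* micro-tile columns in this block */
        const size_t n_ir = mc / MR;  /* micro-tile rows    in this block */

        for (;;) {
            size_t jr = atomic_fetch_add(next_jr, 1);  /* claim next column */
            if (jr >= n_jr)
                break;
            for (size_t ir = 0; ir < n_ir; ++ir)
                micro_kernel(kc,
                             A + ir * MR * kc,   /* packed micro-panel of A */
                             B + jr * NR * kc,   /* packed micro-panel of B */
                             &C[jr * NR * ldc + ir * MR],
                             ldc);
        }
    }

An asymmetric-static variant would instead split the range of micro-tile columns a priori between the two clusters in proportion to their relative gemm throughput, trading the shared counter for a tuned partitioning ratio; the cache-aware configuration additionally selects different blocking parameters for each core type according to its cache hierarchy.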


Notes

  1. According to this definition, servers equipped with one (or more) general-purpose multicore processor(s) and a PCIe-attached graphics accelerator, or systems-on-chip like the NVIDIA Tegra TK1, are excluded from this category.

  2. An ARM cluster, in the company’s official terminology, can be viewed as a NUMA island or socket, and should not be confused with a parallel system with distributed memory.
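
To make the cluster notion of Note 2 concrete, the following minimal C sketch restricts a thread to the cores of a single cluster via Linux CPU affinities. It assumes glibc and a core numbering of 0–3 = Cortex-A7 (LITTLE) and 4–7 = Cortex-A15 (big) on the Exynos 5422; the actual numbering depends on the board and kernel, so treat it as an assumption.

    /* Minimal sketch: pin a thread to one big.LITTLE cluster, i.e., to one
       "NUMA island" in the sense of Note 2 above. Linux/glibc assumed. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static int pin_to_cluster(pthread_t t, int first_core, int n_cores)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int c = first_core; c < first_core + n_cores; ++c)
            CPU_SET(c, &set);            /* add each core of the cluster */
        return pthread_setaffinity_np(t, sizeof(set), &set);
    }

    int main(void)
    {
        /* Pin the calling thread to the (assumed) big Cortex-A15 cluster. */
        return pin_to_cluster(pthread_self(), 4, 4);
    }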


Acknowledgments

The researchers from Universitat Jaume I were supported by projects CICYT TIN2011-23283 and TIN2014-53495-R of MINECO and FEDER, the EU project FP7 318793 “EXA2GREEN” and the FPU program of MECD. The researcher from Universidad Complutense de Madrid was supported by project CICYT TIN2012-32180.

Author information

Correspondence to Rafael Rodríguez-Sánchez.

About this article

Cite this article

Catalán, S., Igual, F.D., Mayo, R. et al. Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors. Cluster Comput 19, 1037–1051 (2016). https://doi.org/10.1007/s10586-016-0611-8
