
Performance evaluation and analysis of sparse matrix and graph kernels on heterogeneous processors

  • Regular Paper
  • Published in: CCF Transactions on High Performance Computing

Abstract

Heterogeneous processors integrate very distinct compute resources, such as CPUs and GPUs, into the same chip, and can therefore exploit the strengths of each compute unit while avoiding its weaknesses. In this work, we evaluate and analyze eight sparse matrix and graph kernels on an AMD CPU–GPU heterogeneous processor using 956 sparse matrices. Our evaluation focuses on five characteristics: load balancing, indirect addressing, memory reallocation, atomic operations, and dynamic behavior. The experimental results show that, although the CPU and GPU parts access the same DRAM, they exhibit very different performance behaviors. For example, although the GPU part outperforms the CPU part in general, it does not deliver the best performance in all cases; in some cases the CPU part does. Moreover, the bandwidth utilization of atomic operations on heterogeneous processors can be much higher than on a high-end discrete GPU.

(Figures 1–10 appear in the full article.)


Notes

  1. Since a Linux GPU driver for this integrated GPU is not yet officially available, all benchmarks were run on Microsoft Windows.


Acknowledgements

This work has been partly supported by the National Natural Science Foundation of China (Grant nos. 61732014, 61802412, 61671151), Beijing Natural Science Foundation (no. 4172031), and SenseTime Young Scholars Research Fund.


Corresponding author

Correspondence to Weifeng Liu.


Cite this article

Zhang, F., Liu, W., Feng, N. et al. Performance evaluation and analysis of sparse matrix and graph kernels on heterogeneous processors. CCF Trans. HPC 1, 131–143 (2019). https://doi.org/10.1007/s42514-019-00008-6
