Accelerated bulk memory operations on heterogeneous multi-core systems

Lee, JongHyuk; Shi, Weidong; Gil, JoonMin

doi:10.1007/s11227-018-2589-x


Published: 08 September 2018

Volume 74, pages 6898–6922, (2018)

The Journal of Supercomputing



Abstract

Over the past few years, the traditional fixed-function graphics accelerator has evolved into a programmable, general-purpose graphics processing unit, enabling general-purpose computing on GPUs (GPGPU). A more recent step in this direction is the integrated GPU, in which the CPU and GPU share the same package or even the same die. On such a system-on-chip, however, the GPU occupies considerable silicon area, yet it is likely to contribute nothing to overall system performance when the workload is neither graphical nor a GPGPU application. This paper presents a novel approach that uses an integrated GPU to accelerate conventional operations normally performed on the CPU, namely bulk memory operations such as memcpy and memcmp. Offloading bulk memory operations to the GPU has several benefits: (i) the throughput-oriented GPU outperforms the CPU in bulk memory operations; (ii) for on-die GPUs with a unified cache between the GPU and the CPU, the CPU can use the GPU private cache to store the moved data, reducing the CPU cache bottleneck; (iii) additional lightweight hardware can support asynchronous offloads; and (iv) unlike prior art that relies on a dedicated hardware copy engine (e.g., DMA), our approach uses as much of the GPU hardware as possible. On micro-benchmarks, offloaded bulk memory operations ran up to 4.3 times faster than on the CPU while using fewer resources. In a cycle-based full-system simulation of eight real-world applications, five showed a speedup of about 30% and two showed a speedup of about 20%.
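
The asynchronous offload mentioned in benefit (iii) can be pictured with a small software analogy. The sketch below is not the authors' mechanism, which is implemented in hardware and evaluated in a simulator; it only illustrates the calling pattern of handing a memcpy-like or memcmp-like operation to another engine and collecting the result later. A single worker thread stands in for the integrated GPU, and the names `bulk_copy_async`, `bulk_compare_async`, and `_offload_engine` are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Assumption: one worker thread plays the role of the integrated GPU.
# The paper offloads to real GPU hardware; this is only a software analogy.
_offload_engine = ThreadPoolExecutor(max_workers=1)


def _copy(dst: bytearray, src: bytes, n: int) -> int:
    """memcpy-like: copy the first n bytes of src into dst and return n."""
    # memoryview slice assignment raises ValueError if the two slices differ
    # in length, so a short source can never silently resize dst.
    memoryview(dst)[:n] = memoryview(src)[:n]
    return n


def _compare(a: bytes, b: bytes, n: int) -> int:
    """memcmp-like: negative, zero, or positive for the first n bytes."""
    x, y = bytes(a[:n]), bytes(b[:n])
    return (x > y) - (x < y)


def bulk_copy_async(dst: bytearray, src: bytes, n: int) -> Future:
    """Hand the copy to the offload engine and return immediately."""
    return _offload_engine.submit(_copy, dst, src, n)


def bulk_compare_async(a: bytes, b: bytes, n: int) -> Future:
    """Hand the comparison to the offload engine and return immediately."""
    return _offload_engine.submit(_compare, a, b, n)


if __name__ == "__main__":
    src = bytes(range(256)) * 4096      # 1 MiB of sample data
    dst = bytearray(len(src))

    pending = bulk_copy_async(dst, src, len(src))
    # The caller is free to do unrelated work here while the copy proceeds.
    copied = pending.result()           # block only when the data is needed

    same = bulk_compare_async(dst, src, len(src)).result()
    print(f"copied {copied} bytes, compare result {same}")
```

The caller blocks only at `result()`, which is the point of an asynchronous offload: the issuing core does not have to wait for the bulk operation before continuing.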


References

Lee J, Liu Z, Tian X, Woo DH, Shi W, Boumber D, Yan Y, Kwon KA (2012) Acceleration of bulk memory operations in a heterogeneous multicore architecture. In: Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. ACM, pp 423–424
The 50th TOP500 list (2017). https://www.top500.org/lists/2017/11/. Accessed 4 Sept 2018
Benziane SH, Benyettou A (2017) Dorsal hand vein identification based on binary particle swarm optimization. J Inf Process Syst 13(2):268–283
Finogeev AG, Parygin DS, Finogeev AA (2017) The convergence computing model for big sensor data mining and knowledge discovery. Hum Centric Comput Inf Sci 7(1):11
Ghadekar PP, Chopade NB (2016) Content based dynamic texture analysis and synthesis based on SPIHT with GPU. J Inf Process Syst 12(1):46–56
Koo KM, Cha EY (2017) Image recognition performance enhancements using image normalization. Hum Centric Comput Inf Sci 7(1):33
Mohd-Hilmi MN, Al-Laila MH, Malim H, Ahamed NH (2016) Accelerating group fusion for ligand-based virtual screening on multi-core and many-core platforms. J Inf Process Syst 12(4):724–740
Hao F, Min G, Pei Z, Park DS, Yang LT (2017) \( k \)-clique community detection in social networks based on formal concept analysis. IEEE Syst J 11(1):250–259
Hao F, Pei Z, Park DS, Yang LT, Jeong YS, Park JH (2017) Iceberg clique queries in large graphs. Neurocomputing 256:101–110
Song W, Liu L, Tian Y, Sun G, Fong S, Cho K (2017) A 3D localisation method in indoor environments for virtual reality applications. Hum Centric Comput Inf Sci 7(1):39
Memcached: a distributed memory object caching system (2015). http://www.memcached.org/. Accessed 4 Sept 2018
Fung W, Sham I, Yuan G, Aamodt T (2007) Dynamic warp formation and scheduling for efficient GPU control flow. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, pp 407–420
Intel streaming SIMD extensions technology (2017). https://www.intel.com/content/www/us/en/support/articles/000005779/processors.html. Accessed 4 Sept 2018
Nvidia CUDA (2007). https://developer.nvidia.com/cuda-zone. Accessed 4 Sept 2018
Intel advanced vector extensions 512 (AVX-512) (2015). https://www.intel.com/content/www/us/en/architecture-andtechnology/avx-512-overview.html. Accessed 4 Sept 2018
Gschwind M (2006) Chip multiprocessing and the cell broadband engine. In: Proceedings of the 3rd Conference on Computing Frontiers, CF ’06. ACM, New York, NY, USA, pp 1–8
Jiang X, Solihin Y, Zhao L, Iyer R (2009) Architecture support for improving bulk memory copying and initialization performance. In: Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, Washington, DC, USA, pp 169–180
Seshadri V, Mutlu O (2017) Simple operations in memory to reduce data movement. In: Hurson AR, Milutinovic V (eds) Advances in computers, vol 106. Elsevier, New York, pp 107–166
Zhao L, Bhuyan LN, Iyer R, Makineni S, Newell D (2007) Hardware support for accelerating data movement in server platform. IEEE Trans Comput 56:740–753
Woo DH, Lee HHS (2010) Compass: a programmable data prefetcher using idle GPU shaders. In: Hoe JC, Adve VS (eds) ASPLOS. ACM, New York, pp 297–310
Abts D, Bataineh A, Scott S, Faanes G, Schwarzmeier J, Lundberg E, Johnson T, Bye M, Schwoerer G (2007) The Cray BlackWidow: a highly scalable vector multiprocessor. In: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC ’07. ACM, New York, NY, USA, pp 17:1–17:12
Ahn J, Hong S, Yoo S, Mutlu O, Choi K (2016) A scalable processing-in-memory accelerator for parallel graph processing. ACM SIGARCH Comput Archit News 43(3):105–117
Hsieh K, Ebrahimi E, Kim G, Chatterjee N, O’Connor M, Vijaykumar N, Mutlu O, Keckler SW (2016) Transparent offloading and mapping (TOM): enabling programmer-transparent near-data processing in GPU systems. ACM SIGARCH Comput Archit News 44(3):204–216
Pattnaik A, Tang X, Jog A, Kayiran O, Mishra AK, Kandemir MT, Mutlu O, Das CR (2016) Scheduling techniques for GPU architectures with processing-in-memory capabilities. In: Proceedings of the 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT). IEEE, pp 31–44
Seshadri V, Lee D, Mullins T, Hassan H, Boroumand A, Kim J, Kozuch MA, Mutlu O, Gibbons PB, Mowry TC (2017) Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology. In: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, pp 273–287
Vaidyanathan K, Chai L, Huang W, Panda DK (2007) Efficient asynchronous memory copy operations on multi-core systems and I/OAT. In: Proceedings of the 2007 IEEE International Conference on Cluster Computing, CLUSTER ’07. IEEE Computer Society, Washington, DC, USA, pp 159–168
Kernighan BW, Ritchie DM (1988) The C programming language. Prentice-Hall, Upper Saddle River
7th generation Intel core and Celeron desktop processor families with Intel H110 and Intel Q170 chipsets: platform brief (2017). https://www.intel.com/content/dam/www/public/us/en/documents/platformbriefs/7th-generation-core-processor-deskop-iot-platform-brief.pdf. Accessed 4 Sept 2018
Magnusson P, Christensson M, Eskilson J, Forsgren D, Hallberg G, Hogberg J, Larsson F, Moestedt A, Werner B (2002) Simics: a full system simulation platform. Computer 35(2):50–58
Neelakantam N, Blundell C, Devietti J, Martin MM, Zilles C (2008) FeS2: a full-system execution-driven simulator for x86. In: Proceedings of the Architectural Support for Programming Languages and Operating Systems. ASPLOS 2008
Martin MMK, Sorin DJ, Beckmann BM, Marty MR, Xu M, Alameldeen AR, Moore KE, Hill MD, Wood DA (2005) Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput Archit News 33:92–99
Yourst MT (2007) PTLsim: a cycle accurate full system x86-64 microarchitectural simulator. In: IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS 2007
Meng J, Skadron K (2009) Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling. In: Proceedings of the 2009 IEEE International Conference on Computer Design, ICCD’09. IEEE Press, Piscataway, NJ, USA, pp 282–288
Blackburn SM, Garner R, Hoffmann C, Khang AM, McKinley KS, Bentzur R, Diwan A, Feinberg D, Frampton D, Guyer SZ, Hirzel M, Hosking A, Jump M, Lee H, Moss JEB, Phansalkar A, Stefanović D, VanDrunen T, von Dincklage D, Wiedermann B (2006) The DaCapo benchmarks: Java benchmarking development and analysis. SIGPLAN Not 41:169–190
DaCapo benchmark suite. http://dacapobench.org/. Accessed 4 Sept 2018
Pybench. http://svn.python.org/. Accessed 4 Sept 2018
ClamAV open source antivirus engine. http://www.clamav.net/. Accessed 4 Sept 2018
Koziol J (2003) Intrusion detection with Snort, 1st edn. Sams, Indianapolis
Gzip. http://www.gzip.org/. Accessed 4 Sept 2018
Sphinx text search server. http://sphinxsearch.com/. Accessed 4 Sept 2018
ClamAV test files. https://packages.ubuntu.com/xenial-updates/utils/clamav-testfiles. Accessed 4 Sept 2018
MIT Lincoln Laboratory 1998/1999 DARPA off-line intrusion detection (1999). https://www.ll.mit.edu/rd/datasets. Accessed 4 Sept 2018
TREC-9 filtering track collections (2007). http://trec.nist.gov/data/t9_filtering.html. Accessed 4 Sept 2018
Large text compression benchmark (2009). http://cs.fit.edu/~mmahoney/compression/text.html. Accessed 4 Sept 2018


Acknowledgements

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2016R1D1A3B03933370).

Author information

Authors and Affiliations

Department of Big Data Engineering, Daegu Catholic University, Gyeongsan-si, Republic of Korea
JongHyuk Lee
Department of Computer Science, University of Houston, Houston, USA
Weidong Shi
School of Information Technology Engineering, Daegu Catholic University, Gyeongsan-si, Republic of Korea
JoonMin Gil


Corresponding author

Correspondence to JoonMin Gil.

Additional information

This paper is an extended version of a conference paper that appeared as [1].


About this article

Cite this article

Lee, J., Shi, W. & Gil, J. Accelerated bulk memory operations on heterogeneous multi-core systems. J Supercomput 74, 6898–6922 (2018). https://doi.org/10.1007/s11227-018-2589-x


Published: 08 September 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s11227-018-2589-x
