
D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

In analytical queries, many important operators such as JOIN and GROUP BY are well suited to parallelization, and the GPU is an ideal accelerator given its parallel computing power. However, when the data size grows to hundreds of gigabytes, a single GPU card becomes insufficient because of its limited global memory capacity and the slow data transfer between host and device. A straightforward solution is to add more GPUs linked by high-bandwidth interconnects, but this greatly increases the cost. We use the unified memory (UM) provided by NVIDIA CUDA (Compute Unified Device Architecture) to accelerate large-scale queries on just one GPU, but we observe that the transfer between host and UM, which happens before kernel execution, is often significantly slower than the theoretical bandwidth. An important reason is that, in a single-GPU environment, data processing systems usually invoke only one thread, or a statically chosen number of threads, for the data copy, leading to an inefficient transfer that heavily slows down overall performance. In this paper, we present D-Cubicle, a runtime module that accelerates data transfer between host-managed memory and unified memory. D-Cubicle boosts the actual transfer speed dynamically through a self-adaptive approach. In our experiments, with data transfer taken into account, D-Cubicle processes 200 GB of data on a single GPU with 32 GB of global memory, achieving 1.43x the performance of the baseline system on average and up to 2.09x at best.
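The abstract describes copying data from host-managed memory into CUDA unified memory with a number of CPU threads that is chosen dynamically rather than fixed. The sketch below only illustrates that general idea; it is not the authors' D-Cubicle implementation, and the helper names (parallel_copy, pick_thread_count), the candidate thread counts, and the one-shot probing policy are assumptions made for illustration.

```cpp
// Minimal sketch (not D-Cubicle itself): copy a host buffer into CUDA unified
// memory with several CPU threads, picking the thread count by timing a probe.
#include <cuda_runtime.h>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>

// Copy `bytes` from src to dst using `nthreads` CPU threads, one contiguous slice each.
static void parallel_copy(char* dst, const char* src, size_t bytes, unsigned nthreads) {
    std::vector<std::thread> workers;
    size_t chunk = (bytes + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        size_t off = static_cast<size_t>(t) * chunk;
        if (off >= bytes) break;
        size_t len = std::min(chunk, bytes - off);
        workers.emplace_back([=] { std::memcpy(dst + off, src + off, len); });
    }
    for (auto& w : workers) w.join();
}

// Time a few candidate thread counts on a small probe slice and keep the fastest.
// A truly self-adaptive policy would keep adjusting at runtime; this is a one-shot probe.
static unsigned pick_thread_count(char* dst, const char* src, size_t probe_bytes) {
    unsigned best = 1;
    double best_gbps = 0.0;
    for (unsigned n : {1u, 2u, 4u, 8u, 16u}) {
        auto t0 = std::chrono::steady_clock::now();
        parallel_copy(dst, src, probe_bytes, n);
        double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
        double gbps = probe_bytes / secs / 1e9;
        if (gbps > best_gbps) { best_gbps = gbps; best = n; }
    }
    return best;
}

int main() {
    const size_t bytes = size_t(1) << 30;      // 1 GB of demo data
    std::vector<char> host(bytes, 1);          // host-managed source buffer

    char* um = nullptr;
    cudaMallocManaged(&um, bytes);             // unified memory destination

    unsigned n = pick_thread_count(um, host.data(), bytes / 64);
    std::printf("copying with %u threads\n", n);
    parallel_copy(um, host.data(), bytes, n);  // host -> UM, done before any kernel launch

    cudaDeviceSynchronize();                   // would follow kernels that consume `um`
    cudaFree(um);
    return 0;
}
```

Compiled with nvcc and linked against the CUDA runtime, this probes a small slice of the buffer, then copies the rest with the winning thread count; the candidate set and probe size would need tuning for any real system.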



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61732014 and 62141214) and the National Key Research and Development Program of China (2018YFB1003400).

Author information

Corresponding author

Correspondence to Chuliang Weng.

Additional information

Jialun Wang is a PhD candidate in the School of Data Science and Engineering, East China Normal University, China. He received his bachelor’s degree in computer science from Sichuan University, China. His research interests include parallel and distributed systems, CPU-GPU heterogeneous systems, and in-memory data processing.

Wenhao Pang is currently working toward a master’s degree at the School of Data Science and Engineering, East China Normal University, China. He received his bachelor’s degree from Donghua University, China. His research interests include parallel and distributed systems and heterogeneous computing.

Chuliang Weng is a full professor of computer science at East China Normal University (ECNU), China. Before joining ECNU, he worked at Huawei Central Research Institute as a principal researcher and technical director. Before joining Huawei, he was an associate professor in the Department of Computer Science and Engineering at Shanghai Jiao Tong University, China. He was also a visiting research scientist in the Department of Computer Science at Columbia University, USA. His interests center on building fast and secure systems. His research includes parallel and distributed systems, system virtualization and cloud computing, storage systems, operating systems, and system security.

Aoying Zhou is a professor at the School of Data Science and Engineering (DaSE), East China Normal University (ECNU), China. He is a CCF Fellow, Vice President of the Shanghai Computer Society, and Associate Editor-in-Chief of the Chinese Journal of Computers. His research interests include databases, data management, digital transformation, and data-driven applications such as financial technology (FinTech), education technology (EduTech), and logistics technology (LogTech).


About this article


Cite this article

Wang, J., Pang, W., Weng, C. et al. D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems. Front. Comput. Sci. 17, 174610 (2023). https://doi.org/10.1007/s11704-022-2160-z


  • DOI: https://doi.org/10.1007/s11704-022-2160-z
