
D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

In analytical queries, many important operators such as JOIN and GROUP BY are well suited to parallelization, and the GPU is an ideal accelerator given its parallel computing power. However, when the data size grows to hundreds of gigabytes, a single GPU card becomes insufficient because of its limited global memory capacity and the slow data transfer between host and device. A straightforward solution is to add more GPUs linked by high-bandwidth interconnects, but this greatly increases the cost. We use the unified memory (UM) provided by NVIDIA CUDA (Compute Unified Device Architecture) to accelerate large-scale queries on just one GPU, but we observe that the transfer between host and UM, which happens before kernel execution, is often significantly slower than the theoretical bandwidth. An important reason is that, in a single-GPU environment, data processing systems usually invoke only one thread, or a statically chosen number of threads, for the data copy, leading to an inefficient transfer that heavily slows down overall performance. In this paper, we present D-Cubicle, a runtime module that accelerates data transfer between host-managed memory and unified memory. D-Cubicle boosts the actual transfer speed dynamically through a self-adaptive approach. In our experiments, with data transfer taken into account, D-Cubicle processes 200 GB of data on a single GPU with 32 GB of global memory, achieving 1.43x the performance of the baseline system on average and up to 2.09x at best.
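The abstract describes copying data from host-managed memory into CUDA unified memory with a number of CPU threads that is chosen dynamically rather than fixed. The sketch below only illustrates that general idea; it is not the authors' D-Cubicle implementation, and the helper names (parallel_copy, pick_thread_count), the candidate thread counts, and the one-shot probing policy are assumptions made for illustration.

```cpp
// Minimal sketch (not D-Cubicle itself): copy a host buffer into CUDA unified
// memory with several CPU threads, picking the thread count by timing a probe.
#include <cuda_runtime.h>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>

// Copy `bytes` from src to dst using `nthreads` CPU threads, one contiguous slice each.
static void parallel_copy(char* dst, const char* src, size_t bytes, unsigned nthreads) {
    std::vector<std::thread> workers;
    size_t chunk = (bytes + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        size_t off = static_cast<size_t>(t) * chunk;
        if (off >= bytes) break;
        size_t len = std::min(chunk, bytes - off);
        workers.emplace_back([=] { std::memcpy(dst + off, src + off, len); });
    }
    for (auto& w : workers) w.join();
}

// Time a few candidate thread counts on a small probe slice and keep the fastest.
// A truly self-adaptive policy would keep adjusting at runtime; this is a one-shot probe.
static unsigned pick_thread_count(char* dst, const char* src, size_t probe_bytes) {
    unsigned best = 1;
    double best_gbps = 0.0;
    for (unsigned n : {1u, 2u, 4u, 8u, 16u}) {
        auto t0 = std::chrono::steady_clock::now();
        parallel_copy(dst, src, probe_bytes, n);
        double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
        double gbps = probe_bytes / secs / 1e9;
        if (gbps > best_gbps) { best_gbps = gbps; best = n; }
    }
    return best;
}

int main() {
    const size_t bytes = size_t(1) << 30;      // 1 GB of demo data
    std::vector<char> host(bytes, 1);          // host-managed source buffer

    char* um = nullptr;
    cudaMallocManaged(&um, bytes);             // unified memory destination

    unsigned n = pick_thread_count(um, host.data(), bytes / 64);
    std::printf("copying with %u threads\n", n);
    parallel_copy(um, host.data(), bytes, n);  // host -> UM, done before any kernel launch

    cudaDeviceSynchronize();                   // would follow kernels that consume `um`
    cudaFree(um);
    return 0;
}
```

Compiled with nvcc and linked against the CUDA runtime, this probes a small slice of the buffer, then copies the rest with the winning thread count; the candidate set and probe size would need tuning for any real system.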



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61732014 and 62141214) and the National Key Research and Development Program of China (2018YFB1003400).

Author information

Corresponding author

Correspondence to Chuliang Weng.

Additional information

Jialun Wang is a PhD candidate in the School of Data Science and Engineering, East China Normal University, China. He received his bachelor’s degree in computer science from Sichuan University, China. His research interests include parallel and distributed systems, CPU-GPU heterogeneous systems, and in-memory data processing.

Wenhao Pang is currently working toward a master’s degree at the School of Data Science and Engineering, East China Normal University, China. He received his bachelor’s degree from Donghua University, China. His research interests include parallel and distributed systems and heterogeneous computing.

Chuliang Weng is a full professor of computer science at East China Normal University (ECNU), China. Before joining ECNU, he worked at Huawei Central Research Institute as a principal researcher and technical director. Before joining Huawei, he was an associate professor in the Department of Computer Science and Engineering at Shanghai Jiao Tong University, China. He was also a visiting research scientist in the Department of Computer Science at Columbia University, USA. His interests center on building fast and secure systems. His research includes parallel and distributed systems, system virtualization and cloud computing, storage systems, operating systems, and system security.

Aoying Zhou is a professor at the School of Data Science and Engineering (DaSE), East China Normal University (ECNU), China. He is a CCF Fellow, Vice President of the Shanghai Computer Society, and Associate Editor-in-Chief of the Chinese Journal of Computers. His research interests include databases, data management, digital transformation, and data-driven applications such as financial technology (FinTech), education technology (EduTech), and logistics technology (LogTech).


About this article


Cite this article

Wang, J., Pang, W., Weng, C. et al. D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems. Front. Comput. Sci. 17, 174610 (2023). https://doi.org/10.1007/s11704-022-2160-z


  • DOI: https://doi.org/10.1007/s11704-022-2160-z
