Accelerating multi-way joins on the GPU

Lai, Zhuohang; Sun, Xibo; Luo, Qiong; Xie, Xiaolong

doi:10.1007/s00778-021-00708-y

Regular Paper
Published: 02 November 2021

Volume 31, pages 529–553, (2022)
The VLDB Journal

Zhuohang Lai¹ (ORCID: orcid.org/0000-0002-6549-2427), Xibo Sun¹, Qiong Luo¹ & Xiaolong Xie²

Abstract

Graphics processing units (GPUs) have been employed as hardware accelerators for online analytics. However, multi-way joins, which are common in analytic workloads, are inefficient on GPUs. Therefore, we propose to accelerate two representative multi-way join algorithms on the GPU: a multi-way hash join (MHJ) and the worst-case optimal Leapfrog Triejoin (LFTJ). Specifically, we design a warp-based parallelization strategy to reduce thread divergence and to facilitate coalesced memory access in parallel searches in a table. We further enhance our implementations with a set of GPU-friendly optimizations, including dynamic workload sharing among threads and elimination of the result counting phase. Additionally, we enable out-of-core multi-way joins with software pipelining. Our experiments show that our optimized MHJ and LFTJ outperform the state-of-the-art GPU algorithms by a factor of up to 67 on an NVIDIA V100 GPU.
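
As a rough illustration of the search pattern behind LFTJ, the sketch below intersects several sorted, duplicate-free key lists by "leapfrogging": the iterator holding the smallest key repeatedly seeks to the current largest key until all iterators agree. This is a sequential CPU sketch, not the GPU implementation described in the abstract. The function name `leapfrog_intersect` is hypothetical, and the binary search (`bisect_left`) only stands in for the search step that a warp-based design would carry out in parallel.

```python
from bisect import bisect_left


def leapfrog_intersect(lists):
    """Intersect sorted, duplicate-free lists by leapfrogging (sequential sketch)."""
    if not lists or any(len(keys) == 0 for keys in lists):
        return []
    k = len(lists)
    pos = [0] * k  # current position of each iterator
    # Visit iterators in ascending order of their first key, so that the
    # iterator at index p always holds the smallest current key.
    order = sorted(range(k), key=lambda i: lists[i][0])
    p = 0
    max_key = lists[order[-1]][0]  # largest current key among all iterators
    out = []
    while True:
        i = order[p]
        key = lists[i][pos[i]]
        if key == max_key:
            # Smallest key equals largest key: every iterator sits on it.
            out.append(key)
            pos[i] += 1  # "next": move past the matched key
        else:
            # "seek": jump to the first key >= max_key in this list.
            pos[i] = bisect_left(lists[i], max_key, pos[i])
        if pos[i] >= len(lists[i]):
            return out  # one list is exhausted, no further matches exist
        max_key = lists[i][pos[i]]
        p = (p + 1) % k


if __name__ == "__main__":
    a = [1, 3, 4, 6, 7, 9]
    b = [0, 2, 4, 6, 8, 9, 11]
    c = [2, 4, 5, 6, 9, 10]
    print(leapfrog_intersect([a, b, c]))
```

Each seek skips every key that cannot match. In a full Triejoin, an intersection of this kind would be applied once per join attribute over trie-organized relations.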

Notes

To be consistent with AMHJ, we use join order in ALFTJ to refer to the attribute order.
The Profiler cannot profile the data prefetch from CPU to GPU because of a bug in the NVIDIA driver that accompanies CUDA 10.2. Therefore, we invoke a dummy kernel in Stream 16 right before the prefetch operation to mark its start position in the timeline.

Acknowledgements

We thank the reviewers for their insightful suggestions. This work was supported by Alibaba Group through the Alibaba Innovative Research (AIR) Program and Grant 16209821 from the Hong Kong Research Grants Council.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Kowloon, Hong Kong SAR, China
Zhuohang Lai, Xibo Sun & Qiong Luo
Alibaba Inc., Hangzhou, China
Xiaolong Xie

Corresponding author

Correspondence to Zhuohang Lai.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lai, Z., Sun, X., Luo, Q. et al. Accelerating multi-way joins on the GPU. The VLDB Journal 31, 529–553 (2022). https://doi.org/10.1007/s00778-021-00708-y

Received: 16 August 2020
Revised: 19 September 2021
Accepted: 02 October 2021
Published: 02 November 2021
Issue Date: May 2022
DOI: https://doi.org/10.1007/s00778-021-00708-y
