Accelerating multi-way joins on the GPU

  • Regular Paper
The VLDB Journal

Abstract

Graphics processing units (GPUs) have been employed as hardware accelerators for online analytics. However, multi-way joins, which are common in analytic workloads, are inefficient on GPUs. Therefore, we propose to accelerate two representative multi-way join algorithms on the GPU: a multi-way hash join (MHJ) and the worst-case optimal Leapfrog Triejoin (LFTJ). Specifically, we design a warp-based parallelization strategy that reduces thread divergence and facilitates coalesced memory access during parallel searches in a table. We further enhance our implementations with a set of GPU-friendly optimizations, including dynamic workload sharing among threads and elimination of the result counting phase. Additionally, we enable out-of-core multi-way joins with software pipelining. Our experiments show that our optimized MHJ and LFTJ outperform the state-of-the-art GPU algorithms by a factor of up to 67 on an NVIDIA V100 GPU.
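
To make the leapfrog idea concrete, the following is a minimal sequential sketch in Python of the intersection step that underlies Leapfrog Triejoin: k sorted, duplicate-free key lists are intersected by repeatedly seeking the iterator with the smallest key forward to the current largest key. It is an illustration only, not the warp-based GPU implementation described in the abstract; the function name `leapfrog_intersect` and the use of `bisect_left` as the seek primitive are assumptions made for this example.

```python
from bisect import bisect_left


def leapfrog_intersect(lists):
    """Intersect k sorted, duplicate-free lists by leapfrogging.

    Hypothetical helper for illustration; sequential, CPU-only.
    """
    if not lists or any(len(keys) == 0 for keys in lists):
        return []

    k = len(lists)
    pos = [0] * k  # current cursor into each list
    # Visit iterators in increasing order of their first key, so the
    # iterator just before position p always holds the largest key.
    order = sorted(range(k), key=lambda i: lists[i][0])
    max_key = lists[order[-1]][0]
    p = 0
    out = []

    while True:
        i = order[p]
        key = lists[i][pos[i]]
        if key == max_key:
            # Smallest key equals largest key: all k iterators agree.
            out.append(key)
            pos[i] += 1
        else:
            # Seek: jump to the first key >= max_key (binary search).
            pos[i] = bisect_left(lists[i], max_key, pos[i])
        if pos[i] >= len(lists[i]):
            return out  # one list is exhausted, no further matches
        max_key = lists[i][pos[i]]
        p = (p + 1) % k


if __name__ == "__main__":
    r = [1, 3, 4, 7, 9]
    s = [2, 3, 7, 8, 9]
    t = [0, 3, 5, 7, 9, 11]
    print(leapfrog_intersect([r, s, t]))
```

Each seek here is a search in a sorted array, which is presumably the kind of table search that the warp-based strategy in the abstract is designed to parallelize.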

Notes

  1. To be consistent with AMHJ, we use join order in ALFTJ to refer to the attribute order.

  2. The profiler cannot profile the data prefetch from CPU to GPU because of a bug in the NVIDIA driver used with CUDA 10.2. Therefore, we invoke a dummy kernel in Stream 16 right before the prefetch operation to mark its start position in the timeline.

Acknowledgements

We thank the reviewers for their insightful suggestions. This work was supported by Alibaba Group through the Alibaba Innovative Research (AIR) Program and Grant 16209821 from the Hong Kong Research Grants Council.

Author information

Corresponding author

Correspondence to Zhuohang Lai.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lai, Z., Sun, X., Luo, Q. et al. Accelerating multi-way joins on the GPU. The VLDB Journal 31, 529–553 (2022). https://doi.org/10.1007/s00778-021-00708-y

  • DOI: https://doi.org/10.1007/s00778-021-00708-y
