Abstract
Graphics processing units (GPUs) have been employed as hardware accelerators for online analytics. However, multi-way joins, which are common in analytic workloads, are inefficient on GPUs. Therefore, we propose to accelerate two representative multi-way join algorithms on the GPU: a multi-way hash join (MHJ) and the worst-case optimal Leapfrog Triejoin (LFTJ). Specifically, we design a warp-based parallelization strategy that reduces thread divergence and facilitates coalesced memory access during parallel searches in a table. We further enhance our implementations with a set of GPU-friendly optimizations, including dynamic workload sharing among threads and elimination of the result counting phase. Additionally, we enable out-of-core multi-way joins with software pipelining. Our experiments show that our optimized MHJ and LFTJ outperform the state-of-the-art GPU algorithms by a factor of up to 67 on an NVIDIA V100 GPU.
Notes
To be consistent with AMHJ, we use join order in ALFTJ to refer to the attribute order.
The Profiler cannot profile the data prefetch from CPU to GPU due to a bug in the NVIDIA driver used with CUDA 10.2. Therefore, we invoke a dummy kernel in Stream 16 right before the prefetch operation to identify its start position in the timeline.
Acknowledgements
We thank the reviewers for their insightful suggestions. This work was supported by Alibaba Group through the Alibaba Innovative Research (AIR) Program and Grant 16209821 from the Hong Kong Research Grants Council.
Cite this article
Lai, Z., Sun, X., Luo, Q. et al. Accelerating multi-way joins on the GPU. The VLDB Journal 31, 529–553 (2022). https://doi.org/10.1007/s00778-021-00708-y
DOI: https://doi.org/10.1007/s00778-021-00708-y