Fast and scalable inequality joins

  • Special Issue Paper
The VLDB Journal

Abstract

Inequality joins, which join relations on inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research, ranging from efficient join algorithms such as sort-merge join to the use of efficient indices such as the \(B^+\)-tree, \(R^*\)-tree and Bitmap. However, inequality joins have received little attention, and queries containing such joins are notably slow. In this paper, we introduce fast inequality join algorithms based on sorted arrays and space-efficient bit-arrays. We further introduce a simple method to estimate the selectivity of inequality joins, which is then used to optimize multi-predicate queries and multi-way joins. Moreover, we study an incremental inequality join algorithm to handle scenarios where data keeps changing. We have implemented a centralized version of these algorithms on top of PostgreSQL, a distributed version on top of Spark SQL, and a version inside an existing data cleaning system, Nadeef. By comparing our algorithms against well-known optimization techniques for inequality joins, we show that our solution is more scalable and several orders of magnitude faster.
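
To make the idea of joining on inequality conditions with sorted arrays and a bit-array more concrete, the following is a minimal illustrative sketch in Python. It is not the algorithm from the paper; it is a simplified self-join written under stated assumptions: tuples are hypothetical `(x, y)` pairs, the join condition is `r.x < s.x AND r.y > s.y`, and a `bytearray` stands in for a true bit-array. The function names are invented for this example. A naive nested-loop join is included as a baseline for comparison.

```python
from bisect import bisect_left
from itertools import groupby


def naive_inequality_self_join(rows):
    """Baseline: test every ordered pair (i, j), O(n^2) comparisons."""
    return sorted(
        (i, j)
        for i, (xi, yi) in enumerate(rows)
        for j, (xj, yj) in enumerate(rows)
        if xi < xj and yi > yj
    )


def bitarray_inequality_self_join(rows):
    """Return all (i, j) with rows[i].x < rows[j].x and rows[i].y > rows[j].y.

    Illustrative sketch only: one array sorted on x, one visiting order
    sorted on y, and a flag array indexed by x-rank.
    """
    n = len(rows)

    # Tuple ids sorted by x, plus the sorted x values for binary search.
    by_x = sorted(range(n), key=lambda t: rows[t][0])
    xs = [rows[t][0] for t in by_x]

    # rank_x[t] = position of tuple t in the x-sorted array.
    rank_x = [0] * n
    for pos, t in enumerate(by_x):
        rank_x[t] = pos

    # Visit tuples from largest y to smallest y.
    by_y_desc = sorted(range(n), key=lambda t: rows[t][1], reverse=True)

    # seen[pos] == 1 means the tuple at x-rank pos has a strictly larger y
    # than the tuple being probed. (One byte per flag, for simplicity.)
    seen = bytearray(n)
    out = []

    # Tuples with equal y are probed together before any of them is marked,
    # so the y comparison stays strict.
    for _, group in groupby(by_y_desc, key=lambda t: rows[t][1]):
        group = list(group)
        for j in group:
            # Positions [0, limit) hold x values strictly smaller than x_j.
            limit = bisect_left(xs, rows[j][0])
            for pos in range(limit):
                if seen[pos]:
                    out.append((by_x[pos], j))
        for j in group:
            seen[rank_x[j]] = 1

    return sorted(out)


rows = [(100, 6), (140, 11), (80, 10), (90, 5)]  # hypothetical (x, y) tuples
print(bitarray_inequality_self_join(rows))  # → [(2, 0), (2, 3)]
```

In this sketch, sorting replaces pairwise comparison on `y`, and the flag scan over a prefix of the x-sorted array replaces pairwise comparison on `x`. The scan is still linear in the prefix length per tuple, so the sketch only illustrates the data layout and does not reproduce the performance reported in the abstract.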

Notes

  1. We motivate the problem with a real-life story. Names and queries have been changed for confidentiality reasons.

  2. We experimentally show in Sect. 8.7 that our algorithm requires only 1% of the input data to be accurate.

  3. http://github.com/daqcri/NADEEF.

  4. http://www.boost.org/

Acknowledgments

Portions of the research in this paper used the MDC Database made available by Idiap Research Institute, Switzerland and owned by Nokia.

Author information

Corresponding author

Correspondence to Zuhair Khayyat.

About this article

Cite this article

Khayyat, Z., Lucia, W., Singh, M. et al. Fast and scalable inequality joins. The VLDB Journal 26, 125–150 (2017). https://doi.org/10.1007/s00778-016-0441-6

  • DOI: https://doi.org/10.1007/s00778-016-0441-6
