The VLDB Journal, Volume 26, Issue 1, pp 125–150

Fast and scalable inequality joins

  • Zuhair Khayyat
  • William Lucia
  • Meghna Singh
  • Mourad Ouzzani
  • Paolo Papotti
  • Jorge-Arnulfo Quiané-Ruiz
  • Nan Tang
  • Panos Kalnis
Special Issue Paper

Abstract

Inequality joins, which join relations using inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research, ranging from efficient join algorithms such as sort-merge join to the use of efficient indices such as the \(B^+\)-tree, the \(R^*\)-tree and Bitmap. However, inequality joins have received little attention, and queries containing such joins are notably slow. In this paper, we introduce fast inequality join algorithms based on sorted arrays and space-efficient bit-arrays. We further introduce a simple method to estimate the selectivity of inequality joins, which is then used to optimize multiple-predicate queries and multi-way joins. Moreover, we study an incremental inequality join algorithm to handle scenarios where data keeps changing. We have implemented a centralized version of these algorithms on top of PostgreSQL, a distributed version on top of Spark SQL, and a version within an existing data cleaning system, Nadeef. By comparing our algorithms against well-known optimization techniques for inequality joins, we show that our solution is more scalable and several orders of magnitude faster.
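
To make the term concrete, the following is a minimal Python sketch of what an inequality join computes. It is not the algorithm introduced in the paper: the toy relations, the column names (duration, cost) and both function names are hypothetical and chosen only for illustration. The first function is the quadratic nested-loop baseline; the second shows, for a single predicate only, how a sorted array lets each probe be answered with a binary search instead of a full scan.

```python
from bisect import bisect_left

# Hypothetical toy relations: (id, duration, cost) tuples.
R = [(1, 140, 9), (2, 100, 12), (3, 90, 5)]
S = [(1, 100, 6), (2, 140, 11), (3, 80, 10), (4, 90, 5)]


def naive_inequality_join(r, s):
    """Nested-loop baseline for r.duration > s.duration AND r.cost < s.cost.

    Compares every pair of tuples, so it costs O(|r| * |s|).
    """
    return [(a[0], b[0]) for a in r for b in s
            if a[1] > b[1] and a[2] < b[2]]


def sorted_single_predicate_join(r, s):
    """Join on the single predicate r.duration > s.duration.

    Sorting s by duration once means the matches for each tuple of r
    form a prefix of the sorted array, found with one binary search.
    """
    s_sorted = sorted(s, key=lambda t: t[1])
    keys = [t[1] for t in s_sorted]
    out = []
    for a in r:
        # Every entry before index `hi` has a strictly smaller duration.
        hi = bisect_left(keys, a[1])
        out.extend((a[0], b[0]) for b in s_sorted[:hi])
    return out


two_predicate_pairs = naive_inequality_join(R, S)
one_predicate_pairs = sorted_single_predicate_join(R, S)
```

On this toy input the two-predicate join returns the id pairs (1, 3) and (3, 3), while the single-predicate join returns six pairs. The sketch stops at one predicate on purpose: handling two inequality predicates together without falling back to pairwise comparison is the harder problem the abstract refers to.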

Keywords

Inequality join, PostgreSQL, Spark SQL, Selectivity estimation, Incremental

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
  2. Qatar Computing Research Institute, HBKU, Doha, Qatar
  3. Arizona State University, Tempe, USA
