Fast and scalable inequality joins

Khayyat, Zuhair; Lucia, William; Singh, Meghna; Ouzzani, Mourad; Papotti, Paolo; Quiané-Ruiz, Jorge-Arnulfo; Tang, Nan; Kalnis, Panos

doi:10.1007/s00778-016-0441-6

Special Issue Paper
Published: 07 September 2016

Volume 26, pages 125–150, (2017)

The VLDB Journal

Zuhair Khayyat ORCID: orcid.org/0000-0003-3650-6997¹,
William Lucia²,
Meghna Singh²,
Mourad Ouzzani²,
Paolo Papotti ORCID: orcid.org/0000-0003-0651-4128³,
Jorge-Arnulfo Quiané-Ruiz²,
Nan Tang² &
Panos Kalnis¹

Abstract

Inequality joins, which join relations using inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research, ranging from efficient join algorithms such as sort-merge join to the use of efficient indices such as the \(B^+\)-tree, \(R^*\)-tree and Bitmap. However, inequality joins have received little attention, and queries containing such joins are notably very slow. In this paper, we introduce fast inequality join algorithms based on sorted arrays and space-efficient bit-arrays. We further introduce a simple method to estimate the selectivity of inequality joins, which is then used to optimize multiple-predicate queries and multi-way joins. Moreover, we study an incremental inequality join algorithm to handle scenarios where data keeps changing. We have implemented these algorithms in a centralized version on top of PostgreSQL, in a distributed version on top of Spark SQL, and in an existing data cleaning system, NADEEF. By comparing our algorithms against well-known optimization techniques for inequality joins, we show that our solution is more scalable and several orders of magnitude faster.
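
To make the idea of joining with sorted arrays and a bit-array concrete, the following is a minimal sketch of a self-join with two inequality predicates (`s.x < t.x` and `s.y > t.y`). It is an illustration under stated assumptions, not the algorithm from the paper: the function name `inequality_self_join`, the `(x, y)` tuple layout, and the use of one byte per flag instead of a packed bit-array are all hypothetical choices made for readability.

```python
from bisect import bisect_right


def inequality_self_join(rows):
    """Return index pairs (i, j) with rows[i].x < rows[j].x and rows[i].y > rows[j].y.

    rows is a list of (x, y) tuples (a hypothetical layout for this sketch).
    """
    n = len(rows)

    # Sorted array 1: row indices ordered by x, plus each row's position in it.
    by_x = sorted(range(n), key=lambda i: rows[i][0])
    xs = [rows[i][0] for i in by_x]
    pos_x = [0] * n
    for p, i in enumerate(by_x):
        pos_x[i] = p

    # Sorted array 2: row indices ordered by y (ascending).
    by_y = sorted(range(n), key=lambda i: rows[i][1])

    # One flag per x-position. A real implementation would pack these into bits;
    # a bytearray keeps the sketch simple.
    seen = bytearray(n)

    result = []
    k = 0
    while k < n:
        # Collect the group of rows that share the same y value, so that the
        # strict predicate y_i > y_j never matches rows with equal y.
        end = k
        while end < n and rows[by_y[end]][1] == rows[by_y[k]][1]:
            end += 1

        # Every flagged row has a strictly smaller y than the current group.
        for i in by_y[k:end]:
            # First x-position whose x is strictly greater than rows[i].x.
            start = bisect_right(xs, rows[i][0])
            for p in range(start, n):
                if seen[p]:
                    result.append((i, by_x[p]))

        # Only now flag the current group for later (larger-y) rows.
        for i in by_y[k:end]:
            seen[pos_x[i]] = 1
        k = end

    return result


rows = [(1, 5), (2, 3), (3, 4), (2, 5)]
print(sorted(inequality_self_join(rows)))  # [(0, 1), (0, 2), (3, 2)]
```

Visiting rows in `y` order turns the second predicate into "already flagged", and the `x`-sorted array turns the first predicate into "positioned to the right", so each probe is a scan over a suffix of the flag array rather than a comparison against every row.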

Notes

We motivate the problem with a real-life story. Names and queries have been changed for confidentiality reasons.
We experimentally show in Sect. 8.7 that our algorithm requires only 1% of the input data to be accurate.
http://github.com/daqcri/NADEEF.
http://www.boost.org/

Acknowledgments

Portions of the research in this paper used the MDC Database made available by Idiap Research Institute, Switzerland and owned by Nokia.

Author information

Authors and Affiliations

King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Zuhair Khayyat & Panos Kalnis
Qatar Computing Research Institute, HBKU, Doha, Qatar
William Lucia, Meghna Singh, Mourad Ouzzani, Jorge-Arnulfo Quiané-Ruiz & Nan Tang
Arizona State University, Tempe, AZ, USA
Paolo Papotti

Corresponding author

Correspondence to Zuhair Khayyat.

About this article

Cite this article

Khayyat, Z., Lucia, W., Singh, M. et al. Fast and scalable inequality joins. The VLDB Journal 26, 125–150 (2017). https://doi.org/10.1007/s00778-016-0441-6

Received: 21 December 2015
Revised: 03 August 2016
Accepted: 19 August 2016
Published: 07 September 2016
Issue Date: February 2017
DOI: https://doi.org/10.1007/s00778-016-0441-6
