Join processing with threshold-based filtering in MapReduce

Lee, Taewhi; Bae, Hye-Chan; Kim, Hyoung-Joo

doi:10.1007/s11227-014-1179-9


Published: 09 April 2014

Volume 69, pages 793–813 (2014)

The Journal of Supercomputing

Taewhi Lee¹, Hye-Chan Bae² & Hyoung-Joo Kim³


Abstract

Data analytics, particularly when it involves heterogeneous data, often requires join operations on datasets collected from different sources. MapReduce, one of the most popular frameworks for large-scale data processing, is not well suited to joining multiple datasets, because it often produces a large number of redundant intermediate results, irrespective of the size of the joined records. Although several existing approaches attempt to reduce the number of such redundant results using Bloom filters, they may be inefficient if large portions of the records are joined or the number of distinct keys is large. To alleviate this problem, we propose a join processing method with threshold-based filtering in MapReduce, called TMFR-Join, an abbreviation for “Threshold-based Map-Filter-Reduce Join”. TMFR-Join applies filters according to their performance, which is estimated in terms of false-positive rates. It also provides a general framework for exploiting various filtering techniques that support certain desired operations. The experimental results indicate that the performance of TMFR-Join is close to that of the better of the existing join processing techniques with and without filters.
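The idea of applying a filter only when its estimated false-positive rate is acceptable can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain Bloom filter and uses the standard approximation (1 - e^(-kn/m))^k for its false-positive rate, and the function names and the default threshold of 0.1 are hypothetical choices made only for illustration.

```python
import math


def bloom_false_positive_rate(num_bits: int, num_hashes: int, num_keys: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k for a Bloom filter with
    m bits, k hash functions, and n inserted distinct keys."""
    if num_bits <= 0 or num_hashes <= 0 or num_keys < 0:
        raise ValueError("num_bits and num_hashes must be positive, num_keys non-negative")
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes


def should_apply_filter(num_bits: int, num_hashes: int, num_keys: int,
                        threshold: float = 0.1) -> bool:
    """Hypothetical decision rule: use the filter only while its estimated
    false-positive rate stays at or below the threshold (assumed value)."""
    return bloom_false_positive_rate(num_bits, num_hashes, num_keys) <= threshold


if __name__ == "__main__":
    # Few distinct keys relative to the filter size: the estimate is low.
    print(should_apply_filter(num_bits=1000, num_hashes=3, num_keys=100))
    # Many distinct keys: the filter saturates and filtering is skipped.
    print(should_apply_filter(num_bits=1000, num_hashes=3, num_keys=1000))
```

With a fixed filter size, the estimate grows with the number of distinct keys, which matches the abstract's observation that Bloom-filter-based approaches may become inefficient when the number of distinct keys is large.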



Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIP) (No. 20120005695), by ETRI R&D Program (“Development of Big Data Platform for Dual Mode Batch/Query Analytics, 14ZS1400”) funded by the Government of Korea, and by Samsung Electronics Co. Ltd.

Author information

Authors and Affiliations

BigData Software Platform Research Department, Electronics and Telecommunications Research Institute, 218 Gajeong-ro, Yuseong-gu, Daejeon, 305-700, Republic of Korea
Taewhi Lee
Media Solution Center, Samsung Electronics Co., Ltd., 129 Samsung-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do, 443-742, Republic of Korea
Hye-Chan Bae
Department of Computer Science and Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, 151-744, Republic of Korea
Hyoung-Joo Kim

Corresponding author

Correspondence to Taewhi Lee.

About this article

Cite this article

Lee, T., Bae, HC. & Kim, HJ. Join processing with threshold-based filtering in MapReduce. J Supercomput 69, 793–813 (2014). https://doi.org/10.1007/s11227-014-1179-9

Published: 09 April 2014
Issue Date: August 2014
DOI: https://doi.org/10.1007/s11227-014-1179-9
