A Theoretical and Experimental Comparison of Filter-Based Equijoins in MapReduce

Transactions on Large-Scale Data- and Knowledge-Centered Systems XXV

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 9620)

Abstract

MapReduce has become an increasingly popular framework for large-scale data processing. However, complex operations such as joins are quite expensive and require sophisticated techniques. In this paper, we review state-of-the-art strategies for joining several relations in a MapReduce environment and study their extension with filter-based approaches. The general objective of filters is to eliminate non-matching data as early as possible in order to reduce the I/O, communication and CPU costs. We examine the impact of systematically adding filters as early as possible in MapReduce join algorithms, both analytically with cost models and practically with evaluations. The study covers binary joins, multi-way joins and recursive joins, and addresses the case of large inputs that gives rise to the most intricate challenges.
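The filtering idea in the abstract can be illustrated with a minimal single-process sketch. It assumes a Bloom filter as the filter and simulates the map and reduce sides with plain Python lists; the names `BloomFilter`, `filtered_equijoin`, `num_bits` and `num_hashes` are hypothetical and chosen for illustration, and this is not the chapter's own algorithm or cost model.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def filtered_equijoin(r, s):
    """Equijoin r and s (lists of (key, value) pairs), pre-filtering s."""
    # Step 1: build a filter on the join keys of r.
    bf = BloomFilter()
    for key, _ in r:
        bf.add(key)

    # Step 2 (map side): drop s tuples whose key cannot match any r tuple.
    s_kept = [(k, v) for k, v in s if bf.might_contain(k)]

    # Step 3 (reduce side): ordinary hash join on the surviving tuples.
    index = {}
    for k, v in r:
        index.setdefault(k, []).append(v)
    joined = [(k, rv, sv) for k, sv in s_kept for rv in index.get(k, [])]
    return joined, len(s_kept)


r = [(1, "a"), (2, "b")]
s = [(2, "x"), (3, "y"), (2, "z")]
joined, shuffled = filtered_equijoin(r, s)
print(joined, shuffled)
```

Because a Bloom filter never yields false negatives, the join result is unchanged by the filter; false positives only mean that a few non-matching tuples still reach the join step, where the exact key comparison discards them.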


Notes

  1. Our study only considers join conditions based on the equality operator (\(=\)), i.e. equijoins.


Author information

Corresponding author

Correspondence to Thuong-Cang Phan.


Copyright information

© 2016 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Phan, TC., d’Orazio, L., Rigaux, P. (2016). A Theoretical and Experimental Comparison of Filter-Based Equijoins in MapReduce. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXV. Lecture Notes in Computer Science, vol 9620. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49534-6_2

  • DOI: https://doi.org/10.1007/978-3-662-49534-6_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-49533-9

  • Online ISBN: 978-3-662-49534-6

  • eBook Packages: Computer Science (R0)
