MapReduce Join Across Geo-Distributed Data Centers

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1054)

Abstract

MapReduce is undoubtedly the parallel computation paradigm that has best met the need, felt in virtually every field, to run fast and accurate analyses on Big Data. Its strength lies in its ability to exploit the computing power of a cluster by distributing the load across multiple computing units, and to scale with the number of those units. Many data analysis algorithms are now available in MapReduce form: data sorting, data indexing, word counting, and relational joins, to name just a few. These algorithms have been observed to work well in computing contexts where the computing units (nodes) are connected by high-performance network links (on the order of gigabits per second). Unfortunately, when MapReduce runs on nodes that are geographically distant from each other, performance degrades dramatically: in such scenarios the cost of moving data among nodes connected via geographic links offsets the benefit of parallelization. This paper discusses the issues of running MapReduce joins in a geo-distributed computing context and proposes to boost the performance of the join algorithm by leveraging a hierarchical computing approach.
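
To make the join in MapReduce form concrete, the following is a minimal sketch of a generic reduce-side (repartition) equi-join, simulated in a single Python process. It is not the hierarchical algorithm proposed in this paper. The function names, the "R"/"S" relation tags, and the sample relations are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product


def map_phase(records, tag, key_index):
    """Emit (join_key, (tag, record)) pairs for one input relation."""
    for record in records:
        yield record[key_index], (tag, record)


def shuffle(pairs):
    """Group mapped values by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every tuple tagged R with every tuple tagged S."""
    for key, values in groups.items():
        left = [rec for tag, rec in values if tag == "R"]
        right = [rec for tag, rec in values if tag == "S"]
        for r, s in product(left, right):
            yield key, r, s


def reduce_side_join(r, s, r_key=0, s_key=0):
    """Run map, shuffle, and reduce in sequence and return sorted results."""
    pairs = list(map_phase(r, "R", r_key)) + list(map_phase(s, "S", s_key))
    return sorted(reduce_phase(shuffle(pairs)))


# Hypothetical sample relations: users(id, name) and orders(user_id, item).
users = [(1, "ada"), (2, "bob"), (3, "eve")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]

joined = reduce_side_join(users, orders)
for row in joined:
    print(row)
```

In a real deployment the shuffle step sends every tagged tuple over the network to the reducer responsible for its key. That data movement is the part that becomes expensive when the nodes are connected via geographic links.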

Author information

Corresponding author

Correspondence to Orazio Tomarchio.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Di Modica, G., Tomarchio, O. (2019). MapReduce Join Across Geo-Distributed Data Centers. In: Younas, M., Awan, I., Benbernou, S. (eds) Big Data Innovations and Applications. Innovate-Data 2019. Communications in Computer and Information Science, vol 1054. Springer, Cham. https://doi.org/10.1007/978-3-030-27355-2_2

  • DOI: https://doi.org/10.1007/978-3-030-27355-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27354-5

  • Online ISBN: 978-3-030-27355-2

  • eBook Packages: Computer Science, Computer Science (R0)
