MapReduce Join Across Geo-Distributed Data Centers

Di Modica, Giuseppe; Tomarchio, Orazio

doi:10.1007/978-3-030-27355-2_2

Conference paper
First Online: 01 August 2019

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1054)

Abstract

MapReduce is without doubt the parallel computation paradigm that has best met the need, expressed in every field, to run fast and accurate analyses on Big Data. Its strength lies in its ability to exploit the computing power of a cluster of resources by distributing the load across multiple computing units, and to scale with the number of those units. Today many data analysis algorithms are available in MapReduce form: Data Sorting, Data Indexing, Word Counting, and Relations Joining, to name just a few. These algorithms have been observed to work well in computing contexts where the computing units (nodes) are connected by high-performance network links (on the order of gigabits per second). Unfortunately, when MapReduce is run on nodes that are geographically distant from each other, performance degrades dramatically. In such scenarios the cost of moving data among nodes connected via geographic links offsets the benefit of parallelization. In this paper we discuss the issues of running MapReduce Joins in a geo-distributed computing context. Furthermore, we propose to boost the performance of the Join algorithm by leveraging a hierarchical computing approach.
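
For readers unfamiliar with how a join is expressed in MapReduce form, the following is a minimal single-process sketch of a generic reduce-side equi-join. It is a textbook illustration, not the hierarchical approach proposed in the paper. The function names, the relation tags "R" and "S", and the tuple layout are assumptions made for this example only.

```python
from collections import defaultdict
from itertools import product


def map_phase(relation_name, rows, key_index):
    """Emit (join_key, (relation_name, row)) so each row carries its source tag."""
    for row in rows:
        yield row[key_index], (relation_name, row)


def shuffle(pairs):
    """Group tagged rows by join key, mimicking the MapReduce shuffle step."""
    groups = defaultdict(list)
    for key, tagged_row in pairs:
        groups[key].append(tagged_row)
    return groups


def reduce_phase(tagged_rows):
    """For one join key, pair every row tagged "R" with every row tagged "S"."""
    left = [row for name, row in tagged_rows if name == "R"]
    right = [row for name, row in tagged_rows if name == "S"]
    for r_row, s_row in product(left, right):
        yield r_row + s_row  # concatenate the two matching tuples


def reduce_side_join(r_rows, s_rows, r_key, s_key):
    """Join relation R with relation S on R[r_key] == S[s_key]."""
    pairs = list(map_phase("R", r_rows, r_key)) + list(map_phase("S", s_rows, s_key))
    result = []
    for tagged_rows in shuffle(pairs).values():
        result.extend(reduce_phase(tagged_rows))
    return result
```

In this sketch every input row passes through the shuffle step, including rows that never find a match. On a real cluster the shuffle moves data between nodes, which is why a join of this kind becomes expensive when the nodes are connected by slow geographic links.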

Author information

Authors and Affiliations

Department of Electrical, Electronic and Computer Engineering, University of Catania, V.le A. Doria 6, 95125, Catania, Italy
Giuseppe Di Modica & Orazio Tomarchio

Corresponding author

Correspondence to Orazio Tomarchio.

Editor information

Editors and Affiliations

School of ECM, Oxford Brookes University, Oxford, UK
Muhammad Younas
Department of Informatics, University of Bradford, Bradford, UK
Irfan Awan
Université Paris Descartes, Paris, France
Salima Benbernou

About this paper

Cite this paper

Di Modica, G., Tomarchio, O. (2019). MapReduce Join Across Geo-Distributed Data Centers. In: Younas, M., Awan, I., Benbernou, S. (eds) Big Data Innovations and Applications. Innovate-Data 2019. Communications in Computer and Information Science, vol 1054. Springer, Cham. https://doi.org/10.1007/978-3-030-27355-2_2

DOI: https://doi.org/10.1007/978-3-030-27355-2_2
Published: 01 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27354-5
Online ISBN: 978-3-030-27355-2
eBook Packages: Computer Science, Computer Science (R0)
