Abstract
MapReduce is without doubt the parallel computation paradigm that has best interpreted and served the need, expressed in every field, to run fast and accurate analyses on Big Data. The strength of MapReduce lies in its ability to exploit the computing power of a cluster of resources by distributing the load across multiple computing units, and to scale with the number of those units. Today many data analysis algorithms are available in MapReduce form: data sorting, data indexing, word counting, and relational joins, to name just a few. These algorithms have been observed to work well in computing contexts where the computing units (nodes) are connected by high-performance network links (on the order of gigabits per second). Unfortunately, when MapReduce is run on nodes that are geographically distant from each other, performance degrades dramatically. In such scenarios, the cost of moving data among nodes connected via geographic links offsets the benefit of parallelization. This paper discusses the issues of running MapReduce Joins in a geo-distributed computing context and proposes to boost the performance of the Join algorithm by leveraging a hierarchical computing approach.
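To make the data-movement cost mentioned above concrete, the following is a minimal single-process sketch of a standard repartition (reduce-side) equi-join expressed in map, shuffle, and reduce steps. It is an illustration only and not the hierarchical algorithm proposed in the paper, which the abstract does not detail. The function names, the relation tags "R" and "S", and the tuple layout are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import product


def map_phase(relation_name, rows, key_index):
    """Emit (join_key, (relation_name, row)) pairs, as a mapper would."""
    for row in rows:
        yield row[key_index], (relation_name, row)


def shuffle(pairs):
    """Group mapper output by key.

    On a real cluster this is the step that moves data between nodes,
    so it is where slow geographic links would hurt the most.
    """
    groups = defaultdict(list)
    for key, tagged_row in pairs:
        groups[key].append(tagged_row)
    return groups


def reduce_phase(groups, left_name, right_name):
    """For each key, pair every left-relation row with every right-relation row."""
    joined = []
    for key in sorted(groups):
        left = [row for name, row in groups[key] if name == left_name]
        right = [row for name, row in groups[key] if name == right_name]
        for l_row, r_row in product(left, right):
            joined.append((key, l_row, r_row))
    return joined


def repartition_join(r_rows, s_rows, r_key=0, s_key=0):
    """Join two lists of tuples on the given key positions (hypothetical helper)."""
    pairs = list(map_phase("R", r_rows, r_key)) + list(map_phase("S", s_rows, s_key))
    return reduce_phase(shuffle(pairs), "R", "S")
```

In this sketch every row of both relations passes through the shuffle step, whether or not it finds a join partner. That is why the volume of shuffled data, rather than the compute work, tends to dominate when the nodes sit in different data centers.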
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Di Modica, G., Tomarchio, O. (2019). MapReduce Join Across Geo-Distributed Data Centers. In: Younas, M., Awan, I., Benbernou, S. (eds) Big Data Innovations and Applications. Innovate-Data 2019. Communications in Computer and Information Science, vol 1054. Springer, Cham. https://doi.org/10.1007/978-3-030-27355-2_2
DOI: https://doi.org/10.1007/978-3-030-27355-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27354-5
Online ISBN: 978-3-030-27355-2
eBook Packages: Computer Science, Computer Science (R0)