Volume 100, Issue 1, pp 21–46

GEODIS: towards the optimization of data locality-aware job scheduling in geo-distributed data centers

  • Moïse W. Convolbo
  • Jerry Chou
  • Ching-Hsien Hsu (Email author)
  • Yeh Ching Chung


Today, data-intensive applications rely on geographically distributed systems for data collection, storage, and processing. Data locality is widely regarded as an effective technique for improving application performance and reducing the impact of network latency: jobs are scheduled directly on the nodes that host the data to be processed. MapReduce and Dryad are examples of frameworks that exploit locality by splitting jobs into multiple tasks, which are dispatched to process portions of the data locally. However, as the big data analysis ecosystem has shifted from single clusters to geo-distributed data centers, some data must inevitably still be transferred over the network in order to reduce the schedule length. Existing scheduling techniques lack a mechanism that efficiently blends data locality with inter-data-center transfer requirements for data-intensive processing across dispersed data centers. The objective of this work is therefore to formulate and solve the makespan optimization problem for data-intensive job scheduling on geo-distributed data centers. To this end, we first formulate task placement and data access as a linear program and solve it with the GLPK solver. We then present a low-complexity heuristic scheduling algorithm, GeoDis, which combines data locality with data transfer to achieve a shorter makespan. Experiments with various realistic traces and synthetically generated workloads show that GeoDis can reduce the makespan of processing jobs by 44% compared to state-of-the-art algorithms and remains within 91% of the optimal solution found by the LP solver.
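
The trade-off between running a task where its data lives and shipping the data elsewhere to shorten the schedule can be illustrated with a toy greedy scheduler. This is only a minimal sketch under stated assumptions and is not the GeoDis algorithm or the paper's LP formulation: the `Task` fields, the `greedy_schedule` function, the single uniform inter-data-center bandwidth, and the one-task-at-a-time model per data center are all hypothetical simplifications introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    home: str         # data center that stores the task's input (assumed field)
    data_gb: float    # input size in gigabytes
    compute_s: float  # processing time in seconds once the data is available


def greedy_schedule(tasks, centers, bandwidth_gbps):
    """Place each task on the data center where it would finish earliest.

    A remote placement pays data_gb * 8 / bandwidth_gbps seconds of transfer
    before computing; a local placement pays nothing. Each data center is
    modeled as running one task at a time (a deliberate simplification).
    Returns (placement dict, makespan in seconds).
    """
    ready = {c: 0.0 for c in centers}  # time at which each center becomes free
    placement = {}
    # Longest tasks first, a common list-scheduling ordering.
    for task in sorted(tasks, key=lambda t: t.compute_s, reverse=True):
        best_center, best_finish = None, float("inf")
        for center in centers:
            transfer = 0.0 if center == task.home else task.data_gb * 8 / bandwidth_gbps
            finish = ready[center] + transfer + task.compute_s
            if finish < best_finish:
                best_center, best_finish = center, finish
        ready[best_center] = best_finish
        placement[task.name] = best_center
    return placement, max(ready.values())
```

For instance, with three 10-second tasks whose 1 GB inputs all reside in data center `A`, two data centers `A` and `B`, and an 8 Gbps link, a strictly local schedule would take 30 seconds, whereas this sketch moves one task to `B` and finishes in 20 seconds.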


Geo-distributed; Data center; Scheduling; Data locality; Batch jobs; Big data analysis

Mathematics Subject Classification

90C05 Linear programming; 90C27 Combinatorial optimization; 90C46 Optimality conditions, duality



Copyright information

© Springer-Verlag GmbH Austria 2017

Authors and Affiliations

  • Moïse W. Convolbo (1)
  • Jerry Chou (1)
  • Ching-Hsien Hsu (2, 3), Email author
  • Yeh Ching Chung (1)

  1. National Tsing Hua University, Hsinchu, Taiwan
  2. School of Mathematics and Big Data, Foshan University, Foshan, China
  3. Chung Hua University, Hsinchu, Taiwan
