Skip to main content
Log in

GEODIS: towards the optimization of data locality-aware job scheduling in geo-distributed data centers

  • Published:
Computing Aims and scope Submit manuscript

Abstract

Today, data-intensive applications rely on geographically distributed systems to leverage data collection, storing and processing. Data locality has been seen as a prominent technique to improve application performance and reduce the impact of network latency by scheduling jobs directly in the nodes hosting the data to be processed. MapReduce and Dryad are examples of frameworks which exploit locality by splitting jobs into multiple tasks that are dispatched to process portions of data locally. However, as the ecosystem of big data analysis has shifted from single clusters to span geo-distributed data centers, it is unavoidable that data may still be transferred through the network in order reduce the schedule length. Nevertheless, there is a lack of mechanism to efficiently blend data locality and inter-data center data transfer requirement in the existing scheduling techniques to address data-intensive processing across dispersed data centers. Therefore, the objective of this work is to propose and solve the makespan optimization problem for data-intensive job scheduling on geo-distributed data centers. To this end, we first formulate the task placement and the data access as a linear programming and use the GLPK solver to solve it. We then present a low complexity heuristic scheduling algorithm called GeoDis which allows data locality to cope with the data transfer requirement to achieve a greater performance on the makespan. The experiments with various realistic traces and synthetic generated workload show that GeoDis can reduce makespan of processing jobs by 44% as compared to the state-of-the-art algorithms and remain within \(91\%\) closer to the optimal solution by the LP solver.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7

Similar content being viewed by others

References

  1. Abad CL, Lu Y, Campbell RH (2011) Dare: adaptive data replication for efficient cluster scheduling. In: 2011 IEEE international conference on cluster computing, pp 159–168. doi:10.1109/CLUSTER.2011.26

  2. Abawajy JH, Deris MM (2014) Data replication approach with consistency guarantee for data grid. IEEE Trans Comput 63(12):2975–2987. doi:10.1109/TC.2013.183

    Article  MathSciNet  MATH  Google Scholar 

  3. AWS: Amazon Web Service (2006). http://aws.amazon.com

  4. Ananthanarayanan G, Ghodsi A, Shenker S, Stoica I (2013) Effective straggler mitigation: attack of the clones. In: Presented as part of the 10th USENIX symposium on networked systems design and implementation (NSDI 13). USENIX, Lombard, IL, pp 185–198. https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/ananthanarayanan

  5. Ananthanarayanan G, Kandula S, Greenberg A, Stoica I, Lu Y, Saha B, Harris E (2010) Reining in the outliers in map-reduce clusters using mantri. In: Proceedings of the 9th USENIX conference on operating systems design and implementation, OSDI’10. USENIX Association, Berkeley, CA, USA, pp 265–278. http://dl.acm.org/citation.cfm?id=1924943.1924962

  6. Anikode LR, Tang B (2011) Integrating scheduling and replication in data grids with performance guarantee. In: Global telecommunications conference (GLOBECOM 2011), 2011 IEEE, pp 1–6. doi:10.1109/GLOCOM.2011.6134492

  7. Breslau L, Cao P, Fan L, Phillips G, Shenker S (1999) Web caching and Zipf-like distributions: evidence and implications. In: INFOCOM ’99. Eighteenth annual joint conference of the IEEE computer and communications societies. Proceedings. IEEE, vol 1, pp 126–134. doi:10.1109/INFCOM.1999.749260

  8. Cameron DG, Carvajal-Schiaffino R, Millar AP, Nicholson C, Stockinger K, Zini F (2003) Evaluating scheduling and replica optimisation strategies in optorsim. In: Proceedings. First Latin American Web Congress, pp 52–59 (2003). doi:10.1109/GRID.2003.1261698

  9. Cardosa M, Wang C, Nangia A, Chandra A, Weissman J (2011) Exploring MapReduce efficiency with highly-distributed data. In: Proceedings of the second international workshop on MapReduce and its applications, MapReduce ’11, ACM, New York, NY, USA, pp 27–34. doi:10.1145/1996092.1996100

  10. Cavallo M, Modica GD, Polito C, Tomarchio O (2016) Application profiling in hierarchical Hadoop for geo-distributed computing environments. In: 2016 IEEE symposium on computers and communication (ISCC), pp 555–560. doi:10.1109/ISCC.2016.7543796

  11. Chen W, Paik I, Li Z (2016) Cost-aware streaming workflow allocation on geo-distributed data centers. IEEE Trans Comput 66(2):256–271. doi:10.1109/TC.2016.2595579

    MathSciNet  MATH  Google Scholar 

  12. Chen Y, Ganapathi A, Griffith R, Katz R (2011) The case for evaluating mapreduce performance using workload suites. In: 2011 IEEE 19th annual international symposium on modelling, analysis, and simulation of computer and telecommunication systems, pp 390–399. doi:10.1109/MASCOTS.2011.12

  13. Cheng D, Rao J, Guo Y, Jiang C, Zhou X (2017) Improving performance of heterogeneous mapreduce clusters with adaptive task tuning. IEEE Trans Parallel Distrib Syst 28(3):774–786. doi:10.1109/TPDS.2016.2594765

    Article  Google Scholar 

  14. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113. doi:10.1145/1327452.1327492

    Article  Google Scholar 

  15. Elghirani A, Subrata R, Zomaya AY (2007) Intelligent scheduling and replication in datagrids: a synergistic approach. In: Seventh IEEE international symposium on cluster computing and the grid (CCGrid ’07), pp 179–182. doi:10.1109/CCGRID.2007.65

  16. Garg N, Kumar A, Pandit V (2007) Order scheduling models: hardness and algorithms. In: Proceedings of the 27th international conference on foundations of software technology and theoretical computer science, FSTTCS’07, Springer, Berlin, pp 96–107

  17. Google Compute Engine (2011). https://cloud.google.com/compute/

  18. Greenberg A, Hamilton J, Maltz DA, Patel P (2008) The cost of a cloud: research problems in data center networks. SIGCOMM Comput Commun Rev 39(1):68–73. doi:10.1145/1496091.1496103

    Article  Google Scholar 

  19. Apache Hadoop Project (2013). http://hadoop.apache.org

  20. Heintz B, Chandra A, Sitaraman RK, Weissman J (2016) End-to-end optimization for geo-distributed mapreduce. IEEE Trans Cloud Comput 4(3):293–306. doi:10.1109/TCC.2014.2355225

    Article  Google Scholar 

  21. Herodotou H, Dong F, Babu S (2011) No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics. In: Proceedings of the 2nd ACM symposium on cloud computing, SOCC ’11, ACM, New York, NY, USA, pp 18:1–18:14. doi:10.1145/2038916.2038934

  22. Hu Z, Li B, Luo J (2016) Flutter: scheduling tasks closer to data across geo-distributed datacenters. In: IEEE INFOCOM 2016 - the 35th annual IEEE international conference on computer communications, pp 1–9. doi:10.1109/INFOCOM.2016.7524469

  23. Hung CC, Golubchik L, Yu M (2015) Scheduling jobs across geo-distributed datacenters. In: Proceedings of the sixth acm symposium on cloud computing, SoCC ’15, ACM, New York, NY, USA , pp 111–124. doi:10.1145/2806777.2806780

  24. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the 2007 Eurosys conference. Association for Computing Machinery, Inc., Lisbon, Portugal. http://research.microsoft.com/apps/pubs/default.aspx?id=63785

  25. Jalaparti V, Ballani H, Costa P, Karagiannis T, Rowstron A (2012) Bridging the tenant-provider gap in cloud services. In: Proceedings of the third ACM symposium on cloud computing, SoCC ’12, ACM, New York, NY, USA, pp 10:1–10:14. doi:10.1145/2391229.2391239

  26. Jalaparti V, Bodik P, Menache I, Rao S, Makarychev K, Caesar M (2015) Network-aware scheduling for data-parallel jobs: plan when you can. SIGCOMM Comput Commun Rev 45(4):407–420. doi:10.1145/2829988.2787488

    Article  Google Scholar 

  27. Jin Y, Gao Y, Qian Z, Zhai M, Peng H, Lu S (2016) Workload-aware scheduling across geo-distributed data centers. In: 2016 IEEE Trustcom/BigDataSE/ISPA, pp 1455–1462. doi:10.1109/TrustCom.2016.0228

  28. Jolfaei F, Haghighat AT (2012) The impact of bandwidth and storage space on job scheduling and data replication strategies in data grids. In: Computing technology and information management (ICCM), 2012 8th international conference on, vol 1, pp 283–288

  29. Kloudas K, Mamede M, Preguiça N, Rodrigues R (2015) Pixida: optimizing data parallel jobs in wide-area data analytics. Proc VLDB Endow 9(2):72–83. doi:10.14778/2850578.2850582

    Article  Google Scholar 

  30. Koshiba Y, Chen W, Yamada Y, Tanaka T, Paik I (2015) Investigation of network traffic in geo-distributed data centers. In: 2015 IEEE 7th international conference on awareness science and technology (iCAST), pp 174–179 (2015). doi:10.1109/ICAwST.2015.7314042

  31. Kwok YK, Ahmad I (1999) Fastest: a practical low-complexity algorithm for compile-time assignment of parallel programs to multiprocessors. IEEE Trans Parallel Distrib Syst 10(2):147–159. doi:10.1109/71.752781

    Article  Google Scholar 

  32. Lee YC, Zomaya AY (2007) Practical scheduling of bag-of-tasks applications on grids with dynamic resilience. IEEE Trans Comput 56(6):815–825. doi:10.1109/TC.2007.1042

    Article  MathSciNet  Google Scholar 

  33. Li P, Guo S, Miyazaki T, Liao X, Jin H, Zomaya A, Wang K (2016) Traffic-aware geo-distributed big data analytics with predictable job completion time. IEEE Trans Parallel Distrib Syst 28(6):1785–1796. doi:10.1109/TPDS.2016.2626285

    Article  Google Scholar 

  34. Li P, Guo S, Yu S, Zhuang W (2015) Cross-cloud mapreduce for big data. IEEE Trans Cloud Comput 26(3):1–14. doi:10.1109/TCC.2015.2474385

  35. Li S, Lu Q, Zhang W, Zhu L (2015) A mapreduce cluster deployment optimization framework with geo-distributed data. In: 2015 IEEE 12th Intl Conf on ubiquitous intelligence and computing and 2015 IEEE 12th intl conf on autonomic and trusted computing and 2015 IEEE 15th intl conf on scalable computing and communications and its associated workshops (UIC-ATC-ScalCom), pp 943–949. doi:10.1109/UIC-ATC-ScalCom-CBDCom-IoP.2015.179

  36. Li, W., Yang, Y., Yuan, D.: A novel cost-effective dynamic data replication strategy for reliability in cloud data centres. In: Dependable, autonomic and secure computing (DASC), 2011 IEEE ninth international conference on, pp 496–502. doi:10.1109/DASC.2011.95

  37. Liao X, Gao Z, Ji W, Wang Y (2015) An enforcement of real time scheduling in spark streaming. In: Green computing conference and sustainable computing conference (IGSC), 2015 sixth international, pp 1–6. doi:10.1109/IGCC.2015.7393730

  38. Lin W, Qian Z, Xu J, Yang S, Zhou J, Zhou L (2016) Streamscope: continuous reliable distributed processing of big data streams. In: 13th USENIX symposium on networked systems design and implementation (NSDI 16), USENIX Association, Santa Clara, CA, pp 439–453. https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/lin

  39. Makhorin A (2012) Gnu linear programming kit, version 4.52. http://www.gnu.org/software/glpk/glpk.html

  40. Mandal A, Xin Y, Baldine I, Ruth P, Heerman C, Chase J, Orlikowski V, Yumerefendi A (2011) Provisioning and evaluating multi-domain networked clouds for Hadoop-based applications. In: 2011 IEEE third international conference on cloud computing technology and science, pp 690–697. doi:10.1109/CloudCom.2011.107

  41. Microsoft Azure (2010). https://azure.microsoft.com/

  42. Nguyen VH, Tuong NH, Tran VH, Thoai N (2013) An MILP-based makespan minimization model for single-machine scheduling problem with splitable jobs and availability constraints. In: Computing, management and telecommunications (ComManTel), 2013 international conference on, pp 397–400. doi:10.1109/ComManTel.2013.6482427

  43. Pu Q, Ananthanarayanan G, Bodik P, Kandula S, Akella A, Bahl P, Stoica I (2015) Low latency geo-distributed data analytics. SIGCOMM Comput Commun Rev 45(4):421–434. doi:10.1145/2829988.2787505

    Article  Google Scholar 

  44. Pu Q, Ananthanarayanan G, Bodik P, Kandula S, Akella A, Bahl P, Stoica I (2015) Low latency geo-distributed data analytics. In: Proceedings of the 2015 ACM conference on special interest group on data communication, SIGCOMM ’15, ACM, New York, NY, USA, pp 421–434. doi:10.1145/2785956.2787505

  45. Rackspace (1998). https://www.rackspace.com/

  46. Schrage L (1968) A proof of the optimality of the shortest remaining processing time discipline. Oper Res 16(3):687–690. https://www.rackspace.com/

  47. Sih GC, Lee EA (1993) A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Trans Parallel Distrib Syst 4(2):175–187. doi:10.1109/71.207593

    Article  Google Scholar 

  48. Sooezi N, Abrishami S, Lotfian M (2015) Scheduling data-driven workflows in multi-cloud environment. In: 2015 IEEE 7th international conference on cloud computing technology and science (CloudCom), pp 163–167. doi:10.1109/CloudCom.2015.95

  49. Apache Spark? (2013). http://spark.apache.org/

  50. Toosi AN, Buyya R (2015) A fuzzy logic-based controller for cost and energy efficient load balancing in geo-distributed data centers. In: 2015 IEEE/ACM 8th international conference on utility and cloud computing (UCC), pp 186–194. doi:10.1109/UCC.2015.35

  51. Tripathi R, Vignesh S, Tamarapalli V, Medhi D (2017) Cost efficient design of fault tolerant geo-distributed data centers. IEEE Trans Network Service Manag 14(2):289–301. doi:10.1109/TNSM.2017.2691007

    Article  Google Scholar 

  52. Tudoran R, Costan A, Antoniu G (2016) Overflow: multi-site aware big data management for scientific workflows on clouds. IEEE Trans Cloud Comput 4(1):76–89. doi:10.1109/TCC.2015.2440254

    Article  Google Scholar 

  53. Venugopal S, Buyya R (2008) An SCP-based heuristic approach for scheduling distributed data-intensive applications on global grids. J Parallel Distrib Comput 68(4):471–487. doi:10.1016/j.jpdc.2007.07.004

    Article  MATH  Google Scholar 

  54. Vulimiri A, Curino C, Godfrey PB, Jungblut T, Karanasos K, Padhye J, Varghese G (2015) Wanalytics: geo-distributed analytics for a data intensive world. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, SIGMOD ’15, ACM, New York, NY, USA, pp 1087–1092. doi:10.1145/2723372.2735365

  55. Vulimiri A, Curino C, Godfrey PB, Jungblut T, Padhye J, Varghese G (2015) Global analytics in the face of bandwidth and regulatory constraints. In: 12th usenix symposium on networked systems design and implementation (NSDI 15), USENIX Association, Oakland, CA, pp 323–336. https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/vulimiri

  56. Wang L, Tao J, Ranjan R, Marten H, Streit A, Chen J, Chen D (2013) G-Hadoop: mapreduce across distributed data centers for data-intensive computing. Future Gener Comput Syst 29(3):739–750. doi:10.1016/j.future.2012.09.001. Special section: recent developments in high performance computing and security

  57. Zarina M, Ahmad F, bin Mohd Rose AN, Nordin M, Deris MM (2013) Job scheduling for dynamic data replication strategy in heterogeneous federation data grid systems. In: Informatics and applications (ICIA), 2013 second international conference on, pp 203–206. doi:10.1109/ICoIA.2013.6650256

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ching-Hsien Hsu.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Convolbo, M.W., Chou, J., Hsu, CH. et al. GEODIS: towards the optimization of data locality-aware job scheduling in geo-distributed data centers. Computing 100, 21–46 (2018). https://doi.org/10.1007/s00607-017-0564-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00607-017-0564-7

Keywords

Mathematics Subject Classification

Navigation