The Journal of Supercomputing, Volume 72, Issue 7, pp 2537–2564

DCCP: an effective data placement strategy for data-intensive computations in distributed cloud computing systems

  • Tao Wang
  • Shihong Yao
  • Zhengquan Xu
  • Shan Jia


Cloud computing systems provide high-performance computing resources and distributed storage for data-intensive computations. When large datasets are stored at different data centers and data analytics requires inter-center access, data scheduling between data centers becomes indispensable. The performance of data scheduling, however, depends heavily on how sensibly the data are placed. Data placement is a key optimization for reducing data scheduling between data centers and achieving statistical I/O load balancing, which in turn reduces the mean computation execution time. This paper proposes DCCP, a data placement strategy based on dynamic computation correlation. DCCP places datasets with high dynamic computation correlation at the same data center while taking the I/O load and the capacity load of the data centers into account; when computations are scheduled to that data center, most of the datasets they process are stored locally, so the mean computation execution time is reduced. Extensive experiments show that DCCP achieves statistical I/O load balancing and capacity load balancing across data centers, and thereby reduces the total data scheduling between data centers at very low time complexity, even as the numbers of datasets and data centers increase.
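The idea of co-locating correlated datasets under load and capacity limits can be illustrated with a small greedy sketch. This is not the DCCP algorithm itself, whose details are not given in the abstract. It assumes that the correlation between two datasets is the summed frequency of the computations that read both, and that I/O load is the summed access frequency of the datasets at a center. The names `correlation`, `place`, and `cross_center_weight` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def correlation(computations):
    """Assumed correlation: summed frequency of computations reading both datasets."""
    corr = defaultdict(float)
    for datasets, freq in computations:
        for a, b in combinations(sorted(set(datasets)), 2):
            corr[(a, b)] += freq
    return corr


def place(sizes, computations, capacities):
    """Greedy illustrative placement; returns {dataset: data center index}."""
    corr = correlation(computations)

    # Assumed I/O demand of a dataset: total frequency of computations reading it.
    io = defaultdict(float)
    for datasets, freq in computations:
        for d in set(datasets):
            io[d] += freq

    free = list(capacities)              # remaining capacity per center
    load = [0.0] * len(capacities)       # accumulated I/O load per center
    members = [[] for _ in capacities]   # datasets already placed per center
    placement = {}

    # Place the most frequently accessed datasets first (name breaks ties).
    for d in sorted(sizes, key=lambda x: (-io[x], x)):
        best, best_key = None, None
        for c in range(len(capacities)):
            if free[c] < sizes[d]:
                continue  # capacity constraint
            affinity = sum(corr.get(tuple(sorted((d, m))), 0.0) for m in members[c])
            key = (affinity, -load[c])   # prefer high correlation, then low I/O load
            if best_key is None or key > best_key:
                best, best_key = c, key
        if best is None:
            raise ValueError(f"no data center can hold dataset {d!r}")
        placement[d] = best
        free[best] -= sizes[d]
        load[best] += io[d]
        members[best].append(d)
    return placement


def cross_center_weight(placement, computations):
    """Summed frequency of computations whose datasets span several centers."""
    return sum(
        freq
        for datasets, freq in computations
        if len({placement[d] for d in datasets}) > 1
    )
```

With `sizes = {"a": 1, "b": 1, "c": 1, "d": 1}`, `computations = [(["a", "b"], 5), (["c", "d"], 3)]`, and `capacities = [2, 2]`, `place` returns `{'a': 0, 'b': 0, 'c': 1, 'd': 1}`, so neither computation needs an inter-center transfer. A greedy pass like this is only a baseline for intuition; it ignores the dynamic aspect of the correlation that the paper emphasizes.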


Keywords: Cloud computing; Data placement; Data scheduling; I/O load balancing; Dynamic computation correlation



The research reported in this paper is supported by the National Basic Research Program of China (No. 2011CB302306) and the National Natural Science Foundation of China (Nos. 41271398 and 61402421).



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Tao Wang (1, 2)
  • Shihong Yao (1, 2)
  • Zhengquan Xu (1, 2), corresponding author
  • Shan Jia (1, 2)

  1. State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, People’s Republic of China
  2. Collaborative Innovation Center for Geospatial Technology, Wuhan, People’s Republic of China
