Cluster Computing, Volume 17, Issue 2, pp 371–387

DA-TC: a novel application execution model in multicluster systems

  • Zhifeng Yun
  • Zhou Lei
  • Gabrielle Allen
  • Daniel S. Katz
  • J. Ramanujam

Abstract

The availability of a large number of separate clusters has given rise to the field of multicluster systems, in which these resources are coupled so that their combined capacity can be applied to large-scale compute-intensive applications. However, it is challenging to achieve automatic load balancing of jobs across the participating autonomic systems. We developed a novel user-space execution model named DA-TC to address workload allocation for applications with a large number of sequential jobs in multicluster systems. Through this model, we can achieve dynamic load balancing for task assignment, and slower resources become beneficial factors rather than bottlenecks for application execution. The effectiveness of this strategy is demonstrated through theoretical analysis. The model is also evaluated through extensive experimental studies, and the results show that, compared with the traditional method, the proposed DA-TC model can significantly improve application execution in terms of application turnaround time and system reliability in multicluster environments.

Keywords

Scheduling, Execution management, Load balancing, Cluster computing, Multi-clusters, Distributed systems

Notes

Acknowledgements

We thank the reviewers for their feedback and suggestions that have helped us improve the presentation of the paper. This work is supported in part by the U.S. Department of Energy (DOE) under Award Number DE-FG02-04ER46136, by the Louisiana Board of Regents under contract number DOE/LEQSF (2004-07), by the U.S. National Science Foundation through awards 0811457, 0926687 and 1059417, and by the U.S. Army through contract W911NF-10-1-0004. Portions of this research were conducted using computational resources provided by the Louisiana Optical Network Initiative (http://www.loni.org).

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Zhifeng Yun (1)
  • Zhou Lei (2)
  • Gabrielle Allen (3)
  • Daniel S. Katz (4)
  • J. Ramanujam (3)

  1. Center for Computation and Technology, Louisiana State University, Baton Rouge, USA
  2. School of Computer Engineering and Science, Shanghai University, Shanghai, China
  3. School of Electrical Engineering and Computer Science, Louisiana State University, Baton Rouge, USA
  4. Computation Institute, University of Chicago & Argonne National Laboratory, Chicago, USA
