Distributed Late-binding Scheduling and Cooperative Data Caching

Journal of Grid Computing

Abstract

Pull-based overlays are used in some of today’s largest computational grids. Job agents are submitted to resources, where they retrieve the real workload from a central queue at runtime and execute it. This late-binding model helps overcome the problems of direct job submission in highly complex grid environments, namely heterogeneity, imprecise status information, relatively high failure rates, and slow adaptation to changes in grid conditions or user priorities. This article presents a distributed scheduling architecture for such late-binding overlays, in which execution nodes share a distributed hash table and cooperatively perform job assignment. Our experiments show that this design avoids the scalability problems of centralized matching, achieves low and predictable scheduling overheads even when executing large workflows, and improves total turnaround times. These results agree with the predictions of a theoretical model of grid workflow execution that the article also discusses. Scalability makes fine-grained scheduling possible and enables new functionality, such as a distributed data cache shared by the execution nodes, which helps relieve the commonly congested storage services. In addition, we show that our system is more resilient to problems such as communication breakdowns between computation centres. Finally, the new architecture is better prepared for demanding scenarios such as intense demand for popular data files or remote data processing.
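To make the idea of a cache shared through a distributed hash table more concrete, the following is a minimal sketch, not the article's implementation. It assumes a Kademlia-style XOR distance between hashed identifiers (the abstract does not say which hash table design is used), and every name in it (`node_id`, `responsible_nodes`, `CooperativeCache`, the node and file names) is hypothetical. The whole overlay is simulated in one process: each file name maps to the node whose identifier is closest to the file's hash, and the storage service is read only on a cache miss.

```python
import hashlib


def node_id(name: str) -> int:
    """Map a node name or file name to a 160-bit integer identifier (SHA-1)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """XOR distance between two identifiers (assumed metric, Kademlia-style)."""
    return a ^ b


def responsible_nodes(key: str, nodes: list, replicas: int = 1) -> list:
    """Return the `replicas` nodes whose identifiers are closest to the key."""
    key_id = node_id(key)
    ranked = sorted(nodes, key=lambda n: xor_distance(node_id(n), key_id))
    return ranked[:replicas]


class CooperativeCache:
    """In-process simulation of a cache spread over execution nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # One local store per node; in a real overlay these live on separate hosts.
        self.store = {n: {} for n in self.nodes}
        self.storage_reads = 0  # counts accesses to the (simulated) storage service

    def _read_from_storage(self, file_name: str) -> bytes:
        # Hypothetical stand-in for a read from a congested storage service.
        return f"contents of {file_name}".encode("utf-8")

    def fetch(self, file_name: str) -> bytes:
        """Serve a file from the responsible node, reading storage only on a miss."""
        owner = responsible_nodes(file_name, self.nodes)[0]
        cached = self.store[owner].get(file_name)
        if cached is not None:
            return cached
        self.storage_reads += 1
        data = self._read_from_storage(file_name)
        self.store[owner][file_name] = data
        return data
```

In this sketch, repeated requests for a popular file reach the storage service once, after which any node can obtain the file from the responsible peer. A real overlay would also need routing between hosts, node churn handling, and cache eviction, none of which are modelled here.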

Author information

Corresponding author

Correspondence to Antonio Delgado Peris.

Additional information

We acknowledge the funding support provided by the Spanish Secretaría de Estado de Investigación, Desarrollo e Innovación, through the grant FPA2010-21638-C02-02.

About this article

Cite this article

Delgado Peris, A., Hernández, J.M. & Huedo, E. Distributed Late-binding Scheduling and Cooperative Data Caching. J Grid Computing 15, 235–256 (2017). https://doi.org/10.1007/s10723-016-9374-y

  • DOI: https://doi.org/10.1007/s10723-016-9374-y
