Distributed Late-binding Scheduling and Cooperative Data Caching

Delgado Peris, Antonio; Hernández, José M.; Huedo, Eduardo

doi:10.1007/s10723-016-9374-y

Published: 18 August 2016

Journal of Grid Computing, Volume 15, pages 235–256 (2017)

Antonio Delgado Peris¹ (ORCID: orcid.org/0000-0002-8511-7958), José M. Hernández¹ & Eduardo Huedo²

Abstract

Pull-based overlays are used in some of today’s largest computational grids. Job agents are submitted to resources, where they retrieve the real workload from a central queue at runtime and execute it. This model helps overcome the problems of direct job submission in highly complex grid environments, namely heterogeneity, imprecise status information, relatively high failure rates, and slow adaptation to changes in grid conditions or user priorities. This article presents a distributed scheduling architecture for such late-binding overlays. In this architecture, execution nodes share a distributed hash table and cooperatively perform job assignment. As our experiments show, this avoids the scalability problems of centralized matching, achieves low and predictable scheduling overheads even when executing large workflows, and improves total turnaround times. These results are in line with the predictions of a theoretical model of grid workflow execution that the article also discusses. Scalability makes fine-grained scheduling possible and enables new functionality, such as a distributed data cache shared by the execution nodes, which helps relieve the commonly congested storage services. In addition, we show that our system is more resilient to problems such as communication breakdowns between computation centres. Moreover, the new architecture is better prepared to deal with demanding scenarios such as intense demand for popular data files or remote data processing.
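As a rough illustration of how execution nodes sharing a distributed hash table might locate cached copies of a data file, consider the minimal Python sketch below. It is not the article's implementation: the abstract does not say which hash table design is used, so the sketch assumes a Kademlia-style XOR distance over SHA-1 identifiers. The `CacheIndex` class, its method names, and the node and file names are all hypothetical, and the whole overlay is simulated inside one process with no networking.

```python
import hashlib


def make_id(name: str) -> int:
    """Map a node or file name to a 160-bit integer identifier (SHA-1)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """XOR metric between two identifiers (an assumption, Kademlia-style)."""
    return a ^ b


def closest_nodes(key: int, node_ids, k: int = 1):
    """Return the k node identifiers closest to key under the XOR metric."""
    return sorted(node_ids, key=lambda n: xor_distance(n, key))[:k]


class CacheIndex:
    """Toy, single-process stand-in for an index of cached files.

    Each node keeps the index shard for the file keys it is closest to;
    a shard maps a file name to the set of nodes holding a cached copy.
    """

    def __init__(self, node_names):
        self.nodes = {make_id(n): n for n in node_names}
        self.index = {node_id: {} for node_id in self.nodes}

    def responsible(self, file_name: str) -> int:
        """Identifier of the node that indexes the given file."""
        return closest_nodes(make_id(file_name), self.nodes)[0]

    def announce(self, holder_name: str, file_name: str) -> None:
        """Record that holder_name has a cached copy of file_name."""
        shard = self.index[self.responsible(file_name)]
        shard.setdefault(file_name, set()).add(holder_name)

    def locate(self, file_name: str):
        """Return the sorted names of nodes known to cache file_name."""
        shard = self.index[self.responsible(file_name)]
        return sorted(shard.get(file_name, set()))


if __name__ == "__main__":
    idx = CacheIndex(["wn1", "wn2", "wn3", "wn4"])
    idx.announce("wn2", "/store/data/file_A.root")
    idx.announce("wn4", "/store/data/file_A.root")
    print(idx.locate("/store/data/file_A.root"))
```

In this sketch a node that needs a file would call `locate` first and fall back to the storage service only when the returned list is empty, which is one plausible way a shared cache could take load off congested storage.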

Author information

Authors and Affiliations

CIEMAT, Av. Complutense, 40, 28040, Madrid, Spain
Antonio Delgado Peris & José M. Hernández
Facultad de Informática, Universidad Complutense de Madrid (UCM), Madrid, Spain
Eduardo Huedo

Corresponding author

Correspondence to Antonio Delgado Peris.

Additional information

We acknowledge the funding support provided by the Spanish Secretaría de Estado de Investigación, Desarrollo e Innovación, through the grant FPA2010-21638-C02-02.

About this article

Cite this article

Delgado Peris, A., Hernández, J.M. & Huedo, E. Distributed Late-binding Scheduling and Cooperative Data Caching. J Grid Computing 15, 235–256 (2017). https://doi.org/10.1007/s10723-016-9374-y

Received: 10 March 2016
Accepted: 10 August 2016
Published: 18 August 2016
Issue Date: June 2017
DOI: https://doi.org/10.1007/s10723-016-9374-y
