Journal of Grid Computing, Volume 15, Issue 2, pp 235–256

Distributed Late-binding Scheduling and Cooperative Data Caching

  • Antonio Delgado Peris
  • José M. Hernández
  • Eduardo Huedo
Article

Abstract

Pull-based overlays are used in some of today’s largest computational grids. Job agents are submitted to resources, where they retrieve the real workload from a central queue at runtime and execute it. This model helps overcome the problems of direct job submission in highly complex grid environments, namely heterogeneity, imprecise status information, relatively high failure rates, and slow adaptation to changes in grid conditions or user priorities. This article presents a distributed scheduling architecture for such late-binding overlays. In this architecture, execution nodes share a distributed hash table and cooperatively perform job assignment. As our experiments show, the scalability problems of centralized matching are avoided: scheduling overheads remain low and predictable even when executing large workflows, and total turnaround times improve. This is in line with the predictions of a theoretical model of grid workflow execution that the article also discusses. Scalability makes fine-grained scheduling possible and enables new functionality, such as a distributed data cache shared by the execution nodes, which helps relieve the commonly congested storage services. In addition, we show that our system is more resilient to problems such as communication breakdowns between computation centres. Moreover, the new architecture is better prepared to deal with demanding scenarios such as intense demand for popular data files or remote data processing.
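
The idea of execution nodes sharing a distributed hash table to back a cooperative data cache can be illustrated with a minimal sketch. It assumes a Kademlia-style XOR distance between SHA-1 identifiers to decide which node holds a given file; the abstract does not specify the hash table design, and the names make_id, closest_nodes and CooperativeCache are hypothetical, chosen only for illustration. The sketch runs in a single process and does not model networking, replication or node churn.

```python
import hashlib


def make_id(name: str) -> int:
    """Map a node name or file name to a 160-bit integer ID (SHA-1)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """XOR metric between two IDs (assumed here; Kademlia-style)."""
    return a ^ b


def closest_nodes(key: str, nodes: list[str], k: int = 1) -> list[str]:
    """Return the k nodes whose IDs are closest to the key's ID."""
    key_id = make_id(key)
    return sorted(nodes, key=lambda n: xor_distance(make_id(n), key_id))[:k]


class CooperativeCache:
    """Toy in-process stand-in for a cache shared by execution nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # One dictionary per node plays the role of that node's local cache.
        self.store = {n: {} for n in self.nodes}
        # Counts how often the (hypothetical) storage service is contacted.
        self.storage_reads = 0

    def fetch(self, file_name, read_from_storage):
        """Serve a file from the responsible node, reading storage on a miss."""
        owner = closest_nodes(file_name, self.nodes)[0]
        cached = self.store[owner]
        if file_name not in cached:
            self.storage_reads += 1
            cached[file_name] = read_from_storage(file_name)
        return cached[file_name]


# Two requests for the same file reach the storage service only once.
demo = CooperativeCache(["node-a", "node-b", "node-c"])
demo.fetch("dataset/file-001", lambda name: "contents of " + name)
demo.fetch("dataset/file-001", lambda name: "contents of " + name)
print(demo.storage_reads)  # 1
```

Because every node derives the responsible peer from the file name alone, repeated requests for a popular file converge on the same cache entry without consulting a central catalogue, which is the kind of load relief for storage services that the abstract describes.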

Keywords

Grid computing Scalable architectures Peer-to-peer Distributed algorithms 

Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  1. CIEMAT, Madrid, Spain
  2. Facultad de Informática, Universidad Complutense de Madrid (UCM), Madrid, Spain
