Abstract
Recent trends in grid computing development are moving towards a service-oriented architecture. As service-oriented grid computing systems gain momentum, support for integrated scheduling and fault tolerance becomes of paramount importance. To this end, we propose a scalable framework that loosely couples a dynamic job scheduling approach with a hybrid replication approach to schedule jobs efficiently while also providing fault tolerance. The novelty of the proposed framework is that it uses passive replication under high system load and active replication under low system load. The switch between the two replication methods is performed dynamically and transparently.
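The load-dependent switch described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the names `HybridReplicationPolicy`, `select_mode`, and `replicas_to_launch`, the use of the busy-node fraction as the load metric, and the 0.7 threshold are all assumptions introduced here for illustration. The sketch also assumes the usual meaning of the two modes: under active replication every replica runs the job at once, while under passive replication a single primary runs it and backups start only after a failure.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicationMode(Enum):
    ACTIVE = "active"    # every replica executes the job concurrently
    PASSIVE = "passive"  # one primary executes; backups take over on failure


@dataclass
class HybridReplicationPolicy:
    # Hypothetical threshold: fraction of busy nodes at or above which
    # the system is treated as highly loaded. Not taken from the paper.
    high_load_threshold: float = 0.7

    def select_mode(self, busy_nodes: int, total_nodes: int) -> ReplicationMode:
        """Pick passive replication under high load, active otherwise."""
        if total_nodes <= 0:
            raise ValueError("total_nodes must be positive")
        load = busy_nodes / total_nodes
        if load >= self.high_load_threshold:
            return ReplicationMode.PASSIVE
        return ReplicationMode.ACTIVE


def replicas_to_launch(mode: ReplicationMode, replication_degree: int) -> int:
    """Number of job copies that start executing immediately."""
    if mode is ReplicationMode.ACTIVE:
        return replication_degree
    return 1  # passive: only the primary runs until a failure is detected


if __name__ == "__main__":
    policy = HybridReplicationPolicy()
    for busy in (2, 9):
        mode = policy.select_mode(busy_nodes=busy, total_nodes=10)
        print(busy, mode.value, replicas_to_launch(mode, replication_degree=3))
```

Because the mode is re-evaluated on every call, a scheduler that consults the policy at each dispatch would change replication strategy as load changes without any action from the job submitter, which is one way to read "dynamically and transparently".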
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Abawajy, J.H. (2005). Robust Parallel Job Scheduling Infrastructure for Service-Oriented Grid Computing Systems. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2005. ICCSA 2005. Lecture Notes in Computer Science, vol 3483. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424925_132
DOI: https://doi.org/10.1007/11424925_132
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25863-6
Online ISBN: 978-3-540-32309-9
eBook Packages: Computer Science (R0)