Abstract
Computational Grids like EGEE offer sufficient capacity for even the most challenging large-scale computational experiments and have thus become an indispensable tool for researchers in various fields. However, the utility of these infrastructures is severely hampered by their notoriously low reliability: a recent nine-month study found that only 48% of jobs submitted in South-Eastern Europe completed successfully. We attack this problem by means of proactive failure detection. Specifically, we predict site failures on a short-term time scale by deploying machine learning algorithms to discover relationships between site performance variables and subsequent failures. Such predictions can be used by Resource Brokers to decide where to submit new jobs, and can help operators take preventive measures. Our experimental evaluation on a 30-day trace from 197 EGEE queues shows that the accuracy of the results depends heavily on the selected queue, the type of failure, the preprocessing, and the choice of input variables.
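As a rough illustration of the setup described above, the following Python sketch shows one way a single queue's trace could be turned into training examples for such a predictor: each example pairs a sliding window of recent performance variables with a label stating whether a failure follows within a short horizon. The function name `make_examples`, the `window` and `horizon` parameters, and the data layout are assumptions made for this sketch; the abstract does not specify the actual preprocessing or learning algorithms.

```python
from typing import List, Sequence, Tuple


def make_examples(
    metrics: Sequence[Sequence[float]],
    failed: Sequence[bool],
    window: int,
    horizon: int,
) -> Tuple[List[List[float]], List[int]]:
    """Turn one queue's trace into (features, label) pairs.

    metrics[t] holds the performance variables sampled at time step t and
    failed[t] says whether the queue failed at step t. Both layouts are
    hypothetical and chosen only for this sketch.
    """
    if len(metrics) != len(failed):
        raise ValueError("metrics and failed must have the same length")
    if window < 1 or horizon < 1:
        raise ValueError("window and horizon must be positive")

    features: List[List[float]] = []
    labels: List[int] = []
    # t is the last time step included in the observation window.
    for t in range(window - 1, len(metrics) - horizon):
        past = metrics[t - window + 1 : t + 1]
        # Flatten the window into a single feature vector.
        flat = [value for sample in past for value in sample]
        # Label is 1 if any failure occurs within the next `horizon` steps.
        label = int(any(failed[t + 1 : t + 1 + horizon]))
        features.append(flat)
        labels.append(label)
    return features, labels


# Toy trace with one performance variable per step (invented values).
toy_metrics = [[0.1], [0.2], [0.9], [0.95], [0.3], [0.2]]
toy_failed = [False, False, False, False, True, False]
X, y = make_examples(toy_metrics, toy_failed, window=2, horizon=1)
```

The resulting pairs could then be passed to any standard classifier. Since the reported accuracy depends on the queue, the failure type, the preprocessing, and the input variables, the window length, the horizon, and the selection of variables would presumably need to be tuned per queue.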
Copyright information
© 2010 Springer US
About this paper
Cite this paper
Andrzejak, A., Zeinalipour-Yazti, D., Dikaiakos, M.D. (2010). Improving the Dependability of Grids via Short-Term Failure Predictions. In: Desprez, F., Getov, V., Priol, T., Yahyapour, R. (eds) Grids, P2P and Services Computing. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-6794-7_3
DOI: https://doi.org/10.1007/978-1-4419-6794-7_3
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-6793-0
Online ISBN: 978-1-4419-6794-7
eBook Packages: Computer Science, Computer Science (R0)