Predicting Job Failures in AuverGrid Based on Workload Log Analysis
Grid systems are widely used today because they can solve large problems in business and science. Job failures, which are inherent in any computational environment, are more common in grids because of their dynamic and complex nature. Moreover, traditional methods of job failure recovery have proven costly, so such systems need to shift toward proactive, predictive management strategies. In this paper, we make a novel effort to predict the outcome of jobs in a production grid environment. First, we investigated the relationship between workload characteristics and job failures by analyzing workload traces of AuverGrid, which is part of the EGEE (Enabling Grids for E-science) project. After identifying failure patterns, we predicted the success or failure status of jobs over 6 months of AuverGrid activity with approximately 96% accuracy. Integrating the results of this work into management services such as scheduling and monitoring can improve the quality of service on the grid.
Keywords: Job Failure Prediction, Grid Workload Archive, Trace Analysis, Bayesian Networks