New Generation Computing, Volume 30, Issue 1, pp 73–94

Predicting Job Failures in AuverGrid Based on Workload Log Analysis

  • Hamid Saadatfar
  • Hamid Fadishei
  • Hossein Deldari


Grid systems are popular today due to their ability to solve large problems in business and science. Job failures, which are inherent in any computational environment, are more common in grids due to their dynamic and complex nature. Furthermore, traditional methods for job failure recovery have proven costly, so such systems need to shift toward proactive and predictive management strategies. In this paper, we attempt to predict the outcome of jobs in a production grid environment. First, we investigated the relationship between workload characteristics and job failures by analyzing workload traces of AuverGrid, which is part of the EGEE (Enabling Grids for E-science) project. After identifying failure patterns, we predicted the success or failure status of jobs during 6 months of AuverGrid activity with approximately 96% accuracy. The quality of service on the grid can be improved by integrating the results of this work into management services such as scheduling and monitoring.


Job Failure Prediction, Grid Workload Archive, Trace Analysis, Bayesian Networks





Copyright information

© Ohmsha and Springer Japan jointly hold copyright of the journal. 2012

Authors and Affiliations

  • Hamid Saadatfar (1)
  • Hamid Fadishei (1)
  • Hossein Deldari (1)

  1. Parallel & Distributed Processing Lab, Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran
