Abstract
Supercomputers are a fundamental tool for developing our understanding of the universe. State-of-the-art scientific simulations, big data analyses, and machine learning workloads require high-performance computing platforms. Such infrastructures have grown recently with the addition of thousands of newly designed components, calling their resiliency into question. It is crucial to solidify our knowledge of how supercomputers fail. Recent studies have highlighted the importance of characterizing failures on supercomputers. This paper aims to model the component failures of a supercomputer using mixed Weibull distributions. The model is built using a real-life multi-year failure record from a leadership-class supercomputer. Using several key observations from the data, we designed an analytical model that is robust enough to represent each of the main components of a supercomputer, yet flexible enough to accommodate changes in the composition of the machine and to predict the resilience of future or hypothetical systems.
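For readers unfamiliar with the term, a mixed Weibull distribution is a weighted sum of Weibull distributions. The following is a minimal sketch of its survival (reliability) function, assuming the standard two-parameter Weibull form; the function names and the numeric parameters are hypothetical illustrations and are not the fitted values or the implementation from the paper.

```python
import math


def weibull_survival(t, shape, scale):
    """Survival function of a two-parameter Weibull: exp(-(t / scale) ** shape)."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return math.exp(-((t / scale) ** shape))


def mixed_weibull_survival(t, components):
    """Survival function of a Weibull mixture.

    components is a list of (weight, shape, scale) tuples whose weights sum to 1.
    """
    total_weight = sum(w for w, _, _ in components)
    if not math.isclose(total_weight, 1.0):
        raise ValueError("mixture weights must sum to 1")
    return sum(w * weibull_survival(t, k, lam) for w, k, lam in components)


# Hypothetical parameters (time in hours), chosen only for illustration.
example = [(0.3, 0.7, 10.0), (0.7, 1.5, 100.0)]
print(mixed_weibull_survival(24.0, example))
```

A mixture of this kind can capture failure data that combines, for instance, a population of early failures (shape below 1) with a population of wear-out failures (shape above 1), which a single Weibull distribution cannot represent.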
Notice: This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Acknowledgment
This research was partially supported by a machine allocation on the Kabré supercomputer at the Costa Rica National High Technology Center. Early versions of this manuscript received valuable comments from Prof. Marcela Alfaro-Cordoba at the University of Costa Rica.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Rojas, E., Meneses, E., Jones, T., Maxwell, D. (2020). Towards a Model to Estimate the Reliability of Large-Scale Hybrid Supercomputers. In: Malawski, M., Rzadca, K. (eds) Euro-Par 2020: Parallel Processing. Euro-Par 2020. Lecture Notes in Computer Science, vol 12247. Springer, Cham. https://doi.org/10.1007/978-3-030-57675-2_3
DOI: https://doi.org/10.1007/978-3-030-57675-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57674-5
Online ISBN: 978-3-030-57675-2
eBook Packages: Computer Science, Computer Science (R0)