Towards Self-optimization in HPC I/O
Performance analysis and optimization of high-performance I/O systems are daunting tasks, mainly because of the overwhelmingly complex interplay of internal processes during the execution of application programs. Unfortunately, monitoring tools that reduce this complexity to a manageable level are lacking. For these reasons, the project Scalable I/O for Extreme Performance (SIOX) aims to provide a versatile environment for recording system activities and learning from this information. While still under development, SIOX will ultimately assist in locating and diagnosing performance problems and will automatically suggest and apply performance optimizations.
The SIOX knowledge path is concerned with the analysis and utilization of data describing the cause-and-effect chain recorded via the monitoring path. In this paper, we present our refined modular design of the knowledge path. This includes a description of logical components and their interfaces, details about extracting, storing and retrieving abstract activity patterns, a concept for tying knowledge to these patterns, and the integration of machine learning. Each of these tasks is illustrated through examples. The feasibility of our design is further demonstrated with an internal component for anomaly detection, permitting intelligent monitoring to limit the SIOX system’s impact on system resources.
Keywords: Parallel I/O, Machine Learning, Self-Optimization