Abstract
An analysis of real-world operational data from the Tianhe-1A (TH-1A) supercomputer system shows that chilled water data not only reflect the status of the chiller system but are also related to supercomputer load. This study proposes AquaSee, a method that predicts the load and cooling system faults of supercomputers from chilled water pressure and temperature data. The method is validated on real-world operational data from the TH-1A supercomputer system at the National Supercomputer Center in Tianjin. Prediction models are built from datasets of various compositions and with different prediction sequence lengths. Experimental results show that a model using both pressure and temperature data performs better than one using either pressure or temperature data alone, and that the best inference sequence length is two points. Furthermore, an anomaly monitoring system based on chilled water data is set up to help engineers detect chiller system anomalies.
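The idea of feeding a short window of combined pressure and temperature readings into a load predictor, then watching the prediction residual for anomalies, can be illustrated with a minimal sketch. The abstract does not specify the model, so an ordinary least-squares fit is swapped in here as a stand-in. The function names (`make_windows`, `fit_linear`, `predict`, `flag_anomalies`), the residual threshold `k`, and the synthetic data are all assumptions made for illustration. They are not the authors' implementation or TH-1A measurements.

```python
import numpy as np

def make_windows(pressure, temperature, load, seq_len=2):
    """Stack the last `seq_len` pressure/temperature readings into one feature row."""
    feats = np.column_stack([pressure, temperature])           # shape (T, 2)
    X = np.stack([feats[i:i + seq_len].ravel()
                  for i in range(len(feats) - seq_len + 1)])    # shape (T-seq_len+1, 2*seq_len)
    y = load[seq_len - 1:]                                     # load at the last step of each window
    return X, y

def fit_linear(X, y):
    """Least-squares fit with a bias column (a stand-in, not the paper's model)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply the fitted coefficients (including bias) to feature rows."""
    return np.column_stack([X, np.ones(len(X))]) @ coef

def flag_anomalies(y_true, y_pred, k=3.0):
    """Mark points whose residual deviates by more than k residual standard deviations."""
    resid = y_true - y_pred
    return np.abs(resid - resid.mean()) > k * resid.std()

# Synthetic demo data, invented for illustration only.
rng = np.random.default_rng(0)
T = 200
load = rng.uniform(0.2, 0.9, size=T)
temperature = 12.0 + 5.0 * load + rng.normal(0, 0.05, size=T)
pressure = 0.40 - 0.10 * load + rng.normal(0, 0.002, size=T)

# Sequence length of two points, matching the best length reported in the abstract.
X, y = make_windows(pressure, temperature, load, seq_len=2)
coef = fit_linear(X, y)
y_hat = predict(coef, X)
flags = flag_anomalies(y, y_hat)
```

Dropping the pressure or the temperature column from `feats` would give the single-signal variants that the abstract compares against the combined input.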
Cite this article
Li, YQ., Xiao, LQ., Feng, JH. et al. AquaSee: Predict Load and Cooling System Faults of Supercomputers Using Chilled Water Data. J. Comput. Sci. Technol. 35, 221–230 (2020). https://doi.org/10.1007/s11390-019-1951-7
DOI: https://doi.org/10.1007/s11390-019-1951-7