An Integrated Data Preprocessing Framework Based on Apache Spark for Fault Diagnosis of Power Grid Equipment
- 770 Downloads
Big data techniques have been applied to power grid for the prediction and evaluation of grid conditions. However, the raw data quality can rarely meet the requirement of precise data analytics since raw data set usually contains samples with missing data to which the common data mining models are sensitive. Besides, the raw training data from a single monitoring system, e.g. dissolved gas analysis (DGA), are rarely sufficient for training in the form of valid instances since raw data set usually contains samples with noisy data. Though classic methods like neural network can be used to fill the gaps of missing data and classify the fault type, their models often fail to fit the rules of power grid conditions. This paper presents an integrated data preprocessing framework (DPF) based on Apache Spark to improve the prediction accuracy for data sets with missing data points and classification accuracy with noise data as well as to meet the big data requirement, which mainly combines missing data prediction, data fusion, data cleansing and fault type classification. First, the prediction model is trained based on the linear regression (LinR). Afterwards, we propose an optimized linear method (OLR) to improve the prediction accuracy. Then, to better utilize the strong correlation among different data sources, new data features extracted by persons correlation coefficient (PCC) are fused into a training data set. Next, principal component analysis (PCA) is taken to reduce the side effect brought by the new feature as well as retaining significant information for classification. Finally, the classification model based on logistic regression (LogR) and support vector machine (SVM) is trained to classify the fault type of electric equipment. We test the DPF framework on missing data prediction and fault type classification of power transformers in power grid system. The experimental results show that the predictors based on the proposed framework achieve lower mean square error and the classifiers obtain higher accuracy than traditional ones. Besides, the training time required for training large-scale data shows a decreasing trend. Therefore, the data preprocessing framework DPF would be a good candidate to predict the missing data and classify the fault type in power grid system.
KeywordsBig data Apache spark Framework Missing data prediction Fault diagnose
This paper is sponsored in part by the National High Technology and Research Development Program of China (863 Program, 2015AA050204), State Grid Science and Technology Project (520626140020, 14H100000552, SGCQDK00PJJS1400020), State Grid Corporation of China, the National Research Foundation Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) program, and the National Natural Science Foundation of China (No.61373032).
- 2.Niu, J., Gao, Y., Qiu, M., & Ming, Z. (2012). Selecting proper wireless network interfaces for user experience enhancement with guaranteed probability. Journal of Parallel and Distributed Computing, 72(12), 1565–1575. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0743731512002134.CrossRefGoogle Scholar
- 3.Li, Y., Dai, W., Ming, Z., & Qiu, M. (2015). Privacy protection for preventing data over-collection in smart city. IEEE Transactions on Computers, PP(99), 1–1.Google Scholar
- 6.Davis, J.J., & Clark, A.J. (2011). Data preprocessing for anomaly based network intrusion detection: A review. Computers & Security, 30(6–7), 353–375. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167404811000691.CrossRefGoogle Scholar
- 8.Qiu, M., Ming, Z., Li, J., Liu, J., Quan, G., & Zhu, Y. (2013). Informer homed routing fault tolerance mechanism for wireless sensor networks. Journal of Systems Architecture, 59(4–5), 260–270. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1383762113000040.CrossRefGoogle Scholar
- 9.Ma, H., King, I., & Lyu, M.R. (2007). Effective missing data prediction for collaborative filtering. In Inproceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 39–46). Amsterdam: ACM.Google Scholar
- 10.Nogueira, R., Vieira, S., & Sousa, J. (2005). The prediction of bankruptcy using fuzzy classifiers. In 2005 ICSC Congress on Computational Intelligence Methods and Applications (p. 6).Google Scholar
- 11.Lei, K.S., & Wan, F. (2010). Pre-processing for missing data: A hybrid approach to air pollution prediction in macau. In 2010 IEEE International Conference on Automation and Logistics (ICAL), (Vol. 16–20 pp. 418–422).Google Scholar
- 12.Tian, F., Sun, J., & Shao, S. (2013). Wavelet threshold de-noising applications in avionics test data processing. In 2013 Third International Conference on Instrumentation, Measurement, Computer, Communication and Control (IMCCC), (Vol. 21–23, pp. 667– 671).Google Scholar
- 13.Wei, X., Xiao, B., Zhang, Q., & Liu, R. (2011). A rigid structure matching-based noise data processing approach for human motion capture. In 2011 Workshop on Digital Media and Digital Content Management (DMDCM) (Vol. 15–16 pp. 91–96).Google Scholar
- 14.da Silva, I., & Adeodato, P. (2011). Pca and gaussian noise in mlp neural network training improve generalization in problems with small and unbalanced data sets. In The 2011 International Joint Conference on Neural Networks (IJCNN) (pp. 2664–2669).Google Scholar
- 16.Atasu, K. (2015). Feature-rich regular expression matching accelerator for text analytics. Journal of Signal Processing Systems (JSPS), 1–17. [Online]. Available. doi: 10.1007/s11265-015-1052-y.
- 17.Karthikeyan, P., Amudhavel, J., Abraham, A., Sathian, D., Raghav, R.S., & Dhavachelvan, P. (2015). A comprehensive survey on variants and its extensions of big data in cloud environment. In Proceedings of the 2015 International Conference on Advanced Research in Computer Science Engineering and Technology (ICARCSET 2015) (pp. 1–5). Unnao: ACM.Google Scholar
- 19.Shi, W., Zhu, Y., Zhang, J., Tao, X., Sheng, G., Lian, Y., Wang, G., & Chen, Y. (2015). Improving power grid monitoring data quality: An efficient machine learning framework for missing data prediction. In IEEE 17th International Conference on High Performance Computing and Communications, 2015 (pp. 417–422). IEEE Computer Society.Google Scholar
- 20.Zhang, J., Zhu, Y., Shi, W., Sheng, G., & Chen, Y. (2015). An improved machine learning scheme for data-driven fault diagnosis of power grid equipment. In The 2015 IEEE International Symposium on Smart Data (pp. 1737–1742). IEEE Computer Society.Google Scholar
- 22.Hong, S.T., & Chang, J.W. (2011). A new data filtering scheme based on statistical data analysis for monitoring systems in wireless sensor networks. In Proceedings of the 2011 IEEE International Conference on High Performance Computing and Communications, (pp. 635–640). IEEE Computer Society.Google Scholar
- 23.Grunwald, P. (2007). Linear regression. In The Minimum Description Length Principle (pp. 335–368). MIT Press. [Online]. Available: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6282057.
- 26.Abe, S. (2003). Analysis of multiclass support vector machines. Thyroid, 21(3), 3772.Google Scholar
- 27.Lin, C.-Y., Tsai, C.-H., Lee, C.-P., & Lin, C.-J. (2014). Large-scale logistic regression and linear support vector machines using spark. In IEEE International Conference on Big Data (Big Data), 2014 (pp. 519–528): IEEE.Google Scholar
- 28.Solaimani, M., Iftekhar, M., Khan, L., Thuraisingham, B., & Ingram, J.B. (2014). Spark-based anomaly detection over multi-source vmware performance data in real-time. In IEEE Symposium on Computational Intelligence in Cyber Security (CICS), 2014 (pp. 1–8). IEEE.Google Scholar
- 29.Harnie, D., Vapirev, A.E., Wegner, J.K., Gedich, A., Steijaert, M., Wuyts, R., & De Meuter, W. (2015). Scaling machine learning for target prediction in drug discovery using apache spark. In Proceedings of the 15th IEEE/ACM International Symposium on Cluster Cloud and Grid Computing.Google Scholar
- 32.Jolliffe, I. (2014). Principal component analysis. In Wiley StatsRef: Statistics Reference Online (pp. –): Wiley. [Online]. Available. doi: 10.1002/9781118445112.stat06472 .
- 33.Sun, G., Wang, Z., & Wang, M. (2008). A new multi-classification method based on binary tree support vector machine. In 3rd International Conference on Innovative Computing Information and Control, 2008. ICICIC ’08 (p. 77).Google Scholar
- 34.Dorffner, G. (1996). Neural networks for time series processing. Neural Network World, 6, 447–468.Google Scholar