Journal of Signal Processing Systems, Volume 86, Issue 2–3, pp 221–236

An Integrated Data Preprocessing Framework Based on Apache Spark for Fault Diagnosis of Power Grid Equipment

  • Weiwei Shi
  • Yongxin Zhu
  • Tian Huang
  • Gehao Sheng
  • Yong Lian
  • Guoxing Wang
  • Yufeng Chen


Big data techniques have been applied to power grids for the prediction and evaluation of grid conditions. However, raw data quality rarely meets the requirements of precise data analytics, since raw data sets usually contain samples with missing data, to which common data mining models are sensitive. Moreover, the raw training data from a single monitoring system, e.g. dissolved gas analysis (DGA), rarely provide enough valid instances for training, since raw data sets usually also contain noisy samples. Although classic methods such as neural networks can be used to fill the gaps left by missing data and to classify the fault type, their models often fail to fit the rules of power grid conditions. This paper presents an integrated data preprocessing framework (DPF) based on Apache Spark that combines missing data prediction, data fusion, data cleansing and fault type classification. It aims to improve prediction accuracy for data sets with missing data points and classification accuracy for noisy data, while also meeting big data requirements. First, the prediction model is trained based on linear regression (LinR). Afterwards, we propose an optimized linear method (OLR) to improve the prediction accuracy. Then, to better utilize the strong correlation among different data sources, new data features extracted by the Pearson correlation coefficient (PCC) are fused into the training data set. Next, principal component analysis (PCA) is applied to reduce the side effects introduced by the new features while retaining the information that is significant for classification. Finally, classification models based on logistic regression (LogR) and support vector machine (SVM) are trained to classify the fault type of electric equipment. We test the DPF on missing data prediction and fault type classification for power transformers in a power grid system.
The experimental results show that the predictors based on the proposed framework achieve lower mean square error and the classifiers obtain higher accuracy than traditional ones. In addition, the training time required for large-scale data shows a decreasing trend. Therefore, the DPF is a good candidate for predicting missing data and classifying fault types in power grid systems.
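
The preprocessing stages named in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Spark implementation: it uses NumPy on synthetic data, ordinary least squares stands in for LinR (the OLR variant is not specified in the abstract and is not reproduced), and every function name, the 0.5 correlation threshold and the number of retained components are assumptions chosen for illustration only.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply coefficients from fit_linear to new rows."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def impute_column(data, col):
    """Fill NaNs in one column by regressing it on the remaining columns."""
    filled = data.copy()
    missing = np.isnan(data[:, col])
    others = np.delete(data, col, axis=1)
    coef = fit_linear(others[~missing], data[~missing, col])
    filled[missing, col] = predict_linear(coef, others[missing])
    return filled

def select_by_pcc(candidates, target, threshold=0.5):
    """Indices of candidate columns whose |Pearson r| with target reaches threshold."""
    r = np.array([np.corrcoef(candidates[:, j], target)[0, 1]
                  for j in range(candidates.shape[1])])
    return np.flatnonzero(np.abs(r) >= threshold)

def pca(X, k):
    """Project mean-centered X onto its top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

# Synthetic stand-in for one monitoring source (hypothetical, not DGA data):
# the last column is a noisy linear function of the first three.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=(n, 3))
dependent = base @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=n)
primary = np.column_stack([base, dependent])

# Hide ten values, then predict them from the other columns.
truth = primary[:10, 3].copy()
primary[:10, 3] = np.nan
filled = impute_column(primary, 3)
mse = np.mean((filled[:10, 3] - truth) ** 2)

# Second synthetic source: one column tracks the target, one is unrelated.
secondary = np.column_stack([
    dependent + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])
keep = select_by_pcc(secondary, filled[:, 3])

# Fuse the selected features, then reduce dimensionality with PCA.
fused = np.column_stack([filled, secondary[:, keep]])
reduced = pca(fused, k=3)

print("imputation MSE:", mse)
print("selected secondary columns:", keep.tolist())
print("reduced shape:", reduced.shape)
```

In this sketch the reduced matrix would then be passed to a classifier such as logistic regression or an SVM; that step is left out to keep the example short.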


Big data · Apache Spark · Framework · Missing data prediction · Fault diagnosis



This paper is sponsored in part by the National High Technology and Research Development Program of China (863 Program, 2015AA050204), State Grid Science and Technology Project (520626140020, 14H100000552, SGCQDK00PJJS1400020), State Grid Corporation of China, the National Research Foundation Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) program, and the National Natural Science Foundation of China (No.61373032).



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
  2. Electric Power Research Institute of Shandong Power Supply Company of State Grid, Shandong, China
