HATDC: A Holistic Approach for Time Series Data Repairing

  • Xiaojie Liu
  • Guangxuan Song
  • Xiaoling Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11440)

Abstract

Time series data is prevalent in real life, and time series data mining is an active research topic. However, real data sets often contain many anomalous points caused by sensor errors, which makes data mining more difficult. To improve the quality of data mining, the data needs to be repaired before analysis. Most existing repairing methods use smoothing-based or constraint-based techniques, but they consider only a few adjacent points and ignore global, holistic information. In this paper, we propose a novel time series data repairing algorithm, named HATDC, that exploits the holistic information of the time series. First, we use speed constraints and the probability distribution of change rates to detect dirty data points. Then, dynamic time warping (DTW) is applied as the distance measure to find similar subsequences in the series, and we estimate the values of the abnormal data points from the selected similar subsequences, taking the whole series into account. In addition, we propose an improved algorithm based on incremental clustering that reduces the time cost. Experiments on several real datasets demonstrate that HATDC achieves significantly higher repair accuracy and lower RMS error than other methods.
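The two building blocks named in the abstract, speed constraints and the DTW distance, can be illustrated with a minimal Python sketch. This is not the HATDC implementation: the function names, the bounds `s_min` and `s_max`, and the absolute-difference local cost are assumptions chosen for illustration, and the abstract's additional use of the probability distribution of change rates is not modelled here.

```python
from math import inf


def speed_violations(times, values, s_min, s_max):
    """Return indices i whose change rate from point i-1 lies outside [s_min, s_max].

    Hypothetical helper: a plain pairwise speed-constraint check, not the
    paper's full detection step.
    """
    flagged = []
    for i in range(1, len(values)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        speed = (values[i] - values[i - 1]) / dt
        if speed < s_min or speed > s_max:
            flagged.append(i)
    return flagged


def dtw_distance(a, b):
    """Classic dynamic-programming DTW with |x - y| as the local cost.

    Runs in O(len(a) * len(b)) time and memory, with no warping window.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] also aligned with an earlier b
                cost[i][j - 1],      # b[j-1] also aligned with an earlier a
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]


# A single spike at index 2 violates the constraint on the way up and on the
# way down, so the pairwise check also reports the clean point at index 3.
suspects = speed_violations([0, 1, 2, 3, 4], [1.0, 1.2, 9.0, 1.4, 1.5], -1.0, 1.0)

# DTW tolerates local stretching: repeating a value adds no cost.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

In this toy input the pairwise check flags both the spike and its clean successor, which shows why a detection step based only on adjacent points is ambiguous and why further evidence, such as the distribution of change rates or similar subsequences elsewhere in the series, is useful for deciding which point is actually dirty.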

Keywords

Data repairing, Time series, Anomaly detection, DTW

Notes

Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2017YFC 0803700), NSFC grants (No. 61532021 and 61472141), the Shanghai Knowledge Service Platform Project (No. ZF1213), and SHEITC.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. East China Normal University, Shanghai, China
