Scalable recovery of missing blocks in time series with high and low cross-correlations

  • Mourad Khayati
  • Philippe Cudré-Mauroux
  • Michael H. Böhlen
Regular Paper

Abstract

Missing values are very common in real-world data, including time-series data. Failures in power, communication, or storage can leave occasional blocks of data missing in multiple series, which affects real-time monitoring and compromises the quality of data analysis. Traditional recovery (imputation) techniques often leverage the correlation across time series to recover missing blocks in multiple series. These techniques, however, assume high correlation and fall short when the series exhibit variations in correlation. In this paper, we introduce a novel approach called CDRec to recover large missing blocks in time series with high and low correlations. CDRec relies on the centroid decomposition (CD) technique to recover multiple time series at a time. We also propose and analyze a new algorithm called Incremental Scalable Sign Vector to efficiently compute CD in long time series. We empirically evaluate the accuracy and the efficiency of our recovery technique on several real-world datasets that represent a broad range of applications. The results show that our recovery is orders of magnitude faster than the most accurate algorithm while producing superior recovery results.
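
To make the idea of recovering missing blocks with a truncated centroid decomposition more concrete, the following is a minimal Python sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`sign_vector`, `centroid_decomposition`, `recover`), the column-mean initialisation of missing values, the stopping tolerance, and the greedy local search for the sign vector, which stands in for the Incremental Scalable Sign Vector algorithm that the abstract names but does not describe.

```python
import numpy as np


def sign_vector(X):
    """Greedy local search for z in {-1, +1}^n that increases ||X^T z||.

    Illustrative stand-in only; it is NOT the paper's Incremental
    Scalable Sign Vector algorithm.
    """
    n = X.shape[0]
    z = np.ones(n)
    G = X @ X.T                  # n x n Gram matrix (costly for long series)
    d = np.diag(G)
    for _ in range(10 * n):      # safety cap on the number of flips
        v = G @ z - d * z        # v_i = sum over j != i of G_ij * z_j
        gain = -z * v            # flipping z_i changes z^T G z by 4 * gain_i
        i = int(np.argmax(gain))
        if gain[i] <= 1e-12:     # no flip improves the objective
            break
        z[i] = -z[i]
    return z


def centroid_decomposition(X, k=None):
    """Return L (n x k) and R (m x k) such that X is approximately L @ R.T."""
    n, m = X.shape
    k = m if k is None else k
    L = np.zeros((n, k))
    R = np.zeros((m, k))
    residual = X.astype(float).copy()
    for j in range(k):
        z = sign_vector(residual)
        c = residual.T @ z
        norm = np.linalg.norm(c)
        if norm < 1e-12:         # nothing left to decompose
            break
        R[:, j] = c / norm                       # unit-length centroid vector
        L[:, j] = residual @ R[:, j]             # matching loading vector
        residual -= np.outer(L[:, j], R[:, j])   # deflate
    return L, R


def recover(X, k=1, tol=1e-8, max_iter=200):
    """Fill NaN entries by iterating a rank-k truncated decomposition."""
    X = X.astype(float).copy()
    missing = np.isnan(X)
    if not missing.any():
        return X
    # Assumption: initialise each missing value with its column mean.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])
    for _ in range(max_iter):
        L, R = centroid_decomposition(X, k)
        approx = L @ R.T
        delta = np.max(np.abs(approx[missing] - X[missing]))
        X[missing] = approx[missing]   # observed values are never changed
        if delta < tol:
            break
    return X


# Usage: three series driven by one signal, one of them negatively
# correlated with the others, and a missing block in the first series.
t = np.linspace(0.0, 4.0 * np.pi, 60)
truth = np.outer(np.sin(t), [1.0, 2.0, -1.0])
data = truth.copy()
data[20:30, 0] = np.nan
recovered = recover(data, k=1)
```

The truncation rank `k` is the main design choice in this sketch: a small `k` keeps only the dominant shared structure, which is what lets the observed series constrain the missing block. How CDRec itself chooses the rank and initialises missing values is not stated in the abstract.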

Keywords

Recovery of missing blocks · Time series · Centroid decomposition · Correlation

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  • Mourad Khayati (1), corresponding author
  • Philippe Cudré-Mauroux (1)
  • Michael H. Böhlen (2)
  1. eXascale Infolab, University of Fribourg, Fribourg, Switzerland
  2. Department of Informatics, University of Zurich, Zurich, Switzerland
