Scalable recovery of missing blocks in time series with high and low cross-correlations

Abstract

Missing values are common in real-world data, including time series. Failures in power, communication, or storage can leave occasional blocks of data missing in multiple series, which hampers real-time monitoring and compromises the quality of data analysis. Traditional recovery (imputation) techniques often leverage the correlation across time series to recover missing blocks in multiple series. These techniques, however, assume high correlation and fall short when the correlation across the series varies. In this paper, we introduce CDRec, a novel approach that recovers large missing blocks in time series with high and low correlations. CDRec relies on the centroid decomposition (CD) technique to recover multiple time series at a time. We also propose and analyze a new algorithm, Incremental Scalable Sign Vector, to efficiently compute CD in long time series. We empirically evaluate the accuracy and efficiency of our recovery technique on several real-world datasets that represent a broad range of applications. The results show that our recovery is orders of magnitude faster than the most accurate algorithm while producing superior recovery results.
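To make the idea of decomposition-based recovery more concrete, the following Python sketch shows one way a truncated centroid decomposition could be used to fill missing blocks. It is an illustration under stated assumptions, not the authors' CDRec implementation: the function names (`sign_vector`, `centroid_decomposition`, `recover_missing`) are hypothetical, the sign vector is found with a simple greedy search rather than the Incremental Scalable Sign Vector algorithm (whose details are not given on this page), and the column-mean initialization and stopping rule are assumed choices.

```python
import numpy as np


def sign_vector(X):
    """Greedy search for z in {-1, +1}^n that locally maximizes ||X^T z||.

    Simplified stand-in (assumption), not the paper's Incremental
    Scalable Sign Vector algorithm.
    """
    n = X.shape[0]
    z = np.ones(n)
    row_norms = np.sum(X * X, axis=1)
    for _ in range(10 * n):  # hard cap on the number of sign flips
        s = X.T @ z
        # Flipping z_i changes ||X^T z||^2 by -4 z_i (x_i . s) + 4 ||x_i||^2.
        gain = -4.0 * z * (X @ s) + 4.0 * row_norms
        i = int(np.argmax(gain))
        if gain[i] <= 1e-12:  # no single flip improves the norm
            break
        z[i] = -z[i]
    return z


def centroid_decomposition(X, k):
    """Rank-k centroid decomposition: X is approximated by L @ R.T."""
    X = np.array(X, dtype=float)  # work on a copy
    n, m = X.shape
    L = np.zeros((n, k))
    R = np.zeros((m, k))
    for j in range(k):
        z = sign_vector(X)
        c = X.T @ z  # centroid vector
        norm = np.linalg.norm(c)
        if norm == 0.0:
            break
        R[:, j] = c / norm
        L[:, j] = X @ R[:, j]
        X -= np.outer(L[:, j], R[:, j])  # deflate before the next component
    return L, R


def recover_missing(X, k=1, tol=1e-6, max_iter=100):
    """Fill NaN entries by iterating a rank-k CD reconstruction.

    Rows are time points, columns are series. Observed values are
    never modified. Initialization and stopping rule are assumptions.
    """
    X = np.array(X, dtype=float)
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[mask] = col_means[np.where(mask)[1]]  # initial guess: column means
    for _ in range(max_iter):
        L, R = centroid_decomposition(X, k)
        X_new = X.copy()
        X_new[mask] = (L @ R.T)[mask]  # update only the missing entries
        converged = np.linalg.norm(X_new - X) < tol
        X = X_new
        if converged:
            break
    return X
```

Calling `recover_missing(X, k=1)` on a matrix whose columns are time series with `np.nan` in the missing blocks returns a completed matrix of the same shape. The truncation rank `k` controls how much cross-series structure is retained and would need to be tuned per dataset.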

Notes

  1. We consider time series with equally spaced granularity.

  2. Source code and datasets are available online: https://github.com/eXascaleInfolab/2019_kais-bench.git.

  3. https://cran.r-project.org/web/views/MissingData.html.

  4. https://en.wikipedia.org/wiki/Hankel_matrix.

  5. https://github.com/eXascaleInfolab/2019_kais-bench.git.

Author information

Correspondence to Mourad Khayati.

Additional information

Mourad Khayati received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 732328 (FashionBrain). Philippe Cudré-Mauroux received funding from the European Research Council (ERC) under the European Union Horizon 2020 Research and Innovation Programme (Grant Agreement 683253/Graphint).

About this article

Cite this article

Khayati, M., Cudré-Mauroux, P. & Böhlen, M.H. Scalable recovery of missing blocks in time series with high and low cross-correlations. Knowl Inf Syst (2019). https://doi.org/10.1007/s10115-019-01421-7

Keywords

  • Recovery of missing blocks
  • Time series
  • Centroid decomposition
  • Correlation