Missing values are common in real-world data, including time series. Failures in power, communication, or storage can leave occasional blocks of data missing in multiple series, which affects real-time monitoring and compromises the quality of data analysis. Traditional recovery (imputation) techniques often leverage the correlation across time series to recover missing blocks in multiple series. These techniques, however, assume high correlation and fall short when the series exhibit variations in correlation. In this paper, we introduce a novel approach called CDRec to recover large missing blocks in time series with high and low correlations. CDRec relies on the centroid decomposition (CD) technique to recover multiple time series at a time. We also propose and analyze a new algorithm called Incremental Scalable Sign Vector to efficiently compute CD in long time series. We empirically evaluate the accuracy and the efficiency of our recovery technique on several real-world datasets that represent a broad range of applications. The results show that our recovery is orders of magnitude faster than the most accurate algorithm while producing superior results in terms of recovery accuracy.
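To make the idea of decomposition-based recovery concrete, the following is a minimal sketch of iterative recovery with a truncated centroid decomposition. It is not the CDRec implementation: the function names (`sign_vector`, `centroid_decomposition`, `recover`), the use of `None` to mark missing values, the column-mean initialization, the tolerances, and the simple greedy sign search (a stand-in for the Incremental Scalable Sign Vector algorithm named above) are all assumptions made for illustration. Rows are time points and columns are series.

```python
import math

def sign_vector(X, max_sweeps=100):
    """Greedy search for z in {-1, +1}^n that locally maximizes ||X^T z||.
    Illustrative stand-in, not the paper's sign vector algorithm."""
    n, m = len(X), len(X[0])
    z = [1.0] * n
    v = [sum(X[i][j] for i in range(n)) for j in range(m)]  # v = X^T z
    for _ in range(max_sweeps):
        flipped = False
        for i in range(n):
            xi = X[i]
            # Flipping z[i] turns v into v - 2*z[i]*xi; gain is the change in ||v||^2.
            gain = 4.0 * (sum(a * a for a in xi)
                          - z[i] * sum(a * b for a, b in zip(xi, v)))
            if gain > 1e-12:
                v = [b - 2.0 * z[i] * a for a, b in zip(xi, v)]
                z[i] = -z[i]
                flipped = True
        if not flipped:
            break
    return z

def centroid_decomposition(X, k):
    """Return lists L, R of up to k vectors such that X ~ sum_t L[t] R[t]^T."""
    n, m = len(X), len(X[0])
    res = [row[:] for row in X]  # residual matrix, deflated at each step
    L, R = [], []
    for _ in range(k):
        z = sign_vector(res)
        c = [sum(z[i] * res[i][j] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(a * a for a in c))
        if norm == 0.0:
            break
        r = [a / norm for a in c]                                    # unit-norm vector
        l = [sum(res[i][j] * r[j] for j in range(m)) for i in range(n)]
        for i in range(n):
            for j in range(m):
                res[i][j] -= l[i] * r[j]
        L.append(l)
        R.append(r)
    return L, R

def recover(X, k=1, tol=1e-9, max_iter=1000):
    """Fill None entries by repeatedly projecting onto a rank-k CD approximation."""
    n, m = len(X), len(X[0])
    missing = [(i, j) for i in range(n) for j in range(m) if X[i][j] is None]
    filled = [[0.0 if x is None else float(x) for x in row] for row in X]
    for j in range(m):
        seen = [X[i][j] for i in range(n) if X[i][j] is not None]
        mean = sum(seen) / len(seen)  # assumes each series has an observed value
        for i in range(n):
            if X[i][j] is None:
                filled[i][j] = mean   # assumed initialization, chosen for brevity
    for _ in range(max_iter):
        L, R = centroid_decomposition(filled, k)
        change = 0.0
        for i, j in missing:          # only missing entries are ever overwritten
            new = sum(l[i] * r[j] for l, r in zip(L, R))
            change += (new - filled[i][j]) ** 2
            filled[i][j] = new
        if math.sqrt(change) < tol:
            break
    return filled
```

Calling `recover(X, k=1)` on a matrix whose second series is a multiple of the first, with one value set to `None`, should drive the missing value toward the one consistent with the observed ratio. The truncation rank `k` controls how much of the cross-series structure is used, and observed values are never modified.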
We consider time series with equally spaced granularity.
Source code and datasets are available online: https://github.com/eXascaleInfolab/2019_kais-bench.git.
Mourad Khayati received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 732328 (FashionBrain). Philippe Cudré-Mauroux received funding from the European Research Council (ERC) under the European Union Horizon 2020 Research and Innovation Programme (Grant Agreement 683253/Graphint).
Cite this article
Khayati, M., Cudré-Mauroux, P. & Böhlen, M.H. Scalable recovery of missing blocks in time series with high and low cross-correlations. Knowl Inf Syst 62, 2257–2280 (2020). https://doi.org/10.1007/s10115-019-01421-7
- Recovery of missing blocks
- Time series
- Centroid decomposition