Scalable recovery of missing blocks in time series with high and low cross-correlations
Abstract
Missing values are very common in real-world data including time-series data. Failures in power, communication or storage can leave occasional blocks of data missing in multiple series, affecting not only real-time monitoring but also compromising the quality of data analysis. Traditional recovery (imputation) techniques often leverage the correlation across time series to recover missing blocks in multiple series. These recovery techniques, however, assume high correlation and fall short in recovering missing blocks when the series exhibit variations in correlation. In this paper, we introduce a novel approach called CDRec to recover large missing blocks in time series with high and low correlations. CDRec relies on the centroid decomposition (CD) technique to recover multiple time series at a time. We also propose and analyze a new algorithm called Incremental Scalable Sign Vector to efficiently compute CD in long time series. We empirically evaluate the accuracy and the efficiency of our recovery technique on several real-world datasets that represent a broad range of applications. The results show that our recovery is orders of magnitude faster than the most accurate algorithm while producing superior results in terms of recovery.
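To make the idea of recovering several series at once through a centroid decomposition more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' CDRec implementation, and the sign-vector search below is a plain greedy local search rather than the Incremental Scalable Sign Vector algorithm named above. It assumes one common formulation in which a matrix X (rows are time points, columns are series) is factored as X ≈ L·Rᵀ, each column of R being a normalized Xᵀz for a sign vector z. The recovery loop (column-mean initialization, truncation to k columns, iteration until the imputed values stop changing) is likewise an assumption made for illustration, as are all function and parameter names (`sign_vector`, `centroid_decomposition`, `recover`, `k`, `tol`, `max_iter`).

```python
import numpy as np

def sign_vector(X, max_passes=100):
    """Greedy local search for z in {-1,+1}^n that locally maximizes ||X.T @ z||."""
    z = np.ones(X.shape[0])
    v = X.T @ z  # running value of X.T @ z
    for _ in range(max_passes):
        flipped = False
        for i in range(X.shape[0]):
            # Flipping z[i] changes ||v||^2 by 4*||x_i||^2 - 4*z[i]*(x_i . v).
            gain = 4.0 * (X[i] @ X[i]) - 4.0 * z[i] * (X[i] @ v)
            if gain > 1e-12:
                v -= 2.0 * z[i] * X[i]  # uses the old sign of z[i]
                z[i] = -z[i]
                flipped = True
        if not flipped:
            break
    return z

def centroid_decomposition(X, k):
    """Return L (n x k) and R (m x k) such that X is approximated by L @ R.T."""
    X = np.array(X, dtype=float)  # work on a copy; it is deflated in place
    n, m = X.shape
    L = np.zeros((n, k))
    R = np.zeros((m, k))
    for j in range(k):
        c = X.T @ sign_vector(X)
        norm = np.linalg.norm(c)
        if norm == 0.0:  # nothing left to decompose
            break
        R[:, j] = c / norm
        L[:, j] = X @ R[:, j]
        X -= np.outer(L[:, j], R[:, j])  # remove the extracted component
    return L, R

def recover(X, k=1, tol=1e-8, max_iter=200):
    """Fill NaN entries of X by iterating a rank-k truncated decomposition."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # Assumed initialization: each gap starts at the mean of its own column.
    # (A column with no observed value at all is not handled here.)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])
    for _ in range(max_iter):
        L, R = centroid_decomposition(X, k)
        approx = L @ R.T
        delta = np.linalg.norm(approx[missing] - X[missing])
        X[missing] = approx[missing]  # observed values are never overwritten
        if delta < tol:
            break
    return X
```

As a usage example, one could build three series from the same sine wave, one of them with a negative sign so that it is negatively correlated with the others, blank out a block in one series with `np.nan`, and call `recover(damaged, k=1)`. The choice of `k` controls how much structure the truncated decomposition keeps; this sketch leaves that choice to the caller.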
Keywords
Recovery of missing blocks, Time series, Centroid decomposition, Correlation