Abstract
Simultaneous confidence intervals, or confidence bands, provide an intuitive description of the variability of a time series. Given a set of \(N\) time series of length \(M\), we consider the problem of finding a confidence band that contains a \((1-\alpha )\)-fraction of the observations. We construct such confidence bands by finding the set of \(N-K\) time series whose envelope has minimum width. We refer to this problem as the minimum width envelope problem. We show that the minimum width envelope problem is \(\mathbf {NP}\)-hard, and we develop a greedy heuristic algorithm, which we compare to quantile- and distance-based confidence band methods. We also describe a method to find an effective confidence level \(\alpha _{\mathrm {eff}}\) and an effective number of observations to remove \(K_{\mathrm {eff}}\), such that the resulting confidence bands keep the family-wise error rate below \(\alpha \). We evaluate our methods on synthetic and real datasets. We demonstrate that our method can be used to construct confidence bands with guaranteed family-wise error rate control, even when there is too little data for the quantile-based methods to work.
Acknowledgments
The authors would like to thank Andreas Henelius for helpful discussions and suggestions. The work of J. Korpela and K. Puolamäki was supported in part by the Revolution of Knowledge Work Project, funded by Tekes (The Finnish Funding Agency for Innovation).
Responsible editors: Toon Calders, Floriana Esposito, Eyke Hüllermeier, Rosa Meo.
Appendix
1.1 Efficient implementation of order data structure \(\mathrm{R}\)
This section describes the data structure \(\mathrm{R}\), referred to in Algorithm 1, that makes the mwe algorithm efficient. \(\mathrm{R}\) stores the ordering information for each column \(j\) of the data matrix \(\mathbf {X}[i,j]\). A substructure \(\mathrm{R_j}\) for a single column \(j\) with \(N=5\) observations is shown in Fig. 11a. The rank order of the values in column \(j\) is stored in a doubly linked list, with the first element corresponding to the index \(i\) of the smallest element in \(\mathbf {X}[\cdot ,j]\). The second element contains the index of the second smallest value, and so on. The indices of the (second) largest and (second) smallest values can be extracted in \(O(1)\) time for a single column \(j\), or in \(O(M)\) time for all columns (all values of \(j\)).
The substructure \(\mathrm{R_j}\) additionally contains a vector of length \(N\), where the \(i\)th item is a pointer to the node of the doubly linked list that holds the value \(i\). With the help of this additional vector, it is possible to delete (bypass) the node corresponding to any time series \(i\) from the doubly linked list, as shown in Fig. 11b. This takes \(O(1)\) time for a single column \(j\) and \(O(M)\) time for the whole time series, that is, for all \(M\) columns. The data structure can be initialized in \(O(MN\log {N})\) time with a memory requirement of \(O(MN)\).
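The following is a minimal Python sketch of such a structure, not the authors' implementation. The names `ColumnOrder`, `build_order_structure` and `remove_series` are hypothetical and chosen only for illustration. The sketch assumes the data matrix is given as a list of \(N\) rows of length \(M\). It stores the linked list as two index arrays, so the vector of node pointers described above is implicit: position \(i\) of each array is the node of time series \(i\).

```python
class ColumnOrder:
    """Rank order of one column, kept as a doubly linked list of row indices."""

    def __init__(self, column):
        n = len(column)
        # Sorting the row indices by value costs O(N log N) per column.
        order = sorted(range(n), key=lambda i: column[i])
        # prev[i] and next[i] are the links of the node for row i, so the
        # node of any row can be reached in O(1) without a separate vector.
        self.prev = [None] * n
        self.next = [None] * n
        self.alive = [True] * n
        for a, b in zip(order, order[1:]):
            self.next[a] = b
            self.prev[b] = a
        self.head = order[0] if order else None   # index of smallest value
        self.tail = order[-1] if order else None  # index of largest value

    def smallest(self):
        return self.head

    def largest(self):
        return self.tail

    def second_smallest(self):
        return None if self.head is None else self.next[self.head]

    def second_largest(self):
        return None if self.tail is None else self.prev[self.tail]

    def remove(self, i):
        """Bypass the node of row i in O(1) time."""
        if not self.alive[i]:
            raise ValueError(f"row {i} was already removed")
        p, n = self.prev[i], self.next[i]
        if p is None:
            self.head = n
        else:
            self.next[p] = n
        if n is None:
            self.tail = p
        else:
            self.prev[n] = p
        self.prev[i] = self.next[i] = None
        self.alive[i] = False


def build_order_structure(X):
    """Build one ColumnOrder per column: O(M N log N) time, O(M N) memory."""
    m = len(X[0]) if X else 0
    return [ColumnOrder([row[j] for row in X]) for j in range(m)]


def remove_series(R, i):
    """Remove time series i from every column: O(M) time."""
    for column_order in R:
        column_order.remove(i)


# Toy data with N = 5 time series of length M = 2 (values are made up).
X = [[3.0, 10.0], [1.0, 40.0], [4.0, 20.0], [1.5, 30.0], [9.0, 50.0]]
R = build_order_structure(X)
extremes_before = [(c.smallest(), c.largest()) for c in R]
remove_series(R, 4)  # drop the series that is largest in both columns
extremes_after = [(c.smallest(), c.largest()) for c in R]
```

Using index arrays instead of heap-allocated nodes is a design choice of this sketch. It keeps the memory layout compact and makes the \(O(1)\) lookup of the node for series \(i\) a plain array access.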
Cite this article
Korpela, J., Puolamäki, K. & Gionis, A. Confidence bands for time series data. Data Min Knowl Disc 28, 1530–1553 (2014). https://doi.org/10.1007/s10618-014-0371-0
DOI: https://doi.org/10.1007/s10618-014-0371-0