CID: an efficient complexity-invariant distance for time series

Published in: Data Mining and Knowledge Discovery

Abstract

The ubiquity of time series data across almost all human endeavors has produced a great interest in time series data mining in the last decade. While dozens of classification algorithms have been applied to time series, recent empirical evidence strongly suggests that simple nearest neighbor classification is exceptionally difficult to beat. The choice of distance measure used by the nearest neighbor algorithm is important, and depends on the invariances required by the domain. For example, motion capture data typically requires invariance to warping, and cardiology data requires invariance to the baseline (the mean value). Similarly, recent work suggests that for time series clustering, the choice of clustering algorithm is much less important than the choice of distance measure used.

In this work we make a somewhat surprising claim: there is an invariance that the community seems to have missed, complexity invariance. Intuitively, the problem is that in many domains the different classes may have different complexities, and pairs of complex objects, even those which subjectively may seem very similar to the human eye, tend to be further apart under current distance measures than pairs of simple objects. This fact introduces errors in nearest neighbor classification, where some complex objects may be incorrectly assigned to a simpler class. Similarly, for clustering this effect can introduce errors by “suggesting” to the clustering algorithm that subjectively similar, but complex, objects belong in a sparser and larger-diameter cluster than is truly warranted.

We introduce the first complexity-invariant distance measure for time series, and show that it generally produces significant improvements in classification and clustering accuracy. We further show that this improvement does not compromise efficiency, since we can lower bound the measure and use a modification of the triangular inequality, thus making use of most existing indexing and data mining algorithms.
We evaluate our ideas with the largest and most comprehensive set of time series mining experiments ever attempted in a single work, and show that complexity-invariant distance measures can produce improvements in classification and clustering in the vast majority of cases.
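The measure itself is defined in the full text; as an illustrative sketch only, assuming the definitions from the SDM 2011 version (Batista et al. 2011): the complexity estimate of a series is the length of the line obtained by stretching it straight, and the correction factor is the ratio of the larger to the smaller complexity estimate.

```python
import math

def complexity_estimate(q):
    # CE(Q): root of the sum of squared differences between consecutive
    # points -- intuitively, how "stretched out" the series would be
    # if pulled into a straight line.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(q, q[1:])))

def cid(a, b, eps=1e-12):
    # Complexity-invariant distance: the Euclidean distance scaled by
    # the complexity correction factor CF(A, B) >= 1.
    ed = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ce_a, ce_b = complexity_estimate(a), complexity_estimate(b)
    cf = max(ce_a, ce_b) / max(min(ce_a, ce_b), eps)  # eps guards flat series
    return ed * cf
```

For two series of equal complexity, CF is 1 and CID reduces to the Euclidean distance; when one series is much more complex than the other, the distance is inflated, which is what prevents complex objects from being pulled into simpler classes. The `eps` guard for constant series is our addition, not part of the paper's definition.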

(Figures 1–30 appear in the full article.)

Notes

  1. In practice these two options can be logically equivalent. For example, DTW can be seen as a more robust distance measure, or it can be seen as using the Euclidean distance after a dynamic programming algorithm has removed the warping.

  2. This experiment, like all others in this work, is 100% reproducible; see Sect. 4 for our experimental philosophy.

  3. The running times were obtained on an Intel Core i7 at 3.4 GHz with 8 GB of RAM. These results are also reproducible: the paper website has ready-to-use scripts that report running times and spreadsheet results with detailed execution times for each data set.

  4. For data lengths/reduced dimensionality that are powers of two, PAA is exactly equivalent to Haar wavelets (Ding et al. 2008).
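PAA is not defined in this excerpt; a minimal sketch of piecewise aggregate approximation under its standard definition (each reduced dimension is the mean of an equal-width segment; the sketch assumes the length divides evenly):

```python
def paa(series, segments):
    # Piecewise Aggregate Approximation: split the series into
    # `segments` equal-width windows and keep each window's mean.
    n = len(series)
    if n % segments != 0:
        raise ValueError("sketch assumes length divisible by segments")
    width = n // segments
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(segments)]
```

For a series of length 8 reduced to 4 dimensions (both powers of two, the case in which PAA coincides with the Haar approximation), `paa([1, 2, 3, 4, 5, 6, 7, 8], 4)` yields `[1.5, 3.5, 5.5, 7.5]`.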

  5. Therefore, Orchard’s algorithm requires \(O(m^2)\) space, where \(m\) is the number of database objects. However, Ye et al. (2009) show how to significantly reduce the space requirement while producing nearly identical speedup.

  6. Notice that \(\rho = CF(A,B)\) is not a bounded value; however, for indexing purposes this fact has no major consequences. \(\rho \)-relaxed triangular inequality implies \(2\rho \)-inframetric inequality. This means that the following property also holds: \(D_{CID}(A,B) \le 2 \rho \max {(D_{CID}(A,C), D_{CID}(C,B))}\).

References

  • Andino SG, de Peralta Menendez RG (2000) Measuring the complexity of time series: an application to neurophysiological signals. Hum Brain Mapp 11(1):46–57

  • Aziz W, Arif M (2006) Complexity analysis of stride interval time series by threshold dependent symbolic entropy. Eur J Appl Physiol 98:30–40. doi:10.1007/s00421-006-0226-5

  • Bandt C, Pompe B (2002) Permutation entropy: a natural complexity measure for time series. Phys Rev Lett 88(17). doi:10.1103/PhysRevLett.88.174102

  • Batista G (2011) Website for this paper. http://www.icmc.usp.br/~gbatista/cid (Online)

  • Batista G, Wang X, Keogh EJ (2011) A complexity-invariant distance measure for time series. In: Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pp 699–710. http://www.siam.omnibooksonline.com/2011datamining/data/papers/106.pdf

  • Chandola V, Cheboli D, Kumar V (2009) Detecting anomalies in a time series database. CS Technical Report 09–004, Computer Science Department, University of Minnesota

  • Chávez E, Navarro G, Baeza-Yates R, Marroquín JL (2001) Searching in metric spaces. ACM Comput Surv 33:273–321. doi:10.1145/502807.502808

  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  • Ding H, Trajcevski G, Wang X, Keogh E (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. In: International Conference on Very Large Data Bases, pp 1542–1552

  • Elkan C (2003) Using the triangle inequality to accelerate k-means. In: International Conference on Machine Learning, pp 147–153

  • Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. SIGMOD Rec 23:419–429. doi:10.1145/191843.191925

  • Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PC, Mark RG, Mietus JE, Moody GB, Peng CK, Stanley HE (2000) PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23):e215–e220

  • Hearn DJ (2009) Shape analysis for the automated identification of plants from images of leaves. Taxon 58:934–954(21). http://www.ingentaconnect.com/content/iapt/tax/2009/00000058/00000003/art00021

  • Hjaltason GR, Samet H (2003) Index-driven similarity search in metric spaces (survey article). ACM Trans Database Syst 28:517–580. doi:10.1145/958942.958948

  • Hu B, Rakthanmanon T, Hao Y, Evans S, Lonardi S, Keogh E (2011) Discovering the intrinsic cardinality and dimensionality of time series using MDL. In: IEEE International Conference on Data Mining (ICDM), pp 1086–1091

  • Kaufman L, Rousseeuw PJ (2005) Finding groups in data: an introduction to cluster analysis. Wiley, New York

  • Keogh E (2002) Exact indexing of dynamic time warping. In: International Conference on Very Large Data Bases, pp 406–417. http://portal.acm.org/citation.cfm?id=1287369.1287405

  • Keogh E (2003) Efficiently finding arbitrarily scaled patterns in massive time series databases. In: Knowledge Discovery in Databases: PKDD 2003, vol 2838, pp 253–265. doi:10.1007/978-3-540-39804-2_24

  • Keogh EJ, Xi X, Wei L, Ratanamahatana C (2006) The UCR time series classification/clustering homepage. http://www.cs.ucr.edu/~eamonn/time_series_data/ (Online)

  • Keogh E, Lonardi S, Ratanamahatana CA, Wei L, Lee SH, Handley J (2007) Compression-based data mining of sequential data. Data Min Knowl Discov 14:99–129. http://portal.acm.org/citation.cfm?id=1231311.1231321

  • Keogh E, Wei L, Xi X, Vlachos M, Lee SH, Protopapas P (2009) Supporting exact indexing of arbitrarily rotated shapes and periodic time series under euclidean and warping distance measures. VLDB J 18:611–630. doi:10.1007/s00778-008-0111-4

  • Li M, Vitányi PM (2008) An introduction to Kolmogorov complexity and its applications, 3rd edn. Springer, Heidelberg

  • Li K, Yan M, Yuan S (2002) A simple statistical model for depicting the cdc15-synchronized yeast cell-cycle regulated gene expression data. Stat Sin 12:141–158

  • Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp 2–11

  • Moore A (2000) The anchors hierarchy: using the triangle inequality to survive high-dimensional data. In: Conference on Uncertainty in Artificial Intelligence, pp 397–405

  • Mueen A, Keogh E, Bigdely-Shamlo N (2009) Finding time series motifs in disk-resident data. In: IEEE International Conference on Data Mining, pp 367–376. doi:10.1109/ICDM.2009.15

  • Ng AY, Jordan MI, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems, MIT Press, Cambridge, pp 849–856

  • Orchard M (1991) A fast nearest-neighbor search algorithm. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP-91), vol 4, pp 2297–2300. doi:10.1109/ICASSP.1991.150755

  • Protopapas P, Giammarco JM, Faccioli L, Struble MF, Dave R, Alcock C (2006) Finding outlier light curves in catalogues of periodic variable stars. Mon Notices R Astron Soc 369(2):677–696

  • Rabiner L, Schafer R (1978) Digital Processing of Speech Signals. Prentice Hall, Englewood Cliffs

  • Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: ACM KDD, pp 262–270

  • Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850

  • Rezek I (1998) Stochastic complexity measures for physiological signal analysis. IEEE Trans Biomed Eng 44(9):1186–1191

  • Schroeder M (2009) Fractals, Chaos, Power Laws: minutes from an infinite paradise. Dover Publications, New York

  • Vlachos M, Hadjieleftheriou M, Gunopulos D, Keogh EJ (2003) Indexing multi-dimensional time-series with support for multiple distance measures. In: ACM KDD, pp 216–225

  • Yankov D, Keogh E, Rebbapragada U (2008) Disk aware discord discovery: finding unusual time series in terabyte sized datasets. Knowl Info Syst 17:241–262. doi:10.1007/s10115-008-0131-9

  • Ye L, Wang X, Keogh EJ, Mafra-Neto A (2009) Autocannibalistic and anyspace indexing algorithms with application to sensor data mining. In: SIAM International Conference on Data Mining, pp 85–96. http://www.siam.org/proceedings/datamining/2009/dm09_009_yel.pdf

  • Žunić J, Rosin P, Kopanja L (2006) Shape orientability. In: Computer Vision (ACCV 2006), pp 11–20. doi:10.1007/11612704_2


Acknowledgments

Thanks to Abdullah Mueen and Pavlos Protopapas for their help with the star light curve experiments, to Bing Hu and Yuan Hao for their help preparing some of the datasets, and to Thanawin Rakthanmanon, Ronaldo C. Prati, Edson T. Matsubara, André G. Maletzke and Claudia R. Milaré for their suggestions on a draft version of this paper. This work is an expanded version of a paper that appeared in SDM 2011 (Batista et al. 2011), in which we showed that CID is useful for classification. In this expanded version we also show that complexity invariance is useful for other data mining tasks such as clustering and anomaly detection. The first author is on leave from Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Brazil. This work was funded by NSF awards 0803410 and 0808770, FAPESP awards 2009/06349-0 and 2012/07295-3 and a gift from Microsoft.

Corresponding author

Correspondence to Gustavo E. A. P. A. Batista.

Additional information

Communicated by Kristian Kersting.

Appendix

1.1 A \(\rho \)-relaxed triangular inequality proof

In this section we prove that CID obeys the \(\rho \)-relaxed triangular inequality:

$$\begin{aligned} D_{CID}(A,B) \le \rho (D_{CID}(A,C) + D_{CID}(C,B)) \end{aligned}$$

We start our proof by stating the triangular inequality of Euclidean distance:

$$\begin{aligned} ED(A,B) \le ED(A,C) + ED(C,B) \end{aligned}$$

Remember that the complexity correction factor \(CF\) is a quantity greater than or equal to one; therefore, we can multiply both sides of the inequality by \(CF(A,B)\):

$$\begin{aligned} ED(A,B)CF(A,B) \le CF(A,B)(ED(A,C)+ED(C,B)) \end{aligned}$$

The left-hand side of the inequality is our definition of CID; hence:

$$\begin{aligned} D_{CID}(A,B) \le \rho (ED(A,C)+ED(C,B)) \end{aligned}$$

with \(\rho = CF(A,B)\).

Finally, we can again use the fact that \(CF\) is greater than or equal to one to replace the Euclidean distances with CID on the right-hand side of the inequality, which can only enlarge it:

$$\begin{aligned} D_{CID}(A,B) \le \rho (ED(A,C)CF(A,C)+ED(C,B)CF(C,B)) \end{aligned}$$

And, therefore:

$$\begin{aligned} D_{CID}(A,B) \le \rho (D_{CID}(A,C)+D_{CID}(C,B)) \end{aligned}$$
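The bound proved above can also be checked numerically; a minimal sketch, assuming the paper's definitions of \(CE\) and \(CF\) (the helper names `ce`, `ed` and `cid` are ours):

```python
import math
import random

def ce(q):
    # Complexity estimate: root sum of squared successive differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(q, q[1:])))

def ed(a, b):
    # Plain Euclidean distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cid(a, b):
    # CID = ED scaled by CF(A, B) = max(CE) / min(CE) >= 1.
    ca, cb = ce(a), ce(b)
    return ed(a, b) * (max(ca, cb) / min(ca, cb))

random.seed(0)
for _ in range(1000):
    a, b, c = ([random.gauss(0.0, 1.0) for _ in range(32)] for _ in range(3))
    rho = max(ce(a), ce(b)) / min(ce(a), ce(b))  # rho = CF(A, B)
    # The rho-relaxed triangular inequality, with a small float tolerance:
    assert cid(a, b) <= rho * (cid(a, c) + cid(c, b)) + 1e-9
```

Note that \(\rho\) depends on the pair \((A,B)\), which is exactly why this is a relaxed rather than a true triangular inequality.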


Cite this article

Batista, G.E.A.P.A., Keogh, E.J., Tataw, O.M. et al. CID: an efficient complexity-invariant distance for time series. Data Min Knowl Disc 28, 634–669 (2014). https://doi.org/10.1007/s10618-013-0312-3
