CID: an efficient complexity-invariant distance for time series

Published in: Data Mining and Knowledge Discovery

Abstract

The ubiquity of time series data across almost all human endeavors has produced a great interest in time series data mining in the last decade. While dozens of classification algorithms have been applied to time series, recent empirical evidence strongly suggests that simple nearest neighbor classification is exceptionally difficult to beat. The choice of distance measure used by the nearest neighbor algorithm is important, and depends on the invariances required by the domain. For example, motion capture data typically requires invariance to warping, and cardiology data requires invariance to the baseline (the mean value). Similarly, recent work suggests that for time series clustering, the choice of clustering algorithm is much less important than the choice of distance measure used.

In this work we make a somewhat surprising claim: there is an invariance that the community seems to have missed, complexity invariance. Intuitively, the problem is that in many domains the different classes may have different complexities, and pairs of complex objects, even those which subjectively may seem very similar to the human eye, tend to be further apart under current distance measures than pairs of simple objects. This fact introduces errors in nearest neighbor classification, where some complex objects may be incorrectly assigned to a simpler class. Similarly, for clustering this effect can introduce errors by “suggesting” to the clustering algorithm that subjectively similar, but complex, objects belong in a sparser and larger-diameter cluster than is truly warranted.

We introduce the first complexity-invariant distance measure for time series, and show that it generally produces significant improvements in classification and clustering accuracy. We further show that this improvement does not compromise efficiency, since we can lower bound the measure and use a modification of the triangular inequality, thus making use of most existing indexing and data mining algorithms.
We evaluate our ideas with the largest and most comprehensive set of time series mining experiments ever attempted in a single work, and show that complexity-invariant distance measures can produce improvements in classification and clustering in the vast majority of cases.
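The measure itself is defined in the full text; as an illustrative sketch only, assuming the definitions from the SDM 2011 version (Batista et al. 2011): the complexity estimate of a series is the length of the line obtained by stretching it straight, and the correction factor is the ratio of the larger to the smaller complexity estimate.

```python
import math

def complexity_estimate(q):
    # CE(Q): root of the sum of squared differences between consecutive
    # points -- intuitively, how "stretched out" the series would be
    # if pulled into a straight line.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(q, q[1:])))

def cid(a, b, eps=1e-12):
    # Complexity-invariant distance: the Euclidean distance scaled by
    # the complexity correction factor CF(A, B) >= 1.
    ed = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ce_a, ce_b = complexity_estimate(a), complexity_estimate(b)
    cf = max(ce_a, ce_b) / max(min(ce_a, ce_b), eps)  # eps guards flat series
    return ed * cf
```

For two series of equal complexity, CF is 1 and CID reduces to the Euclidean distance; when one series is much more complex than the other, the distance is inflated, which is what prevents complex objects from being pulled into simpler classes. The `eps` guard for constant series is our addition, not part of the paper's definition.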

(Figures 1–30 appear in the full article.)

Notes

  1. In practice these two options can be logically equivalent. For example, DTW can be seen as a more robust distance measure, or it can be seen as using the Euclidean distance after a dynamic programming algorithm has removed the warping.

  2. This experiment, like all others in this work, is 100% reproducible; see Sect. 4 for our experimental philosophy.

  3. The running times were obtained on an Intel Core i7 at 3.4 GHz with 8 GB of RAM. These results are also reproducible: the paper website has ready-to-use scripts that report running times and spreadsheet results with detailed execution times for each data set.

  4. For data lengths/reduced dimensionality that are powers of two, PAA is exactly equivalent to Haar wavelets (Ding et al. 2008).
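PAA is not defined in this excerpt; a minimal sketch of piecewise aggregate approximation under its standard definition (each reduced dimension is the mean of an equal-width segment; the sketch assumes the length divides evenly):

```python
def paa(series, segments):
    # Piecewise Aggregate Approximation: split the series into
    # `segments` equal-width windows and keep each window's mean.
    n = len(series)
    if n % segments != 0:
        raise ValueError("sketch assumes length divisible by segments")
    width = n // segments
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(segments)]
```

For a series of length 8 reduced to 4 dimensions (both powers of two, the case in which PAA coincides with the Haar approximation), `paa([1, 2, 3, 4, 5, 6, 7, 8], 4)` yields `[1.5, 3.5, 5.5, 7.5]`.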

  5. Therefore, Orchard’s algorithm requires \(O(m^2)\) space, where \(m\) is the number of database objects. However, Ye et al. (2009) show how to significantly reduce the space requirement while producing nearly identical speedup.

  6. Notice that \(\rho = CF(A,B)\) is not a bounded value; however, for indexing purposes this fact has no major consequences. \(\rho \)-relaxed triangular inequality implies \(2\rho \)-inframetric inequality. This means that the following property also holds: \(D_{CID}(A,B) \le 2 \rho \max {(D_{CID}(A,C), D_{CID}(C,B))}\).

References

  • Andino SG, de Peralta Menendez RG (2000) Measuring the complexity of time series: an application to neurophysiological signals. Hum Brain Mapp 11(1):46–57

  • Aziz W, Arif M (2006) Complexity analysis of stride interval time series by threshold dependent symbolic entropy. Eur J Appl Physiol 98:30–40. doi:10.1007/s00421-006-0226-5

  • Bandt C, Pompe B (2002) Permutation entropy: a natural complexity measure for time series. Phys Rev Lett 88(17). doi:10.1103/PhysRevLett.88.174102

  • Batista G (2011) Website for this paper. http://www.icmc.usp.br/~gbatista/cid (Online)

  • Batista G, Wang X, Keogh EJ (2011) A complexity-invariant distance measure for time series. In: Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pp 699–710. http://www.siam.omnibooksonline.com/2011datamining/data/papers/106.pdf

  • Chandola V, Cheboli D, Kumar V (2009) Detecting anomalies in a time series database. CS Technical Report 09–004, Computer Science Department, University of Minnesota

  • Chávez E, Navarro G, Baeza-Yates R, Marroquín JL (2001) Searching in metric spaces. ACM Comput Surv 33:273–321. doi:10.1145/502807.502808

  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  • Ding H, Trajcevski G, Wang X, Keogh E (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. In: International Conference on Very Large Data Bases, pp 1542–1552

  • Elkan C (2003) Using the triangle inequality to accelerate k-means. In: International Conference on Machine Learning, pp 147–153

  • Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. SIGMOD Rec 23:419–429. doi:10.1145/191843.191925

  • Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PC, Mark RG, Mietus JE, Moody GB, Peng CK, Stanley HE (2000) PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23):e215–e220

  • Hearn DJ (2009) Shape analysis for the automated identification of plants from images of leaves. Taxon 58:934–954(21). http://www.ingentaconnect.com/content/iapt/tax/2009/00000058/00000003/art00021

  • Hjaltason GR, Samet H (2003) Index-driven similarity search in metric spaces (survey article). ACM Trans Database Syst 28:517–580. doi:10.1145/958942.958948

  • Hu B, Rakthanmanon T, Hao Y, Evans S, Lonardi S, Keogh E (2011) Discovering the intrinsic cardinality and dimensionality of time series using MDL. In: IEEE International Conference on Data Mining (ICDM), pp 1086–1091

  • Kaufman L, Rousseeuw PJ (2005) Finding groups in data: an introduction to cluster analysis. Wiley, New York

  • Keogh E (2002) Exact indexing of dynamic time warping. In: International Conference on Very Large Data Bases, pp 406–417. http://portal.acm.org/citation.cfm?id=1287369.1287405

  • Keogh E (2003) Efficiently finding arbitrarily scaled patterns in massive time series databases. In: Knowledge Discovery in Databases: PKDD 2003, vol 2838, pp 253–265. doi:10.1007/978-3-540-39804-2_24

  • Keogh EJ, Xi X, Wei L, Ratanamahatana C (2006) The UCR time series classification/clustering homepage. http://www.cs.ucr.edu/~eamonn/time_series_data/ (Online)

  • Keogh E, Lonardi S, Ratanamahatana CA, Wei L, Lee SH, Handley J (2007) Compression-based data mining of sequential data. Data Min Knowl Discov 14:99–129. http://portal.acm.org/citation.cfm?id=1231311.1231321

  • Keogh E, Wei L, Xi X, Vlachos M, Lee SH, Protopapas P (2009) Supporting exact indexing of arbitrarily rotated shapes and periodic time series under euclidean and warping distance measures. VLDB J 18:611–630. doi:10.1007/s00778-008-0111-4

  • Li M, Vitányi PM (2008) An introduction to Kolmogorov complexity and its applications, 3rd edn. Springer, Heidelberg

  • Li K, Yan M, Yuan S (2002) A simple statistical model for depicting the cdc15-synchronized yeast cell-cycle regulated gene expression data. Stat Sin 12:141–158

  • Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp 2–11

  • Moore A (2000) The anchors hierarchy: using the triangle inequality to survive high-dimensional data. In: Conference on Uncertainty in Artificial Intelligence, pp 397–405

  • Mueen A, Keogh E, Bigdely-Shamlo N (2009) Finding time series motifs in disk-resident data. In: IEEE International Conference on Data Mining, pp 367–376. doi:10.1109/ICDM.2009.15

  • Ng AY, Jordan MI, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems, MIT Press, Cambridge, pp 849–856

  • Orchard M (1991) A fast nearest-neighbor search algorithm. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP-91), vol 4, pp 2297–2300. doi:10.1109/ICASSP.1991.150755

  • Protopapas P, Giammarco JM, Faccioli L, Struble MF, Dave R, Alcock C (2006) Finding outlier light curves in catalogues of periodic variable stars. Mon Notices R Astron Soc 369(2):677–696

  • Rabiner L, Schafer R (1978) Digital Processing of Speech Signals. Prentice Hall, Englewood Cliffs

  • Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: ACM KDD, pp 262–270

  • Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850

  • Rezek I (1998) Stochastic complexity measures for physiological signal analysis. IEEE Trans Biomed Eng 44(9):1186–1191

  • Schroeder M (2009) Fractals, Chaos, Power Laws: minutes from an infinite paradise. Dover Publications, New York

  • Vlachos M, Hadjieleftheriou M, Gunopulos D, Keogh EJ (2003) Indexing multi-dimensional time-series with support for multiple distance measures. In: ACM KDD, pp 216–225

  • Yankov D, Keogh E, Rebbapragada U (2008) Disk aware discord discovery: finding unusual time series in terabyte sized datasets. Knowl Info Syst 17:241–262. doi:10.1007/s10115-008-0131-9

  • Ye L, Wang X, Keogh EJ, Mafra-Neto A (2009) Autocannibalistic and anyspace indexing algorithms with application to sensor data mining. In: SIAM International Conference on Data Mining, pp 85–96. http://www.siam.org/proceedings/datamining/2009/dm09_009_yel.pdf

  • Žunić J, Rosin P, Kopanja L (2006) Shape orientability. In: Computer Vision (ACCV 2006), pp 11–20. doi:10.1007/11612704_2


Acknowledgments

Thanks to Abdullah Mueen and Pavlos Protopapas for their help with the star light curve experiments, to Bing Hu and Yuan Hao for their help preparing some of the datasets, and to Thanawin Rakthanmanon, Ronaldo C. Prati, Edson T. Matsubara, André G. Maletzke and Claudia R. Milaré for their suggestions on a draft version of this paper. This work is an expanded version of a paper that appeared in SDM 2011 (Batista et al. 2011), in which we showed that CID is useful for classification. In this expanded version we also show that complexity invariance is useful for other data mining tasks such as clustering and anomaly detection. The first author is on leave from Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Brazil. This work was funded by NSF awards 0803410 and 0808770, FAPESP awards 2009/06349-0 and 2012/07295-3 and a gift from Microsoft.

Corresponding author

Correspondence to Gustavo E. A. P. A. Batista.

Additional information

Communicated by Kristian Kersting.

Appendix

1.1 A \(\rho \)-relaxed triangular inequality proof

In this section we prove that CID obeys the \(\rho \)-relaxed triangular inequality:

$$\begin{aligned} D_{CID}(A,B) \le \rho (D_{CID}(A,C) + D_{CID}(C,B)) \end{aligned}$$

We start our proof by stating the triangular inequality of Euclidean distance:

$$\begin{aligned} ED(A,B) \le ED(A,C) + ED(C,B) \end{aligned}$$

Remember that the complexity correction factor \(CF\) is a quantity greater than or equal to one; therefore, we can multiply both sides of the inequality by \(CF(A,B)\):

$$\begin{aligned} ED(A,B)CF(A,B) \le CF(A,B)(ED(A,C)+ED(C,B)) \end{aligned}$$

The left-hand side of the inequality is our definition of CID; hence:

$$\begin{aligned} D_{CID}(A,B) \le \rho (ED(A,C)+ED(C,B)) \end{aligned}$$

with \(\rho = CF(A,B)\).

Finally, we can again use the fact that \(CF\) is greater than or equal to one to replace the Euclidean distances with CID on the right-hand side of the inequality, which can only enlarge it:

$$\begin{aligned} D_{CID}(A,B) \le \rho (ED(A,C)CF(A,C)+ED(C,B)CF(C,B)) \end{aligned}$$

And, therefore:

$$\begin{aligned} D_{CID}(A,B) \le \rho (D_{CID}(A,C)+D_{CID}(C,B)) \end{aligned}$$
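The bound proved above can also be checked numerically; a minimal sketch, assuming the paper's definitions of \(CE\) and \(CF\) (the helper names `ce`, `ed` and `cid` are ours):

```python
import math
import random

def ce(q):
    # Complexity estimate: root sum of squared successive differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(q, q[1:])))

def ed(a, b):
    # Plain Euclidean distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cid(a, b):
    # CID = ED scaled by CF(A, B) = max(CE) / min(CE) >= 1.
    ca, cb = ce(a), ce(b)
    return ed(a, b) * (max(ca, cb) / min(ca, cb))

random.seed(0)
for _ in range(1000):
    a, b, c = ([random.gauss(0.0, 1.0) for _ in range(32)] for _ in range(3))
    rho = max(ce(a), ce(b)) / min(ce(a), ce(b))  # rho = CF(A, B)
    # The rho-relaxed triangular inequality, with a small float tolerance:
    assert cid(a, b) <= rho * (cid(a, c) + cid(c, b)) + 1e-9
```

Note that \(\rho\) depends on the pair \((A,B)\), which is exactly why this is a relaxed rather than a true triangular inequality.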


Cite this article

Batista, G.E.A.P.A., Keogh, E.J., Tataw, O.M. et al. CID: an efficient complexity-invariant distance for time series. Data Min Knowl Disc 28, 634–669 (2014). https://doi.org/10.1007/s10618-013-0312-3
