Abstract
The ubiquity of time series data across almost all human endeavors has produced great interest in time series data mining in the last decade. While dozens of classification algorithms have been applied to time series, recent empirical evidence strongly suggests that simple nearest neighbor classification is exceptionally difficult to beat. The choice of distance measure used by the nearest neighbor algorithm is important, and depends on the invariances required by the domain. For example, motion capture data typically requires invariance to warping, and cardiology data requires invariance to the baseline (the mean value). Similarly, recent work suggests that for time series clustering, the choice of clustering algorithm is much less important than the choice of distance measure.

In this work we make a somewhat surprising claim: there is an invariance that the community seems to have missed, complexity invariance. Intuitively, the problem is that in many domains the different classes may have different complexities, and pairs of complex objects, even those which subjectively may seem very similar to the human eye, tend to be further apart under current distance measures than pairs of simple objects. This fact introduces errors in nearest neighbor classification, where some complex objects may be incorrectly assigned to a simpler class. Similarly, for clustering this effect can introduce errors by “suggesting” to the clustering algorithm that subjectively similar, but complex, objects belong in a sparser and larger-diameter cluster than is truly warranted.

We introduce the first complexity-invariant distance measure for time series, and show that it generally produces significant improvements in classification and clustering accuracy. We further show that this improvement does not compromise efficiency, since we can lower bound the measure and use a modification of the triangular inequality, thus allowing the use of most existing indexing and data mining algorithms.
We evaluate our ideas with the largest and most comprehensive set of time series mining experiments ever attempted in a single work, and show that complexity-invariant distance measures can produce improvements in classification and clustering in the vast majority of cases.
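To make the abstract concrete, the complexity-invariant distance (CID) multiplies the Euclidean distance by a complexity correction factor \(CF \ge 1\), built from a complexity estimate of each series. A minimal Python sketch of these definitions (function names are ours; the definitions match those used in the appendix proof):

```python
import numpy as np

def complexity_estimate(q):
    # CE(Q): root of the sum of squared differences between
    # consecutive points -- intuitively, the length of the line
    # obtained by stretching the series flat.
    return np.sqrt(np.sum(np.diff(q) ** 2))

def cid_distance(q, c):
    # D_CID(Q, C) = ED(Q, C) * CF(Q, C), where the correction
    # factor CF = max(CE)/min(CE) >= 1 penalizes pairs whose
    # complexities differ.  Assumes at least one series is
    # non-constant (CE > 0); a small epsilon could guard that case.
    q, c = np.asarray(q, dtype=float), np.asarray(c, dtype=float)
    ed = np.sqrt(np.sum((q - c) ** 2))
    ce_q, ce_c = complexity_estimate(q), complexity_estimate(c)
    cf = max(ce_q, ce_c) / min(ce_q, ce_c)
    return ed * cf
```

When both series have equal complexity, \(CF = 1\) and CID reduces exactly to the Euclidean distance.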
Notes
In practice these two options can be logically equivalent. For example, DTW can be seen as a more robust distance measure, or it can be seen as using the Euclidean distance after a dynamic programming algorithm has removed the warping.
This experiment, like all others in this work, is 100% reproducible; see Sect. 4 for our experimental philosophy.
The running times were obtained on an Intel Core i7 (3.4 GHz, 8 GB of RAM) machine. These results are also reproducible; the paper website has ready-to-use scripts that report running times, and spreadsheets with detailed execution times for each data set.
For data lengths/reduced dimensionality that are powers of two, PAA is exactly equivalent to Haar wavelets (Ding et al. 2008).
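The footnote above can be checked numerically: when the reduced dimensionality divides a power-of-two length, PAA segment means coincide with the (unnormalized) Haar approximation coefficients obtained by repeated pairwise averaging. A small sketch, our own illustration rather than code from the paper:

```python
import numpy as np

def paa(x, n_segments):
    # Piecewise Aggregate Approximation: mean of equal-length segments.
    return np.asarray(x, dtype=float).reshape(n_segments, -1).mean(axis=1)

def haar_approximation(x, levels):
    # Haar scaling (approximation) coefficients via repeated pairwise
    # averaging; unnormalized, so each level keeps plain segment means
    # over blocks of length 2**levels.
    x = np.asarray(x, dtype=float)
    for _ in range(levels):
        x = 0.5 * (x[0::2] + x[1::2])
    return x

x = np.array([7.0, 5.0, 5.0, 3.0, 3.0, 3.0, 4.0, 6.0])
assert np.allclose(paa(x, 2), haar_approximation(x, 2))  # both [5., 4.]
assert np.allclose(paa(x, 4), haar_approximation(x, 1))
```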
Therefore, Orchard’s algorithm requires \(O(m^2)\) space, where \(m\) is the number of database objects. However, Ye et al. (2009) show how to significantly reduce the space requirement while producing nearly identical speedup.
Notice that \(\rho = CF(A,B)\) is not a bounded value; however, for indexing purposes this fact has no major consequences. The \(\rho \)-relaxed triangular inequality implies the \(2\rho \)-inframetric inequality, meaning that the following property also holds: \(D_{CID}(A,B) \le 2\rho \max(D_{CID}(A,C), D_{CID}(C,B))\).
References
Andino SG, de Peralta Menendez RG (2000) Measuring the complexity of time series: an application to neurophysiological signals. Hum Brain Mapp 11(1):46–57
Aziz W, Arif M (2006) Complexity analysis of stride interval time series by threshold dependent symbolic entropy. Eur J Appl Physiol 98:30–40. doi:10.1007/s00421-006-0226-5
Bandt C, Pompe B (2002) Permutation entropy: a natural complexity measure for time series. Phys Rev Lett 88(17). doi:10.1103/PhysRevLett.88.174102
Batista G (2011) Website for this paper. http://www.icmc.usp.br/~gbatista/cid (Online)
Batista G, Wang X, Keogh EJ (2011) A complexity-invariant distance measure for time series. In: Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pp 699–710. http://www.siam.omnibooksonline.com/2011datamining/data/papers/106.pdf
Chandola V, Cheboli D, Kumar V (2009) Detecting anomalies in a time series database. CS Technical Report 09–004, Computer Science Department, University of Minnesota
Chávez E, Navarro G, Baeza-Yates R, Marroquín JL (2001) Searching in metric spaces. ACM Comput Surv 33:273–321. doi:10.1145/502807.502808
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Ding H, Trajcevski G, Wang X, Keogh E (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. In: International Conference on Very Large Data Bases, pp 1542–1552
Elkan C (2003) Using the triangle inequality to accelerate k-means. In: International Conference on Machine Learning, pp 147–153
Faloutsos C, Ranganathan M, Manolopoulos Y (1994) Fast subsequence matching in time-series databases. SIGMOD Rec 23:419–429. doi:10.1145/191843.191925
Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PC, Mark RG, Mietus JE, Moody GB, Peng CK, Stanley HE (2000) PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23):e215–e220
Hearn DJ (2009) Shape analysis for the automated identification of plants from images of leaves. Taxon 58:934–954(21). http://www.ingentaconnect.com/content/iapt/tax/2009/00000058/00000003/art00021
Hjaltason GR, Samet H (2003) Index-driven similarity search in metric spaces (survey article). ACM Trans Database Syst 28:517–580. doi:10.1145/958942.958948
Hu B, Rakthanmanon T, Hao Y, Evans S, Lonardi S, Keogh E (2011) Discovering the intrinsic cardinality and dimensionality of time series using MDL. In: IEEE International Conference on Data Mining (ICDM), pp 1086–1091
Kaufman L, Rousseeuw PJ (2005) Finding groups in data: an introduction to cluster analysis. Wiley, New York
Keogh E (2002) Exact indexing of dynamic time warping. In: International Conference on Very Large Data Bases, pp 406–417. http://portal.acm.org/citation.cfm?id=1287369.1287405
Keogh E (2003) Efficiently finding arbitrarily scaled patterns in massive time series databases. In: Knowledge Discovery in Databases: PKDD 2003, vol 2838, pp 253–265. doi:10.1007/978-3-540-39804-2_24
Keogh EJ, Xi X, Wei L, Ratanamahatana C (2006) The UCR time series classification/clustering homepage. http://www.cs.ucr.edu/~eamonn/time_series_data/ (Online)
Keogh E, Lonardi S, Ratanamahatana CA, Wei L, Lee SH, Handley J (2007) Compression-based data mining of sequential data. Data Min Knowl Discov 14:99–129. http://portal.acm.org/citation.cfm?id=1231311.1231321
Keogh E, Wei L, Xi X, Vlachos M, Lee SH, Protopapas P (2009) Supporting exact indexing of arbitrarily rotated shapes and periodic time series under Euclidean and warping distance measures. VLDB J 18:611–630. doi:10.1007/s00778-008-0111-4
Li M, Vitányi PM (2008) An introduction to Kolmogorov complexity and its applications, 3rd edn. Springer, Heidelberg
Li K, Yan M, Yuan S (2002) A simple statistical model for depicting the cdc15-synchronized yeast cell-cycle regulated gene expression data. Stat Sin 12:141–158
Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp 2–11
Moore A (2000) The anchors hierarchy: using the triangle inequality to survive high-dimensional data. In: Conference on Uncertainty in Artificial Intelligence, pp 397–405
Mueen A, Keogh E, Bigdely-Shamlo N (2009) Finding time series motifs in disk-resident data. In: IEEE International Conference on Data Mining, pp 367–376. doi:10.1109/ICDM.2009.15
Ng AY, Jordan MI, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems, MIT Press, Cambridge, pp 849–856
Orchard M (1991) A fast nearest-neighbor search algorithm. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP-91), vol 4, pp 2297–2300. doi:10.1109/ICASSP.1991.150755
Protopapas P, Giammarco JM, Faccioli L, Struble MF, Dave R, Alcock C (2006) Finding outlier light curves in catalogues of periodic variable stars. Mon Notices R Astron Soc 369(2):677–696
Rabiner L, Schafer R (1978) Digital Processing of Speech Signals. Prentice Hall, Englewood Cliffs
Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E (2012) Searching and mining trillions of time series subsequences under dynamic time warping. In: ACM KDD, pp 262–270
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850
Rezek I (1998) Stochastic complexity measures for physiological signal analysis. IEEE Trans Biomed Eng 44(9):1186–1191
Schroeder M (2009) Fractals, chaos, power laws: minutes from an infinite paradise. Dover Publications, New York
Vlachos M, Hadjieleftheriou M, Gunopulos D, Keogh EJ (2003) Indexing multi-dimensional time-series with support for multiple distance measures. In: ACM KDD, pp 216–225
Yankov D, Keogh E, Rebbapragada U (2008) Disk aware discord discovery: finding unusual time series in terabyte sized datasets. Knowl Info Syst 17:241–262. doi:10.1007/s10115-008-0131-9
Ye L, Wang X, Keogh EJ, Mafra-Neto A (2009) Autocannibalistic and anyspace indexing algorithms with application to sensor data mining. In: SIAM International Conference on Data Mining, pp 85–96. http://www.siam.org/proceedings/datamining/2009/dm09_009_yel.pdf
Žunić J, Rosin P, Kopanja L (2006) Shape orientability. In: Computer Vision – ACCV 2006, pp 11–20. doi:10.1007/11612704_2
Acknowledgments
Thanks to Abdullah Mueen and Pavlos Protopapas for their help with the star light curve experiments, to Bing Hu and Yuan Hao for their help preparing some of the datasets, and to Thanawin Rakthanmanon, Ronaldo C. Prati, Edson T. Matsubara, André G. Maletzke and Claudia R. Milaré for their suggestions on a draft version of this paper. This work is an expanded version of a paper that appeared in SDM 2011 (Batista et al. 2011), in which we showed that CID is useful for classification. In this expanded version we also show that complexity invariance is useful for other data mining tasks such as clustering and anomaly detection. The first author is on leave from Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Brazil. This work was funded by NSF awards 0803410 and 0808770, FAPESP awards 2009/06349-0 and 2012/07295-3, and a gift from Microsoft.
Communicated by Kristian Kersting.
Appendix
1.1 A \(\rho \)-relaxed triangular inequality proof
In this section we prove that CID obeys the \(\rho \)-relaxed triangular inequality:

\[D_{CID}(A,B) \le \rho \,(D_{CID}(A,C) + D_{CID}(C,B))\]

We start our proof by stating the triangular inequality of the Euclidean distance:

\[ED(A,B) \le ED(A,C) + ED(C,B)\]

Remember that the complexity correction factor \(CF\) is a quantity greater than or equal to one; therefore, we can multiply both sides of the inequality by \(CF(A,B)\):

\[CF(A,B)\,ED(A,B) \le CF(A,B)\,(ED(A,C) + ED(C,B))\]

The left-hand side of the inequality is our definition of CID; hence:

\[D_{CID}(A,B) \le \rho \,(ED(A,C) + ED(C,B))\]

with \(\rho = CF(A,B)\).

Finally, we can again use the fact that \(CF\) is greater than or equal to one to replace the Euclidean distances with CID on the right-hand side, since \(ED(A,C) \le D_{CID}(A,C)\) and \(ED(C,B) \le D_{CID}(C,B)\). And, therefore:

\[D_{CID}(A,B) \le \rho \,(D_{CID}(A,C) + D_{CID}(C,B))\]
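The inequality can also be exercised numerically. The sketch below is our own illustration (function names are ours); it follows the definitions of \(CE\), \(CF\), and \(D_{CID}\) used in the proof and checks both the \(\rho \)-relaxed triangular inequality and the implied \(2\rho \)-inframetric inequality on random series:

```python
import numpy as np

def ce(q):
    # complexity estimate: root of summed squared consecutive differences
    return np.sqrt(np.sum(np.diff(q) ** 2))

def cf(a, b):
    # complexity correction factor, always >= 1
    return max(ce(a), ce(b)) / min(ce(a), ce(b))

def cid(a, b):
    # D_CID(A, B) = ED(A, B) * CF(A, B)
    return np.sqrt(np.sum((a - b) ** 2)) * cf(a, b)

rng = np.random.default_rng(42)
for _ in range(1000):
    a, b, c = (rng.standard_normal(32) for _ in range(3))
    rho = cf(a, b)
    # rho-relaxed triangular inequality
    assert cid(a, b) <= rho * (cid(a, c) + cid(c, b)) + 1e-9
    # implied 2*rho-inframetric inequality
    assert cid(a, b) <= 2 * rho * max(cid(a, c), cid(c, b)) + 1e-9
```

Note that \(\rho \) depends on the pair \((A,B)\), which is exactly why the inequality is "relaxed" rather than the ordinary metric triangular inequality.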
Batista, G.E.A.P.A., Keogh, E.J., Tataw, O.M. et al. CID: an efficient complexity-invariant distance for time series. Data Min Knowl Disc 28, 634–669 (2014). https://doi.org/10.1007/s10618-013-0312-3