Abstract
Current research in indexing and mining time series data has produced many interesting algorithms and representations. However, the algorithms and the sizes of the datasets considered have generally not been representative of the increasingly massive datasets encountered in science, engineering, and business domains. In this work, we introduce a novel multi-resolution symbolic representation that can be used to index datasets several orders of magnitude larger than anything else considered in the literature. To demonstrate the utility of this representation, we constructed a simple tree-based index structure that facilitates fast exact search and approximate search that is orders of magnitude faster. For example, with a database of one hundred million time series, the approximate search can retrieve high-quality nearest neighbors in slightly over a second, whereas a sequential scan would take tens of minutes. Our experimental evaluation demonstrates that our representation allows index performance to scale well with increasing dataset sizes. Additionally, we provide analysis concerning parameter sensitivity, approximate search effectiveness, and lower bound comparisons between time series representations in a bit-constrained environment. We further show how to exploit the combination of exact and approximate search as subroutines in data mining algorithms, allowing for the exact mining of truly massive real-world datasets containing tens of millions of time series.
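The abstract does not spell out how a multi-resolution symbolic representation works, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the standard SAX-style construction: z-normalize the series, average it over equal-width segments, and map each segment mean to a symbol using breakpoints that split a standard normal distribution into equiprobable regions. All function names (`breakpoints`, `paa`, `symbolize`, `coarsen`) are hypothetical. The point it illustrates is that when cardinalities are powers of two, the breakpoints nest, so a fine-resolution word can be reduced to a coarser one by dropping low-order bits.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def breakpoints(cardinality):
    """Cut points splitting N(0, 1) into `cardinality` equiprobable regions."""
    normal = NormalDist()
    return [normal.inv_cdf(i / cardinality) for i in range(1, cardinality)]


def paa(series, word_length):
    """Z-normalize, then average over `word_length` equal-width segments.

    Assumption: len(series) is divisible by word_length.
    """
    mean, std = fmean(series), pstdev(series)
    if std == 0:
        normalized = [0.0] * len(series)
    else:
        normalized = [(x - mean) / std for x in series]
    seg = len(series) // word_length
    return [fmean(normalized[i * seg:(i + 1) * seg]) for i in range(word_length)]


def symbolize(series, word_length, bits):
    """Map each segment mean to an integer symbol in [0, 2**bits)."""
    cuts = breakpoints(2 ** bits)
    return [bisect_right(cuts, value) for value in paa(series, word_length)]


def coarsen(word, from_bits, to_bits):
    """Lower the resolution of a word by dropping low-order bits per symbol."""
    return [symbol >> (from_bits - to_bits) for symbol in word]


series = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
fine = symbolize(series, word_length=4, bits=3)
coarse = coarsen(fine, from_bits=3, to_bits=1)
# The coarsened word agrees with symbolizing directly at the lower resolution.
print(fine, coarse, coarse == symbolize(series, word_length=4, bits=1))
```

Because a coarse symbol is a bit prefix of the corresponding fine symbol, words stored at different resolutions remain comparable, which is the property a tree-based index can exploit when it refines only some nodes.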
Open Access
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Additional information
Responsible editor: Bart Goethals.
About this article
Cite this article
Shieh, J., Keogh, E. iSAX: disk-aware mining and indexing of massive time series datasets. Data Min Knowl Disc 19, 24–57 (2009). https://doi.org/10.1007/s10618-009-0125-6
DOI: https://doi.org/10.1007/s10618-009-0125-6