iSAX: disk-aware mining and indexing of massive time series datasets

Abstract

Current research in indexing and mining time series data has produced many interesting algorithms and representations. However, the algorithms and the size of data considered have generally not been representative of the increasingly massive datasets encountered in science, engineering, and business domains. In this work, we introduce a novel multi-resolution symbolic representation that can be used to index datasets several orders of magnitude larger than anything else considered in the literature. To demonstrate the utility of this representation, we constructed a simple tree-based index structure that facilitates fast exact search and approximate search that is orders of magnitude faster. For example, with a database of one hundred million time series, the approximate search can retrieve high-quality nearest neighbors in slightly over a second, whereas a sequential scan would take tens of minutes. Our experimental evaluation demonstrates that our representation allows index performance to scale well with increasing dataset sizes. Additionally, we provide analysis concerning parameter sensitivity, approximate search effectiveness, and lower bound comparisons between time series representations in a bit-constrained environment. We further show how to exploit the combination of both exact and approximate search as subroutines in data mining algorithms, allowing for the exact mining of truly massive real-world datasets containing tens of millions of time series.
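
The abstract does not spell out how the multi-resolution symbolic representation is built, so the following is only a minimal sketch of one common reading, assuming SAX-style discretization: z-normalize the series, average it into equal-width segments, and map each segment mean to a symbol using equiprobable Gaussian breakpoints. All function names (`znorm`, `paa`, `breakpoints`, `symbolize`, `coarsen`) and the sample values are hypothetical and chosen for illustration; they are not taken from the paper. The sketch illustrates why power-of-two cardinalities give a multi-resolution word: a coarser symbol can be obtained from a finer one by dropping low-order bits.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def znorm(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-width segment."""
    if len(series) % segments != 0:
        raise ValueError("series length must be a multiple of the segment count")
    width = len(series) // segments
    return [mean(series[i * width:(i + 1) * width]) for i in range(segments)]


def breakpoints(cardinality):
    """Cut points splitting the standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / cardinality) for k in range(1, cardinality)]


def symbolize(values, cardinality):
    """Map each value to an integer symbol in the range [0, cardinality)."""
    cuts = breakpoints(cardinality)
    return [bisect_right(cuts, v) for v in values]


def coarsen(symbols, from_bits, to_bits):
    """Lower the resolution of a word by dropping low-order bits of each symbol."""
    return [s >> (from_bits - to_bits) for s in symbols]


# Assumed toy input: 8 points reduced to a 4-symbol word.
series = [0.0, 2.0, 1.0, 5.0, 9.0, 7.0, 3.0, 4.0]
segments = paa(znorm(series), 4)
word_card8 = symbolize(segments, 8)   # 3 bits per symbol
word_card4 = symbolize(segments, 4)   # 2 bits per symbol
print(word_card8, word_card4, coarsen(word_card8, 3, 2))
```

Under these assumptions, the breakpoints for cardinality 4 are a subset of those for cardinality 8, so coarsening the 3-bit word yields the same word as discretizing directly at 2 bits. A tree index can plausibly exploit this by splitting a full node on one additional bit of a single symbol, though the exact splitting policy is not stated in the abstract.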

Author information

Corresponding author

Correspondence to Jin Shieh.

Additional information

Responsible editor: Bart Goethals.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Shieh, J., Keogh, E. iSAX: disk-aware mining and indexing of massive time series datasets. Data Min Knowl Disc 19, 24–57 (2009). https://doi.org/10.1007/s10618-009-0125-6

Keywords

  • Time series
  • Data mining
  • Representations
  • Indexing