Data Mining and Knowledge Discovery, Volume 19, Issue 1, pp 24–57

iSAX: disk-aware mining and indexing of massive time series datasets

  • Jin Shieh
  • Eamonn Keogh
Open Access
Article

Abstract

Current research in indexing and mining time series data has produced many interesting algorithms and representations. However, the algorithms and the size of data considered have generally not been representative of the increasingly massive datasets encountered in science, engineering, and business domains. In this work, we introduce a novel multi-resolution symbolic representation that can be used to index datasets several orders of magnitude larger than anything else considered in the literature. To demonstrate the utility of this representation, we constructed a simple tree-based index structure that supports fast exact search and approximate search that is orders of magnitude faster. For example, with a database of one hundred million time series, the approximate search can retrieve high-quality nearest neighbors in slightly over a second, whereas a sequential scan would take tens of minutes. Our experimental evaluation demonstrates that our representation allows index performance to scale well with increasing dataset sizes. Additionally, we provide analysis concerning parameter sensitivity, approximate search effectiveness, and lower bound comparisons between time series representations in a bit-constrained environment. We further show how to exploit the combination of exact and approximate search as subroutines in data mining algorithms, allowing for the exact mining of truly massive real-world datasets containing tens of millions of time series.
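
As a rough illustration of what a multi-resolution symbolic representation can look like, the following Python sketch converts a time series into a short word of integer symbols in the style of SAX (z-normalization, piecewise aggregate approximation, then discretization against Gaussian breakpoints). It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`znorm`, `paa`, `breakpoints`, `symbolize`, `demote`), the restriction to power-of-two cardinalities, and the choice to give the lowest values the symbol 0 are all assumptions made for this example.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def znorm(series):
    """Rescale a series to zero mean and unit variance."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)  # constant series: no shape to encode
    return [(x - mu) / sigma for x in series]


def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-width segment."""
    if len(series) % segments != 0:
        raise ValueError("this sketch assumes len(series) is divisible by segments")
    width = len(series) // segments
    return [mean(series[i * width:(i + 1) * width]) for i in range(segments)]


def breakpoints(bits):
    """Breakpoints that cut N(0, 1) into 2**bits equiprobable regions (assumed scheme)."""
    cardinality = 1 << bits
    return [NormalDist().inv_cdf(k / cardinality) for k in range(1, cardinality)]


def symbolize(series, segments, bits):
    """Word of `segments` integer symbols, each in range(2**bits)."""
    bps = breakpoints(bits)
    return [bisect_right(bps, v) for v in paa(znorm(series), segments)]


def demote(word, from_bits, to_bits):
    """Coarsen a word to a lower cardinality by dropping low-order bits."""
    return [symbol >> (from_bits - to_bits) for symbol in word]
```

Under these assumptions, the breakpoints for cardinality 2^(b-1) are a subset of those for cardinality 2^b, so a coarser word can be derived from a finer one by a bit shift (`demote(symbolize(s, 4, 3), 3, 2)` equals `symbolize(s, 4, 2)`). That prefix property is what makes it possible to compare words stored at different resolutions.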

Keywords

Time series · Data mining · Representations · Indexing

Notes

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Copyright information

© The Author(s) 2009

Authors and Affiliations

  1. Department of Computer Science & Engineering, University of California, Riverside, USA
