Data Mining and Knowledge Discovery, Volume 15, Issue 2, pp 107–144

Experiencing SAX: a novel symbolic representation of time series

  • Jessica Lin
  • Eamonn Keogh
  • Li Wei
  • Stefano Lonardi
DOI: 10.1007/s10618-007-0064-z

Cite this article as:
Lin, J., Keogh, E., Wei, L. et al. Data Min Knowl Disc (2007) 15: 107. doi:10.1007/s10618-007-0064-z

Abstract

Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Second, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series.

In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation on various data mining tasks of clustering, classification, query by content, anomaly detection, motif discovery, and visualization.
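The abstract does not spell out the construction, so the following is only a minimal sketch, assuming the commonly described SAX pipeline: z-normalize the series, reduce it with piecewise aggregate approximation (PAA), map each segment mean to a symbol using equiprobable breakpoints of the standard normal distribution, and compare words with a MINDIST-style measure. All function names and parameters (`znorm`, `paa`, `breakpoints`, `sax_word`, `mindist`, `w`, `a`) are illustrative choices, not identifiers taken from the article.

```python
import bisect
import math
from statistics import NormalDist, mean, pstdev


def znorm(series):
    """Rescale a series to zero mean and unit variance (assumed preprocessing)."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]


def paa(series, w):
    """Piecewise aggregate approximation: the mean of each of w equal segments."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes len(series) is divisible by w")
    seg = n // w
    return [mean(series[i * seg:(i + 1) * seg]) for i in range(w)]


def breakpoints(a):
    """Cut points splitting the standard normal into a equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(i / a) for i in range(1, a)]


def sax_word(series, w, a):
    """Symbol indices (0 .. a-1) for a series, word length w, alphabet size a."""
    cuts = breakpoints(a)
    return [bisect.bisect_right(cuts, v) for v in paa(znorm(series), w)]


def mindist(word_q, word_c, n, a):
    """Distance between two words; n is the length of the original series."""
    cuts = breakpoints(a)
    total = 0.0
    for r, c in zip(word_q, word_c):
        lo, hi = min(r, c), max(r, c)
        if hi - lo > 1:
            # Gap between the upper edge of the lower symbol's region
            # and the lower edge of the higher symbol's region.
            total += (cuts[hi - 1] - cuts[lo]) ** 2
        # Identical or adjacent symbols contribute zero.
    return math.sqrt(n / len(word_q)) * math.sqrt(total)


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))
```

Under these assumptions, `mindist(sax_word(q, w, a), sax_word(c, w, a), len(q), a)` should never exceed `euclidean(znorm(q), znorm(c))`, which is the lower-bounding property the abstract refers to: a search that prunes candidates on the symbolic distance cannot discard a true match.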

Keywords

Time series, Data mining, Symbolic representation, Discretize

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Jessica Lin (1)
  • Eamonn Keogh (2)
  • Li Wei (2)
  • Stefano Lonardi (2)

  1. Information and Software Engineering Department, George Mason University, Fairfax, USA
  2. Computer Science & Engineering Department, University of California-Riverside, Riverside, USA
