An Algorithm for Segmenting Categorical Time Series into Meaningful Episodes

  • Paul Cohen
  • Niall Adams
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2189)


This paper describes an unsupervised algorithm for segmenting categorical time series. The algorithm first collects statistics about the frequency and boundary entropy of ngrams, then passes a window over the series and has two “expert methods” decide where in the window boundaries should be drawn. The algorithm segments text into words successfully in three languages. We claim that the algorithm finds meaningful episodes in categorical time series, because it exploits two statistical characteristics of meaningful episodes.
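The two-pass procedure summarized above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify how the statistics are normalized, how each expert chooses a split, or how votes become boundaries. The sketch assumes that scores are z-scored within each ngram length, that each expert casts one vote per window position, and that a boundary is drawn where the vote count reaches a threshold and is a local maximum. The names `ngram_stats`, `boundary_entropy`, `standardize`, `segment`, `window`, and `threshold` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngram_stats(text, max_len):
    """Count every ngram up to max_len and the symbols that follow each one."""
    freq = Counter()
    followers = defaultdict(Counter)
    for i in range(len(text)):
        for n in range(1, max_len + 1):
            if i + n > len(text):
                break
            gram = text[i:i + n]
            freq[gram] += 1
            if i + n < len(text):
                followers[gram][text[i + n]] += 1
    return freq, followers


def boundary_entropy(followers, gram):
    """Shannon entropy (in bits) of the symbol that follows gram."""
    counts = followers.get(gram)
    if not counts:
        return 0.0
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def standardize(scores):
    """Z-score each value among ngrams of the same length (an assumption)."""
    by_len = defaultdict(list)
    for gram, value in scores.items():
        by_len[len(gram)].append(value)
    params = {}
    for n, group in by_len.items():
        mean = sum(group) / len(group)
        sd = math.sqrt(sum((x - mean) ** 2 for x in group) / len(group))
        params[n] = (mean, sd)
    out = {}
    for gram, value in scores.items():
        mean, sd = params[len(gram)]
        out[gram] = (value - mean) / sd if sd > 0 else 0.0
    return out


def segment(text, window=3, threshold=2):
    """Split text where the two experts' votes pile up (assumed voting rule)."""
    if window < 2:
        raise ValueError("window must be at least 2")
    # Pass 1: collect frequency and boundary-entropy statistics.
    freq, followers = ngram_stats(text, window)
    z_freq = standardize(dict(freq))
    z_ent = standardize({g: boundary_entropy(followers, g) for g in freq})

    # Pass 2: slide a window; each expert votes for one interior split point.
    votes = [0] * (len(text) + 1)  # votes[i] = votes for a boundary before text[i]
    cuts = range(1, window)
    for start in range(len(text) - window + 1):
        win = text[start:start + window]
        # Frequency expert: prefer the split whose two parts are both frequent.
        f_cut = max(cuts, key=lambda k: z_freq[win[:k]] + z_freq[win[k:]])
        # Entropy expert: prefer the split after the least predictable prefix.
        e_cut = max(cuts, key=lambda k: z_ent[win[:k]])
        votes[start + f_cut] += 1
        votes[start + e_cut] += 1

    # Draw a boundary where votes reach the threshold and are not
    # smaller than the votes at either neighboring position.
    bounds = [i for i in range(1, len(text))
              if votes[i] >= threshold
              and votes[i] >= votes[i - 1]
              and votes[i] >= votes[i + 1]]
    pieces, prev = [], 0
    for b in bounds:
        pieces.append(text[prev:b])
        prev = b
    pieces.append(text[prev:])
    return pieces
```

Calling `segment("thecatsatonthematthecatatethedog")` returns a list of substrings that concatenate back to the input. How closely the pieces match words depends on the amount of text and on the assumed `window` and `threshold` values, so a short string like this one should not be expected to segment cleanly.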





Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Paul Cohen (1)
  • Niall Adams (2)
  1. Department of Computer Science, University of Massachusetts, USA
  2. Department of Mathematics, Imperial College, London
