# Fast sequence segmentation using log-linear models


## Abstract

Sequence segmentation is a well-studied problem: given a sequence of elements, an integer *K*, and a measure of homogeneity, the task is to split the sequence into *K* contiguous segments that are maximally homogeneous. The classic approach to finding the optimal solution is a dynamic program. Unfortunately, the execution time of this program is quadratic in the length of the input sequence, which makes the algorithm slow for sequences of non-trivial length. In this paper we study segmentations whose measure of goodness is based on log-linear models, a rich family that contains many of the standard distributions. We present a theoretical result that allows us to prune many suboptimal segmentations. Using this result, we modify the standard dynamic program for 1D log-linear models and thereby reduce the computational time. We demonstrate empirically that this approach can significantly reduce the computational burden of finding the optimal segmentation.
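For orientation, the standard (unpruned) dynamic program mentioned in the abstract can be sketched as follows. This is a minimal illustration of the baseline only, not the pruned algorithm the paper proposes. The function name `segment` and the choice of squared error around the segment mean as the homogeneity measure are assumptions made for this example; squared error is used here as a simple stand-in for a Gaussian model with fixed variance.

```python
from itertools import accumulate


def segment(xs, k):
    """Split xs into k contiguous segments minimising total squared error.

    Returns (cost, ends), where ends[m] is the exclusive end index of
    segment m. Hypothetical helper for illustration only.
    """
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(xs)")

    # Prefix sums let us evaluate the cost of any segment in O(1).
    s1 = [0.0] + list(accumulate(xs))
    s2 = [0.0] + list(accumulate(x * x for x in xs))

    def cost(i, j):
        # Squared error of xs[i:j] around its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal cost of splitting xs[:j] into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    # back[m][j]: start index of the last segment in that optimal split.
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0

    for m in range(1, k + 1):
        for j in range(m, n + 1):
            # Trying every start i of the last segment is what makes the
            # running time quadratic in n (O(k * n^2) overall).
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    back[m][j] = i

    # Walk the back-pointers to recover the segment end points.
    ends = []
    j = n
    for m in range(k, 0, -1):
        ends.append(j)
        j = back[m][j]
    ends.reverse()
    return best[k][n], ends
```

Calling `segment([1, 1, 1, 5, 5, 5, 9, 9], 3)` splits the sequence after the third and sixth elements. The innermost loop over candidate start points `i` is the part whose cost grows quadratically with the sequence length, and it is this search over candidates that a pruning result would aim to shrink.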

## Keywords

Segmentation, Pruning, Change-point detection, Dynamic program

## Notes

### Acknowledgments

Nikolaj Tatti was partly supported by a Post-Doctoral Fellowship of the Research Foundation-Flanders (FWO).
