## Abstract

Sequence segmentation is a well-studied problem: given a sequence of elements, an integer *K*, and some measure of homogeneity, the task is to split the sequence into *K* contiguous segments that are maximally homogeneous. The classic approach to finding the optimal solution is dynamic programming. Unfortunately, the execution time of this program is quadratic with respect to the length of the input sequence, which makes the algorithm slow for sequences of non-trivial length. In this paper we study segmentations whose measure of goodness is based on log-linear models, a rich family that contains many of the standard distributions. We present a theoretical result that allows us to prune many suboptimal segmentations. Using this result, we modify the standard dynamic program for 1D log-linear models and thereby reduce the computational time. We demonstrate empirically that this approach can significantly reduce the computational burden of finding the optimal segmentation.
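To make the quadratic running time concrete, the following is a minimal Python sketch of the classic dynamic program. It is not the paper's implementation: the function name `segment` and the squared-error cost (standing in for a generic homogeneity measure) are assumptions chosen for illustration, and the sketch minimizes a cost rather than maximizing a score.

```python
from typing import List, Tuple


def segment(xs: List[float], K: int) -> Tuple[float, List[int]]:
    """Split xs into K contiguous segments with minimal total squared error.

    Returns the optimal cost and the exclusive end index of each segment.
    Squared error is an assumed, illustrative cost.
    """
    n = len(xs)
    if not 1 <= K <= n:
        raise ValueError("K must satisfy 1 <= K <= len(xs)")

    # Prefix sums give the cost of any segment in constant time.
    p1, p2 = [0.0], [0.0]
    for x in xs:
        p1.append(p1[-1] + x)
        p2.append(p2[-1] + x * x)

    def cost(i: int, j: int) -> float:
        # Squared error of xs[i:j] around its mean, for i < j.
        s = p1[j] - p1[i]
        return (p2[j] - p2[i]) - s * s / (j - i)

    inf = float("inf")
    # best[k][j]: optimal cost of splitting xs[:j] into k segments.
    best = [[inf] * (n + 1) for _ in range(K + 1)]
    back = [[0] * (n + 1) for _ in range(K + 1)]
    best[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            # The inner loop over the last breakpoint i makes each layer quadratic in n.
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    back[k][j] = i

    # Follow the back pointers to recover the segment ends.
    ends = []
    j = n
    for k in range(K, 0, -1):
        ends.append(j)
        j = back[k][j]
    ends.reverse()
    return best[K][n], ends


total, ends = segment([1, 1, 1, 5, 5, 5, 9, 9], 3)
```

The three nested loops give a running time of order *K* times the square of the sequence length. Pruning candidate breakpoints in the innermost loop is the kind of saving the paper targets.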

## Notes

Calders et al. (2007) deal only with binary sequences but we can easily extend these results to the general case.

The implementation of the algorithm is given at http://adrem.ua.ac.be/segmentation.

For clarity's sake, the figures show average lifetimes of bins containing 40 points.

The datasets were obtained from http://www.cs.ucr.edu/~eamonn/discords/.

## References

Basseville M, Nikiforov IV (1993) Detection of abrupt changes—theory and application. Prentice-Hall, Englewood Cliffs

Bellman R (1961) On the approximation of curves by line segments using dynamic programming. Commun ACM 4(6):284

Bernaola-Galván P, Román-Roldán R, Oliver JL (1996) Compositional segmentation and long-range fractal correlations in DNA sequences. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 53(5):5181–5189

Calders T, Dexters N, Goethals B (2007) Mining frequent itemsets in a stream. In: ICDM, pp 83–92

Douglas D, Peucker T (1973) Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Can Cartogr 10(2):112–122

Džeroski S, Goethals B, Panov P (eds) (2011) Inductive databases and constraint-based data mining. Springer, New York

Gedikli A, Aksoy H, Unal NE, Kehagias A (2010) Modified dynamic programming approach for offline segmentation of long hydrometeorological time series. Stoch Environ Res Risk Assess 24(5):547–557

Gionis A, Mannila H (2003) Finding recurrent sources in sequences. In: Proceedings of the seventh annual international conference on research in computational molecular biology, RECOMB ’03, pp 123–130

Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge

Haiminen N, Gionis A (2004) Unimodal segmentation of sequences. In: ICDM, pp 106–113

Himberg J, Korpiaho K, Mannila H, Tikanmäki J, Toivonen H (2001) Time series segmentation for context recognition in mobile devices. In: ICDM, pp 203–210

Keogh EJ, Lin J, Fu AWC (2005) HOT SAX: efficiently finding the most unusual time series subsequence. In: ICDM, pp 226–233

Kifer D, Ben-David S, Gehrke J (2004) Detecting change in data streams. In: VLDB, pp 180–191

Lavrenko V, Schmill M, Lawrie D, Ogilvie P, Jensen D, Allan J (2000) Mining of concurrent text and time series. In: KDD workshop on text mining, pp 37–44

Palpanas T, Vlachos M, Keogh EJ, Gunopulos D, Truppel W (2004) Online amnesic approximation of streaming time series. In: ICDE, pp 339–349

Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

Shatkay H, Zdonik SB (1996) Approximate queries and representations for large data sequences. In: ICDE, pp 536–545

Terzi E, Tsaparas P (2006) Efficient algorithms for sequence segmentation. In: SIAM data mining

## Acknowledgments

Nikolaj Tatti was partly supported by a Post-Doctoral Fellowship of the Research Foundation-Flanders (FWO).

## Additional information

Communicated by Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, Filip Zelezny.

## Appendix: Proofs

### A.1 Proof of Theorem 1

Theorem 1 will follow from the following theorem.

**Theorem 9**

Let \(D = \left(D_1 ,\ldots , D_e\right)\). Let \(1 \le m < e\). Assume that \({ diff }\mathopen {}\left(D[1, m], D[m + 1, e]\right)\) is a cover. Then there exists \(n > m\) such that \({ sc }\mathopen {}\left([1, n], [n + 1, e]\right) > { sc }\mathopen {}\left([1, m], [m + 1, e]\right)\) or there exists \(l < m\) such that \({ sc }\mathopen {}\left([1, l], [l + 1, e]\right) \ge { sc }\mathopen {}\left([1, m], [m + 1, e]\right)\).

In order to prove the theorem we will introduce some helpful notation. First, given a parameter vector \(s\) and \(r\), we define

Note that \(h(k \mid s, r) \le { sc }\mathopen {}\left([1, k], [k + 1, e]\right)\). We also define

This function is essentially the difference between two scores.

**Lemma 1**

Let \(k > l\). We have \(h(k \mid s, r) - h(l \mid s, r) =g(k - l, { av }\mathopen {}\left(l + 1, k\right) \mid s, r)\).

*Proof*

Note that

The last two terms do not depend on \(k\). This allows us to write

This completes the proof. \(\square \)

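*Proof of Theorem 9* Write \(y = { sc }\mathopen {}\left([1, m], [m + 1, e]\right)\) and define

\[ x = \max_{l < m} { sc }\mathopen {}\left([1, l], [l + 1, e]\right) \quad \text{and} \quad z = \max_{n > m} { sc }\mathopen {}\left([1, n], [n + 1, e]\right). \]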

We need to show that either \(x \ge y\) or \(z > y\). Assume that \(z \le y\). Fix \(\epsilon > 0\). By definition, there exist \(s\) and \(r\) such that

\[ h(m \mid s, r) \ge y - \epsilon . \]

From now on we will write \(h(k)\) to mean \(h(k \mid s, r)\) and \(g(k, \delta )\) to mean \(g(k, \delta \mid s, r)\). We must have \(h(m) + \epsilon \ge y \ge z\) or, equivalently, \(\epsilon \ge z - h(m)\).

Since \({ diff }\mathopen {}\left(D[1, m], D[m + 1, e]\right)\) is a cover, there exist integers \(l\) and \(n, 0 \le l < m < n \le e\), such that \((\alpha - \beta )^T(s - r) \ge 0\), where \(\alpha = { av }\mathopen {}\left(m + 1, n\right)\) and \(\beta = { av }\mathopen {}\left(l + 1, m\right)\).

Define \(c = (n - m) / (m - l)\). We now have

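which implies \(y - x \le \epsilon (1 + c^{-1}) \le \epsilon (1 + e)\). The last inequality holds because \(n - m \ge 1\) and \(m - l \le m < e\), so that \(c^{-1} = (m - l) / (n - m) \le e\). Since this holds for any \(\epsilon > 0\), we conclude that \(y \le x\). This proves the theorem. \(\square \)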

*Proof of Theorem 1* Let \(P\) be a segmentation and let \(I\) and \(J\) be two consecutive segments such that \({ diff }\mathopen {}\left(D[I], D[J]\right)\) is a cover. We can now apply Theorem 9 to find alternative segments \(I^{\prime }\) and \(J^{\prime }\) such that if we define \(P^{\prime }\) by replacing \(I\) and \(J\) in \(P\) with \(I^{\prime }\) and \(J^{\prime }\), then either \({ sc }\mathopen {}\left(P^{\prime } \mid D\right) > { sc }\mathopen {}\left(P \mid D\right)\), or \({ sc }\mathopen {}\left(P^{\prime } \mid D\right) \ge { sc }\mathopen {}\left(P \mid D\right)\) and \(I^{\prime }\) ends before \(I\). We repeat this until no consecutive segments constitute a cover. The repetition ends because no segmentation occurs twice during these steps and there is a finite number of segmentations. No segmentation occurs twice because at each step either the score strictly increases, or the score stays the same and a breakpoint moves to the left. \(\square \)

### A.2 Proof of Theorem 8

Let \(U\) be the tree resulting from \({\textsc {UpdateTree}}(T, C, D, i)\). To prove the theorem we need to show that the paths of \(U\) from the leaves to the root consist of borders, that there are no nodes in \(U\) outside the borders, and that children are ordered. We will prove these results in a series of lemmata.

**Lemma 2**

Let \(T^{\prime }\) be a tree after we have added a node \(i\) in UpdateTree. Let \(n \ne i\) be a node in \(T^{\prime }\) and let \(m\) be its parent. Let \(c \in C\) be such that \(n \in { borders }\mathopen {}\left(c, i - 1\right)\). If \(m \notin { borders }\mathopen {}\left(c, i\right)\), then \(n\) will cease to be a child of \(m\) during some stage of UpdateTree.

*Proof*

Let \(r\) be a root node of \(T^{\prime }\). Consider a pre-order of nodes of \(T^{\prime }\), that is, parents and earlier siblings come first. We will prove the lemma using induction on the pre-order.

To prove the first step, let \(n\) be the first child of \(i\). If \(i \notin { borders }\mathopen {}\left(c, i\right)\), then Theorem 5 implies that \({ av }\mathopen {}\left(n, i\right) \ge { av }\mathopen {}\left(i, i\right)\) which is exactly the test on Line 9. Hence, \(n\) will be disconnected from \(i\).

Let us now prove the induction step. Let \(p\) be the parent of \(m\) in \(T^{\prime }\). Assume that \(p \ne r\). Note that \(p\) is the border next to \(m\) in \({ borders }\mathopen {}\left(c, i - 1\right)\). Theorem 5 implies that \(p \notin { borders }\mathopen {}\left(c, i\right)\), hence the induction assumption implies that \(m\) and \(p\) are disconnected and \(m\) becomes a child of \(r\) at some point.

Assume now that \(n\) is not the first child of \(m\). Let \(q\) be the left sibling of \(n\), and let \(p\) be such that \(q \in { borders }\mathopen {}\left(p, i - 1\right)\). Theorem 3 implies that \({ av }\mathopen {}\left(q, m - 1\right) \ge { av }\mathopen {}\left(j, m - 1\right)\) for any \(q \le j < m\). Since \(n > q\), we must have \({ av }\mathopen {}\left(q, m - 1\right) \ge { av }\mathopen {}\left(n, m - 1\right) \ge { av }\mathopen {}\left(m, i\right)\), which implies that \(m \notin { borders }\mathopen {}\left(p, i\right)\). Again, the induction assumption implies that \(q\) and \(m\) will be disconnected. Consequently, \(n\) will be the first child of \(m\) at some point.

Note that while moving \(m\) or left siblings of \(n\) to be children of \(r\) we move the current node \(a\) in UpdateTree to the left. Hence, there will be a point where \(a = m\) and \(n\) is the first child of \(m\). Theorem 5 implies that \({ av }\mathopen {}\left(n, i\right) \ge { av }\mathopen {}\left(m, i\right)\) which is exactly the test on Line 9. Hence, \(n\) will be disconnected from \(m\). This proves the lemma. \(\square \)

**Lemma 3**

For every \(c \in C\), a path in \(U\) from \(c\) to a child of the root node \(r\) equals \({ borders }\mathopen {}\left(c, i\right)\).

*Proof*

Fix \(c \in C\) and let \(\left(b_1 ,\ldots , b_M\right) = { borders }\mathopen {}\left(c, i - 1\right)\) and define \(b_{M + 1} = i\). Theorem 5 implies that there is \(1 \le N \le M + 1\) such that \(\left(b_1 ,\ldots , b_N\right) = { borders }\mathopen {}\left(c, i\right)\).

After adding \(i\) to \(T\), UpdateTree will not add new nodes into the path from \(c\) to \(r\). Lemma 2 now implies that the path from \(c\) to \(r\) will be \(\left(b_1 ,\ldots , b_K\right)\), where \(K \le N\). If \(N = 1\), then immediately \(K = 1\). To conclude that \(K = N\) in general, assume that \(N > 1\) and assume that at some point in UpdateTree we have \(a = b_N\) and \(b = b_{N - 1}\). Then, according to Theorem 5, the test on Line 9 will fail and \(b_{N - 1}\) remains as a child of \(b_N\). \(\square \)

**Lemma 4**

Let \(n\) be a node in \(U\), then there is \(c \in C\) such that \(n \in { borders }\mathopen {}\left(c, i\right)\).

*Proof*

Let \(m\) be a node that occurs in \(T\) but not in \({ btree }\mathopen {}\left(D[1, i], C\right)\). The lemma will follow if we can show that \(m\) is not in \(U\). Let \(n\) be the last child of \(m\). Lemma 2 implies that at some point \(n\) will be disconnected from \(m\), and we will visit \(m\) when it is a leaf. Since \(m \notin C\), we will delete \(m\). \(\square \)

**Lemma 5**

Consider a post-order of nodes of \(T = { btree }\mathopen {}\left(D[1, i - 1], C\right)\), that is, parents and later siblings come first. Node values decrease with respect to this order.

*Proof*

We will prove that the following holds: Let \(n\) be a node and let \(m\) be its left sibling. Let \(q\) be the smallest child of \(n\). Then \(m < q\). Note that this automatically proves the lemma.

Note that \(q \in C\). To prove that \(m < q\), let \(c \in C\) be such that \(m \in { borders }\mathopen {}\left(c, i - 1\right)\). If \(c \ge q\), then since \(n > m \ge c\), Theorem 7 implies that \(n \in { borders }\mathopen {}\left(c, i - 1\right)\), which is a contradiction. Consequently, \(c < q\). If \(q \le m\), then again Theorem 7 implies that \(m \in { borders }\mathopen {}\left(q, i - 1\right)\), which is a contradiction. This proves that \(m < q\). \(\square \)

**Lemma 6**

Child nodes of each node in \(U\) are ordered from smallest to largest.

*Proof*

UpdateTree modifies the tree by moving the first child of a node \(a\) to be the left sibling of \(a\). This does not change the post-order of the nodes. This implies that, since node values decrease with respect to the post-order in \(T\), they will also decrease in \(U\). This proves the lemma. \(\square \)
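The invariance used in this proof can be checked on a small example. The Python sketch below is illustrative only: the dictionary-based tree, the function names `reverse_preorder` and `promote_first_child`, and the toy node values are assumptions, not part of UpdateTree as defined in the paper. It uses the order from Lemma 5, where a node comes first and its children follow from last to first.

```python
from typing import Dict, List

# Hypothetical representation: node -> its children, ordered left to right.
Tree = Dict[int, List[int]]


def reverse_preorder(tree: Tree, root: int) -> List[int]:
    """List nodes so that parents and later siblings come first."""
    out = [root]
    for child in reversed(tree.get(root, [])):
        out.extend(reverse_preorder(tree, child))
    return out


def promote_first_child(tree: Tree, parent: int, a: int) -> None:
    """Move the first child of `a` so that it becomes the left sibling of `a`."""
    first = tree[a].pop(0)
    position = tree[parent].index(a)
    tree[parent].insert(position, first)


# Toy tree: root 10 has children 4 and 9; node 9 has children 6 and 8; node 6 has child 5.
toy: Tree = {10: [4, 9], 9: [6, 8], 6: [5], 4: [], 8: [], 5: []}

before = reverse_preorder(toy, 10)
promote_first_child(toy, parent=10, a=9)
after = reverse_preorder(toy, 10)

# The move leaves the visiting order unchanged, so node values still decrease.
assert before == after
```

In this toy tree the order is 10, 9, 8, 6, 5, 4 both before and after the move. The moved subtree was visited last among the children of node 9 and is visited immediately after node 9's remaining subtree once it becomes the left sibling.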

## About this article

### Cite this article

Tatti, N. Fast sequence segmentation using log-linear models.
*Data Min Knowl Disc* **27**, 421–441 (2013). https://doi.org/10.1007/s10618-012-0301-y
