1 Introduction

1.1 Minimal Unique Substrings and Shortest Unique Substrings

A unique substring of string T is a substring of T which appears exactly once in T. Finding unique substrings of DNA sequences has gained attention in bioinformatics [8, 9, 15, 24]. For example, it can be applied in PCR primer design [24] and alignment-free genome comparison [9].

In the last decade, problems related to computing unique substrings of a given string have been studied in the field of string algorithmics. A unique substring u of T is said to be a minimal unique substring (MUS) of T if no proper substring of u is unique in T. Ilie and Smyth [13] formalized MUSs and proposed a linear-time algorithm to compute all MUSs of a given string T.

MUSs have been heavily utilized for solving the shortest unique substring (SUS) problems: A unique substring \(v = T[s\ldots t]\) of T is said to be a shortest unique substring (SUS) of T for a text position p if v contains the position p (i.e., \(p \in [s, t]\)) and no proper substring of v that contains p is unique in T. The single-SUS problem is to preprocess a given string T of length n so that for any subsequent query position p, a SUS for p can be answered quickly. Pei et al. [20] introduced the single-SUS problem and gave an \(O(n^2)\)-time preprocessing scheme which can answer single-SUS queries in constant time. Tsuruta et al. [22], Ileri et al. [12], and Hu et al. [11] independently showed O(n)-time preprocessing schemes which can answer single-SUS queries in constant time. Also, Hon et al. [10] proposed an in-place algorithm for computing SUSs for all positions in linear time. The all-SUS problem is a generalization of the single-SUS problem which requires outputting all SUSs for a given position p. The methods of Tsuruta et al. [22] and Hu et al. [11] can answer all-SUS queries in \(O( occ )\) time, where \(occ\) is the number of SUSs to output. Note that the SUS problem studied by [11] is more general, that is, they find SUSs covering a given interval in the string, instead of a text position. Moreover, Mieno et al. [16] considered the all-SUS problem on a run-length encoded string, and proposed an O(r)-space data structure which can answer all-SUS (interval) queries in \(O(\sqrt{\log r/\log \log r}+ occ )\) time, where r is the size of a given run-length encoded string. Although not mentioned explicitly in [16], the size of their data structure (except for the input string) and query time can be respectively written as O(m) space and \(O(\sqrt{\log m/\log \log m}+ occ )\) time with respect to the number m of MUSs of the input string T. 
Note that all the above algorithms for the SUS problems compute all MUSs of the given string (or some data structure which is essentially equivalent to MUSs) in the preprocessing. We also refer to [1, 3, 7, 17] for related results on the SUS problems.

1.2 Sliding Window Model

In this paper, we tackle the problem of computing MUSs in the sliding window model. In the sliding window model, the input string is given in an online fashion, one character at a time from left to right, and the memory usage is limited to some pre-determined space. The task of the sliding window model is to process all substrings \(T[i\ldots i+d-1]\) of pre-fixed length d in a string T of length n in an incremental fashion, for increasing \(i = 0, \ldots , n-d\). Usually the window size d is set to be much smaller than the string length n, and thus the challenge here is to design efficient algorithms that process all such substrings using only O(d) working space.

A typical application of the sliding window model is data compression; examples are the famous Lempel-Ziv 77 (the original version) [25] and PPM [4]. Recently, Crochemore et al. [5] introduced the problem of computing minimal absent words for a sliding window, and proposed an \(O(n\sigma )\)-time and \(O(d\sigma )\)-space algorithm using suffix trees for a sliding window, where \(\sigma\) is the alphabet size. This paper deals with the problem of computing MUSs in a sliding window. This problem can be directly applied to computing the uniqueness scores of oligonucleotides for designing tiling arrays [8].

1.3 Our Contributions

We begin with combinatorial results on MUSs for a sliding window. Namely, we show that the number of MUSs that are added or deleted by one slide of the window is always constant (Sect. 3). We then present the first efficient algorithm that maintains the set of MUSs for a sliding window of length d over a string of length n in a total of \(O(n \log \sigma ')\) time and O(d) working space where \(\sigma '\le d\) is the maximum number of distinct characters in every window (Sect. 4). Our main algorithmic tool is the suffix tree for a sliding window that requires O(d) space and can be maintained in \(O(n \log \sigma ')\) time [6, 14, 21]. Our algorithm for computing MUSs for a sliding window is built on our combinatorial results, and it keeps track of three different loci over the suffix tree, all of which can be maintained in \(O(\log \sigma ')\) amortized time per each sliding step.

A part of the results reported in this article appeared in a preliminary version of this paper, [18]. The preliminary paper [18] consists of two parts: (1) efficient computation and combinatorial properties of MUSs for a sliding window, and (2) combinatorial properties of minimal absent words (MAWs) [19] for a sliding window. This current article is a full version of the former part (1) which contains complete proofs and supplemental figures which were omitted in the preliminary version [18]. We remark that an extended version of the latter part (2) can be found as an independent article [2].

2 Preliminaries

2.1 Strings

Let \(\Sigma\) be an alphabet of size \(\sigma\). An element of \(\Sigma\) is called a character. An element of \(\Sigma ^*\) is called a string. The length of a string T is denoted by |T|. The empty string \(\varepsilon\) is the string of length 0. If \(T = xyz\), then x, y, and z are called a prefix, substring, and suffix of T, respectively. They are called a proper prefix, proper substring, and proper suffix of T if \(x \ne T\), \(y \ne T\), and \(z \ne T\), respectively. If a string b is a proper prefix of T and is a proper suffix of T, b is called a border of T.

For any \(0 \le i \le |T|-1\), T[i] denotes the ith character of T. For any \(0 \le i \le j \le |T|-1\), \(T[i\ldots j]\) denotes the substring of T starting at position i and ending at position j, i.e., \(T[i\ldots j] = T[i]T[i+1]\cdots T[j]\). For convenience, let \(T[i'\ldots j'] = \varepsilon\) for any \(i' > j'\). For any \(0 \le i \le |T|-1\), \(T[i\ldots ]\) denotes the suffix starting at position i, i.e., \(T[i\ldots ] = T[i\ldots |T|-1]\).

For a non-empty string w, the set of beginning positions of occurrences of w in T is denoted by \(occ _T(w) = \{i \mid T[i\ldots i+|w|-1] = w\}\). Let \(\# occ _T(w) = | occ _T(w)|\). For any substring w of T, w is called unique in T if \(\# occ _T(w) = 1\), quasi-unique in T if \(1 \le \# occ _T(w) \le 2\), and repeating in T if \(\# occ _T(w) \ge 2\). For convenience, let \(\# occ _T(\varepsilon ) = |T|+1\), and thus, \(\varepsilon\) is always repeating in any non-empty string. For any \(0 \le i \le j \le |T|-1\), \(lrSuf _{i,j}\) denotes the longest repeating suffix of \(T[i\ldots j]\), \(sqSuf _{i,j}\) denotes the shortest quasi-unique suffix of \(T[i\ldots j]\), and \(sqPref _{i,j}\) denotes the shortest quasi-unique prefix of \(T[i\ldots j]\). While \(lrSuf _{i,j}\) can be the empty string, both \(sqSuf _{i,j}\) and \(sqPref _{i,j}\) are always non-empty strings for any i, j with \(0 \le i \le j \le |T|-1\). See Fig. 1 for examples.
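The definitions above can be checked by brute force on small inputs. The following Python sketch is only illustrative: the helper names `count_occ`, `lr_suf`, `sq_suf`, and `sq_pref` are hypothetical, and the naive scans are not the data structures used later in the paper. It is applied to the window \(T[2\ldots 11]\) of Fig. 1.

```python
def count_occ(text, w):
    """#occ_text(w): number of (possibly overlapping) occurrences of w in text.
    For w == "" this evaluates to len(text) + 1, matching the convention above."""
    return sum(1 for k in range(len(text) - len(w) + 1) if text.startswith(w, k))


def lr_suf(text):
    """Longest repeating suffix of text (may be the empty string)."""
    for length in range(len(text), -1, -1):
        suffix = text[len(text) - length:]
        if count_occ(text, suffix) >= 2:
            return suffix
    return ""


def sq_suf(text):
    """Shortest non-empty suffix of a non-empty text that occurs at most twice."""
    for length in range(1, len(text) + 1):
        suffix = text[len(text) - length:]
        if count_occ(text, suffix) <= 2:
            return suffix


def sq_pref(text):
    """Shortest non-empty prefix of a non-empty text that occurs at most twice."""
    for length in range(1, len(text) + 1):
        prefix = text[:length]
        if count_occ(text, prefix) <= 2:
            return prefix


T = "bababbabaabbba"
window = T[2:12]  # the window T[2..11] (Python slices exclude the end index)
```

For this window, the three helpers return the suffixes and prefix of lengths 3, 2, and 3, respectively.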

In what follows, we consider an arbitrarily fixed string T of length \(n \ge 1\) over an alphabet \(\Sigma\) of size \(\sigma \ge 2\).

2.2 Minimal Unique Substrings

A unique substring \(u = T[s\ldots t]\) of T is called a minimal unique substring (MUS) of T if and only if both \(T[s+1\ldots t]\) and \(T[s\ldots t-1]\) are repeating in T. Since a unique substring u of T has exactly one occurrence in T, it can be identified with a unique interval [s, t] such that \(0 \le s \le t \le n-1\) and \(u = T[s\ldots t]\). We denote by \(\mathsf {MUS}(T) = \{[s, t] \mid T[s\ldots t]~\text {is a MUS of}~T\}\) the set of intervals corresponding to the MUSs of T. See Fig. 1 for examples of MUSs.
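As a reference point, \(\mathsf {MUS}(T)\) can be computed directly from this definition. The sketch below is a naive polynomial-time baseline, not the algorithm of this paper; the function names `count_occ` and `mus` are hypothetical, and intervals are returned as 0-based inclusive pairs.

```python
def count_occ(text, w):
    """Number of (possibly overlapping) occurrences of w in text;
    the empty string is counted len(text) + 1 times."""
    return sum(1 for k in range(len(text) - len(w) + 1) if text.startswith(w, k))


def mus(text):
    """Set of inclusive intervals (s, t) such that text[s..t] is a MUS of text."""
    n = len(text)
    result = set()
    for s in range(n):
        for t in range(s, n):
            unique = count_occ(text, text[s:t + 1]) == 1
            # text[s+1..t] and text[s..t-1] must both be repeating.
            drop_first = count_occ(text, text[s + 1:t + 1]) >= 2
            drop_last = count_occ(text, text[s:t]) >= 2
            if unique and drop_first and drop_last:
                result.add((s, t))
    return result
```

For example, `mus("aabcc")` consists of the intervals of `aa`, `b`, and `cc`.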

This paper deals with the problem of computing MUSs for a sliding window of fixed length d over a given string T, formalized as follows:

Input:

String T of length n and positive integer d (\(< n\)).

Output:

\(\mathsf {MUS}(T[i\ldots i+d-1])\) for all \(0 \le i \le n - d\).

Fig. 1

String \(T = \mathtt {bababbabaabbba}\) of length 14 and its substrings \(lrSuf _{2,11}\), \(sqSuf _{2,11}\), and \(sqPref _{2,11}\) for the current window \(T[2\ldots 11]\)

2.3 Suffix Trees

The suffix tree of string T, denoted \(\mathsf {STree}(T)\), is a compacted trie that represents all suffixes of T. We consider a version of suffix trees a.k.a. Ukkonen trees [23]: Namely, \(\mathsf {STree}(T)\) is a rooted tree such that

  1. each edge is labeled by a non-empty substring of T,

  2. each internal node has at least two children,

  3. the out-going edges of each node begin with mutually distinct characters, and

  4. the suffixes of T that are unique in T are represented by paths from the root to the leaves, and the other suffixes of T that are repeating in T are represented by paths from the root that end either on internal nodes or on edges.

To simplify the description of our algorithm, we assume that there is an auxiliary node \(\perp\) which is the parent of only the root node. The out-going edge of \(\perp\) is labeled with \(\Sigma\); this means that we can go down from \(\perp\) by reading any character in \(\Sigma\). See Fig. 2 for an example of \(\mathsf {STree}(T)\).

Fig. 2

The suffix tree of string \(T = \mathtt {babbabaabb}\), where the suffix links are depicted by broken arrows, the implicit suffix nodes are depicted by black circles, as well as the three kinds of active points are marked. For example of other notions on the suffix tree, substring \(w = \mathtt {abaab}\) of T is considered here

For each node v in \(\mathsf {STree}(T)\), \(parent (v)\) denotes the parent of v, \(str (v)\) denotes the path string from the root to v, \(depth (v)\) denotes the string depth of v (i.e., \(depth (v) = | str (v)|\)), and \(subtree (v)\) denotes the subtree of \(\mathsf {STree}(T)\) rooted at v. For each leaf \(\ell\) in \(\mathsf {STree}(T)\), \(start (\ell )\) denotes the starting position of \(str (\ell )\) in T. For each non-empty substring w of T, \(hed (w) = v\) denotes the highest explicit descendant where w is a prefix of \(str (v)\) and \(depth ( parent (v)) < |w| \le depth (v)\). For each substring w of T, \(locus (w) = \langle {u},{h}\rangle\) represents the locus in \(\mathsf {STree}(T)\) where the path that spells out w from the root terminates, such that \(u = hed (w)\) and \(h = depth (u) - |w| \ge 0\). Namely, h is the off-set length from the child u of the locus for w when w is on an edge, and \(h = 0\) when w is on a node (namely u). We say that a substring w of T with \(locus (w) = \langle {u},{h}\rangle\) is represented by an explicit node if \(h = 0\), and by an implicit node if \(h \ge 1\). We remark that in the Ukkonen tree \(\mathsf {STree}(T)\) of a string T, some repeating suffixes may be represented by implicit nodes. An implicit node which represents a suffix of T is called an implicit suffix node. For any internal node v except for the root, the suffix link of v is a reversed edge from v to the explicit node that represents \(str (v)[1\ldots ]\). The suffix link of the root that represents \(\varepsilon\) points to \(\perp\).

3 Combinatorial Results on MUSs for a Sliding Window

Throughout this section, we consider positions i and j with \(0 \le i \le j \le n-1\) such that \(T[i\ldots j]\) denotes the sliding window for the ith position over the input string T. The following arguments hold for any values of i and j, and hence, they will be useful for sliding windows of any length d. The next lemmas are useful for analyzing combinatorial properties of MUSs and for designing an efficient algorithm for computing MUSs for a sliding window.

Lemma 1

The following three statements are equivalent:

  (1) \(| lrSuf _{i,j}| \ge | sqSuf _{i,j}|\),

  (2) \(\# occ _{T[i\ldots j]}( lrSuf _{i,j}) = 2\), and

  (3) \(\# occ _{T[i\ldots j]}( sqSuf _{i,j}) = 2\).

Proof

(1) \(\Rightarrow\) (2) and (3): Since \(| lrSuf _{i,j}| \ge | sqSuf _{i,j}|\), \(sqSuf _{i,j}\) is a suffix of \(lrSuf _{i,j}\) and thus \(\# occ _{T[i\ldots j]}( sqSuf _{i,j}) \ge \# occ _{T[i\ldots j]}( lrSuf _{i,j})\). By the definitions of \(sqSuf _{i,j}\) and \(lrSuf _{i,j}\), \(\# occ _{T[i\ldots j]}( sqSuf _{i,j}) \le 2\) and \(\# occ _{T[i\ldots j]}( lrSuf _{i,j}) \ge 2\). Thus \(\# occ _{T[i\ldots j]}( lrSuf _{i,j}) = \# occ _{T[i\ldots j]}( sqSuf _{i,j}) = 2\).

(2) \(\Rightarrow\) (1): Since \(\# occ _{T[i\ldots j]}( lrSuf _{i,j}) = 2\), the shortest suffix \(sqSuf _{i,j}\) of \(T[i\ldots j]\) that occurs at most twice in \(T[i\ldots j]\) cannot be longer than \(lrSuf _{i,j}\), i.e., \(| lrSuf _{i,j}| \ge | sqSuf _{i,j}|\).

(3) \(\Rightarrow\) (1): Since \(\# occ _{T[i\ldots j]}( sqSuf _{i,j}) = 2\), the longest suffix \(lrSuf _{i,j}\) of \(T[i\ldots j]\) that occurs at least twice in \(T[i\ldots j]\) is at least as long as \(sqSuf _{i,j}\), i.e., \(| lrSuf _{i,j}| \ge | sqSuf _{i,j}|\).

Figure 1 shows a concrete example where (1) of Lemma 1 holds (and hence both (2) and (3) also hold).
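Lemma 1 can also be sanity-checked by exhaustive enumeration. The following sketch uses hypothetical helper names and a naive occurrence counter; it evaluates the three statements on a window and compares them for all binary windows of small length. Windows of length 1 are skipped in the enumeration, since there \(lrSuf\) is empty and the convention \(\# occ _T(\varepsilon ) = |T|+1\) yields the degenerate count 2.

```python
from itertools import product


def count_occ(text, w):
    """Occurrences of w in text (overlaps allowed); len(text) + 1 for w == ""."""
    return sum(1 for k in range(len(text) - len(w) + 1) if text.startswith(w, k))


def lemma1_statements(window):
    """Truth values of statements (1), (2), (3) of Lemma 1 for a non-empty window."""
    n = len(window)
    # Length of the longest repeating suffix (length 0 always qualifies).
    lr_len = max(k for k in range(n) if count_occ(window, window[n - k:]) >= 2)
    # Length of the shortest non-empty suffix occurring at most twice.
    sq_len = min(k for k in range(1, n + 1) if count_occ(window, window[n - k:]) <= 2)
    lr, sq = window[n - lr_len:], window[n - sq_len:]
    return (lr_len >= sq_len, count_occ(window, lr) == 2, count_occ(window, sq) == 2)


def lemma1_holds(max_len=7, alphabet="ab"):
    """Check that the three statements agree on every window of length 2..max_len."""
    for n in range(2, max_len + 1):
        for chars in product(alphabet, repeat=n):
            if len(set(lemma1_statements("".join(chars)))) != 1:
                return False
    return True
```

On the window of Fig. 1, `lemma1_statements("babbabaabb")` evaluates all three statements to true.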

Lemma 2

\(| lrSuf _{i,j+1}| \le | lrSuf _{i, j}| + 1\).

Proof

Assume on the contrary that \(| lrSuf _{i,j+1}| > | lrSuf _{i,j}| + 1\). By the definition of \(lrSuf _{i,j+1}\), \(lrSuf _{i,j+1} = T[j+2-| lrSuf _{i,j+1}|\ldots j+1]\) occurs at least twice in \(T[i\ldots j+1]\). Hence, \(T[j+2-| lrSuf _{i,j+1}|\ldots j]\) which is a proper prefix of \(lrSuf _{i,j+1}\) also occurs at least twice in \(T[i\ldots j]\). In addition, \(lrSuf _{i,j} = T[j+1-| lrSuf _{i,j}|\ldots j]\) is a proper suffix of \(T[j+2-| lrSuf _{i,j+1}|\ldots j]\) since \(| lrSuf _{i,j+1}| > | lrSuf _{i,j}| + 1\). However, this contradicts the definition of \(lrSuf _{i,j}\). Therefore, \(| lrSuf _{i,j+1}| \le | lrSuf _{i,j}| + 1\).
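The inequality of Lemma 2 is easy to confirm experimentally. The sketch below is a brute-force check with hypothetical function names; it compares the length of the longest repeating suffix before and after appending one character, for all short windows over a small alphabet.

```python
from itertools import product


def count_occ(text, w):
    """Occurrences of w in text (overlaps allowed); len(text) + 1 for w == ""."""
    return sum(1 for k in range(len(text) - len(w) + 1) if text.startswith(w, k))


def lr_suf_len(window):
    """Length of the longest repeating suffix of a non-empty window."""
    n = len(window)
    return max(k for k in range(n) if count_occ(window, window[n - k:]) >= 2)


def lemma2_holds(max_len=7, alphabet="ab"):
    """Check |lrSuf(window + c)| <= |lrSuf(window)| + 1 on all small inputs."""
    for n in range(1, max_len + 1):
        for chars in product(alphabet, repeat=n):
            window = "".join(chars)
            for c in alphabet:
                if lr_suf_len(window + c) > lr_suf_len(window) + 1:
                    return False
    return True
```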

3.1 Changes to MUSs When Appending a Character to the Right

In this subsection, we consider an operation that slides the right-end of the current window \(T[i\ldots j]\) by one character by appending the next character \(T[j+1]\) to \(T[i\ldots j]\). We use the following observation.

Observation 1

For any non-empty substring s of \(T[i\ldots j]\),

$$\begin{aligned} \# occ _{T[i\ldots j+1]}(s) \le \# occ _{T[i\ldots j]}(s) + 1. \end{aligned}$$

Also, the equality holds if and only if s is a suffix of \(T[i\ldots j+1]\).

3.1.1 MUSs to be Deleted When Appending a Character to the Right

Due to Observation 1, we obtain Lemma 3 which describes MUSs to be deleted when a new character \(T[j+1]\) is appended to the current window \(T[i\ldots j]\).

Lemma 3

For any [s, t] with \(i \le s < t \le j\), \([s, t] \in \mathsf {MUS}(T[i\ldots j])\) and \([s, t] \not \in \mathsf {MUS}(T[i\ldots j+1])\) if and only if \(T[s\ldots t] = sqSuf _{i,j+1}\) and \(\# occ _{T[i\ldots j+1]}( sqSuf _{i,j+1}) = 2\).

Proof

(\(\Rightarrow\)) Let \(w = T[s\ldots t]\). Since \([s, t] \in \mathsf {MUS}(T[i\ldots j])\) and \([s, t] \not \in \mathsf {MUS}(T[i\ldots j+1])\), \(\# occ _{T[i\ldots j]}(w) = 1\) and \(\# occ _{T[i\ldots j+1]}(w) \ge 2\). It follows from Observation 1 that \(\# occ _{T[i\ldots j+1]}(w) = 2\) and w is a suffix of \(T[i\ldots j+1]\). If we assume that w is a proper suffix of \(sqSuf _{i,j+1}\), then \(\# occ _{T[i\ldots j+1]}(w) \ge 3\) by the definition of \(sqSuf _{i,j+1}\), but this contradicts \(\# occ _{T[i\ldots j+1]}(w) = 2\). If we assume that \(sqSuf _{i,j+1}\) is a proper suffix of w, then \(\# occ _{T[i\ldots j]}( sqSuf _{i,j+1}) \ge \# occ _{T[i\ldots j]}(T[s+1\ldots t]) \ge 2\). Also, \(\# occ _{T[i\ldots j+1]}( sqSuf _{i,j+1}) = \# occ _{T[i\ldots j]}( sqSuf _{i,j+1}) + 1 \ge 3\) by Observation 1, but this contradicts the definition of \(sqSuf _{i,j+1}\). Therefore, \(w = sqSuf _{i,j+1}\). Moreover, \(\# occ _{T[i\ldots j+1]}( sqSuf _{i,j+1}) = 2\) since \(w = sqSuf _{i, j+1}\) is a substring of \(T[i\ldots j]\).

(\(\Leftarrow\)) Since \(w = T[s\ldots t]\) is a suffix of \(T[i\ldots j+1]\) and \(\# occ _{T[i\ldots j+1]}(w) = 2\), w is unique in \(T[i\ldots j]\). By the definition of \(sqSuf _{i,j+1}\), a proper suffix \(w[1\ldots ] = T[s+1\ldots t]\) of \(w = sqSuf _{i,j+1}\) occurs at least three times in \(T[i\ldots j+1]\), i.e., \(T[s+1\ldots t]\) is repeating in \(T[i\ldots j]\) (see also Fig. 3 for illustration).

Fig. 3

Illustration for the case where \(\# occ _{T[i\ldots j+1]}( sqSuf _{i,j+1}) = 2\). In this case, \(T[s\ldots t] = sqSuf _{i, j+1}\) is unique in \(T[i\ldots j]\) and \(T[s+1\ldots t]\) is repeating in \(T[i\ldots j]\)

Also, a prefix \(w[0\ldots |w|-2] = T[s\ldots t-1]\) of \(w = sqSuf _{i,j+1}\) is clearly repeating in \(T[i\ldots j]\). Therefore, \(w = T[s\ldots t]\) is a MUS of \(T[i\ldots j]\) and is not a MUS of \(T[i\ldots j+1]\).

By Lemma 3, at most one MUS can be deleted when appending \(T[j+1]\) to the current window \(T[i\ldots j]\), and such a deleted MUS must be \(sqSuf _{i,j+1}\).
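This consequence of Lemma 3 can be observed on small inputs. The following sketch (hypothetical names, naive counting) computes the MUSs that disappear when a character is appended and compares them with \(sqSuf _{i,j+1}\); the check also covers single-character intervals with \(s = t\). Positions are relative to the start of the window.

```python
def count_occ(text, w):
    """Occurrences of w in text (overlaps allowed); len(text) + 1 for w == ""."""
    return sum(1 for k in range(len(text) - len(w) + 1) if text.startswith(w, k))


def mus(text):
    """Inclusive intervals (s, t) of the minimal unique substrings of text."""
    n = len(text)
    return {(s, t) for s in range(n) for t in range(s, n)
            if count_occ(text, text[s:t + 1]) == 1
            and count_occ(text, text[s + 1:t + 1]) >= 2
            and count_occ(text, text[s:t]) >= 2}


def sq_suf(text):
    """Shortest non-empty suffix of a non-empty text that occurs at most twice."""
    n = len(text)
    k = min(k for k in range(1, n + 1) if count_occ(text, text[n - k:]) <= 2)
    return text[n - k:]


def deleted_by_append(window, c):
    """MUSs of window that are no longer MUSs of window + c."""
    return mus(window) - mus(window + c)


def lemma3_holds(window, c):
    """A MUS is deleted exactly when sqSuf of window + c occurs twice, and it equals sqSuf."""
    new = window + c
    sq = sq_suf(new)
    deleted = deleted_by_append(window, c)
    if count_occ(new, sq) == 2:
        return len(deleted) == 1 and all(new[s:t + 1] == sq for s, t in deleted)
    return not deleted
```

For instance, appending `b` to `aabcc` deletes only the MUS `b` at position 2.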

3.1.2 MUSs to be Added When Appending a Character to the Right

First, we consider a MUS to be added when appending \(T[j+1]\) to \(T[i\ldots j]\), which is a suffix of \(T[i\ldots j+1]\). The next observation follows from the definition of \(lrSuf _{i,j}\):

Observation 2

If \([s, j] \in \mathsf {MUS}(T[i\ldots j])\), then \(s = j - | lrSuf _{i,j}|\). Namely, if there is a MUS of \(T[i\ldots j]\) that is a suffix of \(T[i\ldots j]\), then it must be the suffix of \(T[i\ldots j]\) that is exactly one character longer than \(lrSuf _{i,j}\).

Lemma 4

The interval \([j+1 - \ell , j+1] \in \mathsf {MUS}(T[i\ldots j+1])\) if and only if \(T[j+1 - \ell \ldots j+1] = \alpha ^{\ell +1}\) or \(\ell \le | lrSuf _{i,j}|\), where \(\ell = | lrSuf _{i,j+1}|\) and \(\alpha = T[j+1]\).

Proof

(\(\Rightarrow\)) Assume on the contrary that \(T[j+1 - \ell \ldots j+1] \ne \alpha ^{\ell +1}\) and \(\ell > | lrSuf _{i, j}|\). By the assumptions and Lemma 2, \(| lrSuf _{i, j}| = \ell - 1\), and thus, \(T[j-| lrSuf _{i, j}|\ldots j] = T[j+1-\ell \ldots j]\). Since \(T[j+1-\ell \ldots j+1]\) is a MUS of \(T[i\ldots j+1]\), \(T[j+1-\ell \ldots j] = T[j-| lrSuf _{i, j}|\ldots j]\) occurs at least twice in \(T[i\ldots j+1]\). On the other hand, \(T[j-| lrSuf _{i,j}|\ldots j]\) is unique in \(T[i\ldots j]\) by the definition of \(lrSuf _{i, j}\), hence \(T[j-| lrSuf _{i, j}|\ldots j]\) occurs in \(T[i\ldots j+1]\) as a suffix of \(T[i\ldots j+1]\). Consequently, we have \(T[j-| lrSuf _{i, j}|\ldots j] = T[j+1-| lrSuf _{i, j}|\ldots j+1]\), i.e., \(T[j+1-\ell \ldots j] = T[j+2-\ell \ldots j+1]\), which implies \(T[j+1-\ell \ldots j+1] = \alpha ^{\ell +1}\) with \(\alpha = T[j+1]\), a contradiction.

(\(\Leftarrow\)) By the definition of \(lrSuf _{i, j+1}\), \(T[j+2-\ell \ldots j+1] = lrSuf _{i,j+1}\) is repeating in \(T[i\ldots j+1]\), and \(T[j+1-\ell \ldots j+1]\) is unique in \(T[i\ldots j+1]\). Now it suffices to show \(T[j+1-\ell \ldots j]\) is repeating in \(T[i\ldots j+1]\). If \(T[j+1-\ell \ldots j+1] = \alpha ^{\ell +1}\), then clearly \(T[j+1-\ell \ldots j] = \alpha ^\ell\) is repeating in \(T[i\ldots j+1]\). If \(\ell \le | lrSuf _{i, j}|\), then \(T[j+1-\ell \ldots j]\) is a suffix of \(T[j+1 - | lrSuf _{i, j}|\ldots j]\) (see Fig. 4).

Fig. 4

Illustration for the case where \(| lrSuf _{i, j+1}| \le | lrSuf _{i, j}|\). In this case, \(T[j+1-| lrSuf _{i, j+1}|\ldots j+1]\) is a MUS of \(T[i\ldots j+1]\)

Thus \(\# occ _{T[i\ldots j+1]}(T[j+1-\ell \ldots j]) \ge \# occ _{T[i\ldots j]}(T[j+1-\ell \ldots j]) \ge \# occ _{T[i\ldots j]}(T[j+1-| lrSuf _{i, j}|\ldots j]) \ge 2\).
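The characterization in Lemma 4 can be tested mechanically. In the sketch below (hypothetical names, naive counting), `last` plays the role of \(j+1\) relative to the start of the window, and `lemma4_holds` compares the two sides of the equivalence for one window and one appended character.

```python
def count_occ(text, w):
    """Occurrences of w in text (overlaps allowed); len(text) + 1 for w == ""."""
    return sum(1 for k in range(len(text) - len(w) + 1) if text.startswith(w, k))


def mus(text):
    """Inclusive intervals (s, t) of the minimal unique substrings of text."""
    n = len(text)
    return {(s, t) for s in range(n) for t in range(s, n)
            if count_occ(text, text[s:t + 1]) == 1
            and count_occ(text, text[s + 1:t + 1]) >= 2
            and count_occ(text, text[s:t]) >= 2}


def lr_suf_len(window):
    """Length of the longest repeating suffix of a non-empty window."""
    n = len(window)
    return max(k for k in range(n) if count_occ(window, window[n - k:]) >= 2)


def lemma4_holds(window, c):
    """Compare both sides of Lemma 4 for a non-empty window and appended character c."""
    new = window + c
    ell = lr_suf_len(new)
    last = len(new) - 1  # index of the appended character
    is_mus = (last - ell, last) in mus(new)
    condition = new[last - ell:] == c * (ell + 1) or ell <= lr_suf_len(window)
    return is_mus == condition
```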

Next, we consider MUSs to be added when appending \(T[j+1]\) to \(T[i\ldots j]\), which are not suffixes of \(T[i\ldots j+1]\).

Lemma 5

For each \([s, t] \in \mathsf {MUS}(T[i\ldots j+1])\) with \(t \ne j+1\), if \([s, t] \not \in \mathsf {MUS}(T[i\ldots j])\) then \(\# occ _{T[i\ldots j+1]}( sqSuf _{i,j+1}) = 2\) and \(sqSuf _{i, j+1}\) is a proper substring of \(T[s\ldots t]\).

Proof

Since \([s, t] \in \mathsf {MUS}(T[i\ldots j+1])\) and \(t \ne j+1\), \(T[s\ldots t]\) is unique in \(T[i\ldots j]\). Moreover, since \(T[s\ldots t]\) is not a MUS of \(T[i\ldots j]\), there exists a MUS u of \(T[i\ldots j]\) which is a proper substring of \(T[s\ldots t]\). Since \(T[s\ldots t]\) is a MUS of \(T[i\ldots j+1]\), u is repeating in \(T[i\ldots j+1]\). Then, it follows from Lemma 3 that \(u = sqSuf _{i, j+1}\) and u occurs exactly twice in \(T[i\ldots j+1]\).

Namely, a MUS which is not a suffix is added by appending one character only if there is a MUS to be deleted by the same operation. Moreover, such added MUSs must contain the deleted MUS.

Lemma 6

If \(\# occ _{T[i\ldots j+1]}( sqSuf _{i, j+1}) = 2\), then there are three integers \(p_l, p_s, q\) such that \(i \le p_l \le p_s \le q < j+1\), \(T[p_s\ldots q] = sqSuf _{i, j+1}\) and \(T[p_l\ldots q] = lrSuf _{i, j+1}\). Also, the following propositions hold:

  (a) If there is no MUS of \(T[i\ldots j]\) ending at \(q+1\), then \([p_s, q+1] \in \mathsf {MUS}(T[i\ldots j+1])\).

  (b) If there is no MUS of \(T[i\ldots j]\) starting at \(p_l-1\) and \(p_l \ge i+1\), then \([p_l-1, q] \in \mathsf {MUS}(T[i\ldots j+1])\).

Proof

Since \(\# occ _{T[i\ldots j+1]}( sqSuf _{i,j+1}) = 2\), it follows from Lemma 1 that \(\# occ _{T[i\ldots j+1]}( lrSuf _{i, j+1}) = 2\) and \(sqSuf _{i, j+1}\) is a suffix of \(lrSuf _{i, j+1}\). Hence, the ending positions of the occurrences of \(sqSuf _{i, j+1}\) in \(T[i\ldots j]\) and that of \(lrSuf _{i, j+1}\) in \(T[i\ldots j]\) are the same (see Fig. 5). Hence, there exist indices \(p_s\), \(p_l\), and q such that \(T[p_s\ldots q] = sqSuf _{i, j+1}\) and \(T[p_l\ldots q] = lrSuf _{i, j+1}\).

Fig. 5

Illustration of the situation when \(sqSuf _{i, j+1}\) is repeating in \(T[i\ldots j+1]\). In this situation, \([p_l-1, q]\) and \([p_s, q+1]\) are the only candidates for MUSs in \(\mathsf {MUS}(T[i\ldots j+1]) {\setminus } \mathsf {MUS}(T[i\ldots j])\) each of which is not a suffix of \(T[i\ldots j+1]\)

Next, we consider MUSs to be added.

  (a) First, for the sake of contradiction, assume that \(T[p_s\ldots q+1]\) is repeating in \(T[i\ldots j+1]\). By the definition, \(T[p_s\ldots q] = sqSuf _{i, j+1}\) occurs in \(T[i\ldots j+1]\) as a suffix. Also, \(T[p_s\ldots q]\) occurs at least twice in \(T[i\ldots j+1]\) as a proper prefix of \(T[p_s\ldots q+1]\). These imply that \(\# occ _{T[i\ldots j+1]}(T[p_s\ldots q]) \ge 3\); however, this contradicts the definition of \(sqSuf _{i,j+1}~(= T[p_s\ldots q])\). Hence, \(T[p_s\ldots q+1]\) is unique in \(T[i\ldots j+1]\). Next, \(T[p_s\ldots q]\) is repeating in \(T[i\ldots j+1]\) by the assumption. Further, by Lemma 3, \(T[p_s\ldots q] = sqSuf _{i,j+1}\) is a MUS of \(T[i\ldots j]\) since \(\# occ _{T[i\ldots j+1]}( sqSuf _{i, j+1}) = 2\). Thus, \(T[p_s+1\ldots q]\) is repeating in \(T[i\ldots j]\). Finally, for the sake of contradiction, assume that \(T[p_s+1\ldots q+1]\) is unique in \(T[i\ldots j]\). Let u be a MUS of \(T[i\ldots j]\) which is a substring of \(T[p_s+1\ldots q+1]\). Since \(T[p_s+1\ldots q]\) is repeating in \(T[i\ldots j]\), the ending position of u must be \(q+1\). This contradicts the assumption that there is no MUS of \(T[i\ldots j]\) ending at \(q+1\). Thus, \(T[p_s+1\ldots q+1]\) is repeating in \(T[i\ldots j]\), as well as in \(T[i\ldots j+1]\). Therefore, \(T[p_s\ldots q+1]\) is a MUS of \(T[i\ldots j+1]\).

  (b) First, for the sake of contradiction, assume that \(T[p_l-1\ldots q]\) is repeating in \(T[i\ldots j+1]\). From the discussion at the beginning of the proof, the starting positions of the occurrences of \(lrSuf _{i,j+1}\) are \(p_l\) and \(j+2-| lrSuf _{i,j+1}|\) (see also Fig. 5). Since \(lrSuf _{i,j+1}\) is a proper suffix of \(T[p_l-1\ldots q]\) and \(T[p_l-1\ldots q]\) is repeating, the starting positions of the occurrences of \(T[p_l-1\ldots q]\) are \(p_l-1\) and \(j+1-| lrSuf _{i,j+1}|\). Then, \(T[j+1-| lrSuf _{i,j+1}|\ldots j+1]\) of length \(| lrSuf _{i,j+1}|+1\) is a repeating suffix of \(T[i\ldots j+1]\); however, this contradicts the definition of \(lrSuf _{i, j+1}\). Thus, \(T[p_l-1\ldots q]\) is unique in \(T[i\ldots j+1]\). Also, by the definition, \(T[p_l\ldots q] = lrSuf _{i,j+1}\) is repeating in \(T[i\ldots j+1]\). Finally, for the sake of contradiction, assume that \(T[p_l-1\ldots q-1]\) is unique in \(T[i\ldots j]\). Let v be a MUS of \(T[i\ldots j]\) which is a substring of \(T[p_l-1\ldots q-1]\). Since \(T[p_l\ldots q-1]\) is repeating in \(T[i\ldots j]\), the starting position of v must be \(p_l-1\). This contradicts the assumption that there is no MUS of \(T[i\ldots j]\) starting at \(p_l-1\). Thus, \(T[p_l-1\ldots q-1]\) is repeating in \(T[i\ldots j]\), as well as in \(T[i\ldots j+1]\). Therefore, \(T[p_l-1\ldots q]\) is a MUS of \(T[i\ldots j+1]\).
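The two candidate intervals of Lemma 6 can be located and tested by brute force. The sketch below is illustrative only: all names are hypothetical, positions are relative to the start of the window (so the condition \(p_l \ge i+1\) becomes `p_l >= 1`), and occurrences are counted naively.

```python
def count_occ(text, w):
    """Occurrences of w in text (overlaps allowed); len(text) + 1 for w == ""."""
    return sum(1 for k in range(len(text) - len(w) + 1) if text.startswith(w, k))


def mus(text):
    """Inclusive intervals (s, t) of the minimal unique substrings of text."""
    n = len(text)
    return {(s, t) for s in range(n) for t in range(s, n)
            if count_occ(text, text[s:t + 1]) == 1
            and count_occ(text, text[s + 1:t + 1]) >= 2
            and count_occ(text, text[s:t]) >= 2}


def lemma6_holds(window, c):
    """Check propositions (a) and (b) of Lemma 6 for a non-empty window and character c."""
    new = window + c
    n = len(new)
    lr_len = max(k for k in range(n) if count_occ(new, new[n - k:]) >= 2)
    sq_len = min(k for k in range(1, n + 1) if count_occ(new, new[n - k:]) <= 2)
    sq = new[n - sq_len:]
    if count_occ(new, sq) != 2:
        return True  # the premise of the lemma does not apply
    p_s = new.find(sq)      # the occurrence of sqSuf that is not the suffix
    q = p_s + sq_len - 1
    p_l = q - lr_len + 1    # lrSuf ends at the same position q
    old, cur = mus(window), mus(new)
    ok = True
    if all(t != q + 1 for _, t in old):                 # proposition (a)
        ok = ok and (p_s, q + 1) in cur
    if p_l >= 1 and all(s != p_l - 1 for s, _ in old):  # proposition (b)
        ok = ok and (p_l - 1, q) in cur
    return ok
```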

Now we have the main result of this subsection:

Theorem 1

For any \(0 \le i \le j < n-1\), \(|\mathsf {MUS}(T[i\ldots j+1]) \bigtriangleup \mathsf {MUS}(T[i\ldots j])| \le 4\) and \(-1 \le |\mathsf {MUS}(T[i\ldots j+1])| - |\mathsf {MUS}(T[i\ldots j])| \le 2\). Furthermore, these bounds are tight for any \(\sigma , i, j\) with \(\sigma \ge 3\), \(0 \le i \le j < n-1\), and \(j-i+1 \ge 5\).

Proof

First, we show that \(|\mathsf {MUS}(T[i\ldots j+1]) \bigtriangleup \mathsf {MUS}(T[i\ldots j])| \le 4\). By Lemma 3, \(|\mathsf {MUS}(T[i\ldots j]) {\setminus } \mathsf {MUS}(T[i\ldots j+1])| \le 1\). By Observation 2 and Lemma 6, \(|\mathsf {MUS}(T[i\ldots j+1]) {\setminus } \mathsf {MUS}(T[i\ldots j])| \le 3\). Thus, \(|\mathsf {MUS}(T[i\ldots j+1]) \bigtriangleup \mathsf {MUS}(T[i\ldots j])| = |\mathsf {MUS}(T[i\ldots j+1]) {\setminus } \mathsf {MUS}(T[i\ldots j])| + |\mathsf {MUS}(T[i\ldots j]) {\setminus } \mathsf {MUS}(T[i\ldots j+1])| \le 4\). Also, we show that the upper bound is tight if \(\sigma \ge 3\). For an integer \(k \ge 2\), we consider two strings u and \(u'\) such that \(u = \mathtt {a}^k\mathtt {b}\mathtt {c}\mathtt {c}\) of length \(k + 3 \ge 5\) and \(u' = u\mathtt {b} = \mathtt {a}^k\mathtt {b}\mathtt {c}\mathtt {c}\mathtt {b}\) of length \(k + 4 \ge 6\). Then, \(\mathsf {MUS}(u) = \{[0, k-1], [k, k], [k+1, k+2]\}\) and \(\mathsf {MUS}(u') = \{[0, k-1], [k-1, k], [k, k+1], [k+1, k+2], [k+2, k+3]\}\). Therefore, \(|\mathsf {MUS}(u') \bigtriangleup \mathsf {MUS}(u)| = 4\).

Next, we show that \(-1 \le |\mathsf {MUS}(T[i\ldots j+1])| - |\mathsf {MUS}(T[i\ldots j])| \le 2\). By Lemma 3, it is clear that \(-1 \le |\mathsf {MUS}(T[i\ldots j+1])| - |\mathsf {MUS}(T[i\ldots j])|\). By Observation 2, the number of added MUSs which are suffixes of \(T[i\ldots j+1]\) is at most one. Also, by Lemma 6, the number of added MUSs which are not suffixes of \(T[i\ldots j+1]\) is at most two; however, if such an added MUS exists, exactly one MUS (\(= sqSuf _{i, j+1}\)) must be deleted (cf. Lemmas 3, 5). Therefore, \(|\mathsf {MUS}(T[i\ldots j+1])| - |\mathsf {MUS}(T[i\ldots j])| \le 2\). Also, we show that each bound is tight when \(\sigma \ge 3\). We consider the strings u and \(u'\) described above, and we then obtain \(|\mathsf {MUS}(u')| - |\mathsf {MUS}(u)| = 2\). On the other hand, for any integer \(\ell\) with \(\ell \ge 1\), we consider two strings v and \(v'\); \(v = \mathtt {a}^\ell \mathtt {b}\mathtt {c}\mathtt {a}\mathtt {c}\) of length \(\ell +4 \ge 5\) and \(v' = v\mathtt {a} = \mathtt {a}^\ell \mathtt {b}\mathtt {c}\mathtt {a}\mathtt {c}\mathtt {a}\) of length \(\ell +5\ge 6\). If \(\ell = 1\), then \(\mathsf {MUS}(v) = \{[1, 1], [2, 3], [3, 4]\}\), and \(\mathsf {MUS}(v') = \{[1, 1], [3, 4]\}\). If \(\ell \ge 2\), then \(\mathsf {MUS}(v) = \{[0, \ell -1], [\ell , \ell ], [\ell +1, \ell +2], [\ell +2, \ell +3]\}\), and \(\mathsf {MUS}(v') = \{[0, \ell -1], [\ell , \ell ], [\ell +2, \ell +3]\}\). Therefore, \(|\mathsf {MUS}(v')| - |\mathsf {MUS}(v)| = -1\).
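The bounds of Theorem 1 and the tight examples from the proof can be reproduced with a few lines of brute-force code. The sketch below uses hypothetical names and a naive MUS computation; `change_on_append` returns the size of the symmetric difference together with the change in the number of MUSs.

```python
def count_occ(text, w):
    """Occurrences of w in text (overlaps allowed); len(text) + 1 for w == ""."""
    return sum(1 for k in range(len(text) - len(w) + 1) if text.startswith(w, k))


def mus(text):
    """Inclusive intervals (s, t) of the minimal unique substrings of text."""
    n = len(text)
    return {(s, t) for s in range(n) for t in range(s, n)
            if count_occ(text, text[s:t + 1]) == 1
            and count_occ(text, text[s + 1:t + 1]) >= 2
            and count_occ(text, text[s:t]) >= 2}


def change_on_append(window, c):
    """Return (|MUS(new) symmetric-difference MUS(old)|, |MUS(new)| - |MUS(old)|)."""
    old, new = mus(window), mus(window + c)
    return len(old ^ new), len(new) - len(old)


k, ell = 2, 1
u = "a" * k + "bcc"      # u' = u + "b" attains the upper bounds
v = "a" * ell + "bcac"   # v' = v + "a" attains the lower bound -1
upper_example = change_on_append(u, "b")
lower_example = change_on_append(v, "a")
```

With \(k = 2\), the first example yields a symmetric difference of size 4 and an increase of 2; with \(\ell = 1\), the second yields a decrease of 1.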

3.2 Changes to MUSs When Deleting the Leftmost Character

In this subsection, we consider an operation that deletes the leftmost character \(T[i-1]\) from \(T[i-1\ldots j]\). Basically, we can use symmetric arguments to the previous subsection where we considered appending a character to the right of the window.

Observation 3

For each non-empty substring s of \(T[i-1\ldots j]\), \(\# occ _{T[i-1\ldots j]}(s) \le \# occ _{T[i\ldots j]}(s) + 1\). Also, \(\# occ _{T[i-1\ldots j]}(s) = \# occ _{T[i\ldots j]}(s) + 1\) if and only if s is a prefix of \(T[i-1\ldots j]\).

3.2.1 MUSs to be Added When Deleting the Leftmost Character

Lemma 7

For any \(i \le s \le t \le j\), \([s, t] \not \in \mathsf {MUS}(T[i-1\ldots j])\) and \([s, t] \in \mathsf {MUS}(T[i\ldots j])\) if and only if \(T[s\ldots t] = sqPref _{i-1,j}\) and \(\# occ _{T[i-1\ldots j]}( sqPref _{i-1,j}) = 2\).

Proof

Symmetric to the proof of Lemma 3.

3.2.2 MUSs to be Deleted When Deleting the Leftmost Character

Next, we consider MUSs to be deleted by removing \(T[i-1]\) from \(T[i-1\ldots j]\). If there is a MUS w of \(T[i-1\ldots j]\) which is a prefix of \(T[i-1\ldots j]\), clearly, w is not a MUS of \(T[i\ldots j]\). Then, we consider MUSs to be deleted which are not prefixes of \(T[i-1\ldots j]\).

Lemma 8

For each \([s, t] \in \mathsf {MUS}(T[i-1\ldots j])\) with \(s \ne i-1\), if \([s, t] \not \in \mathsf {MUS}(T[i\ldots j])\) then \(\# occ _{T[i-1\ldots j]}( sqPref _{i-1,j}) = 2\) and \(sqPref _{i-1,j}\) is a proper substring of \(T[s\ldots t]\).

Proof

Symmetric to the proof of Lemma 5.

Namely, when deleting the leftmost character, a MUS which is not a prefix is deleted only if an added MUS exists. Moreover, such deleted MUSs must contain the added MUS.

Lemma 9

If \(\# occ _{T[i-1\ldots j]}( sqPref _{i-1,j}) = 2\), then the following propositions hold:

  (a) If there is a MUS w starting at s in \(T[i-1\ldots j]\), then w is not a MUS of \(T[i\ldots j]\).

  (b) If there is a MUS \(w'\) ending at t in \(T[i-1\ldots j]\), then \(w'\) is not a MUS of \(T[i\ldots j]\).

Here, \(T[s\ldots t] = sqPref _{i-1,j}\) and \(s \ne i-1\).

Proof

Symmetric to the proof of Lemma 6. See also Fig. 6 for illustration.

Fig. 6

Illustration of the situation when \(sqPref _{i-1,j}\) is repeating in \(T[i-1\ldots j]\). In this situation, \(T[s\ldots t] = sqPref _{i-1,j}\) is a new MUS of \(T[i\ldots j]\) by Lemma 7

The main result of this subsection is the following:

Theorem 2

For any \(0 < i \le j \le n-1\), \(|\mathsf {MUS}(T[i-1\ldots j]) \bigtriangleup \mathsf {MUS}(T[i\ldots j])| \le 4\) and \(-1 \le |\mathsf {MUS}(T[i-1\ldots j])| - |\mathsf {MUS}(T[i\ldots j])| \le 2\). Furthermore, these bounds are tight for any \(\sigma , i, j\) with \(\sigma \ge 3\), \(0 < i \le j \le n-1\), and \(j-i+1 \ge 5\).

Proof

Symmetric to the proof of Theorem 1.

The next corollary is immediate from Theorems 1 and 2.

Corollary 1

Let \(0< d < n\). For every i with \(0 \le i \le n-d-1\), \(|\mathsf {MUS}(T[i\ldots i+d-1]) \bigtriangleup \mathsf {MUS}(T[i+1\ldots i+d])| \in O(1)\).
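To make the corollary concrete, the following naive sketch recomputes the MUSs of every window from scratch and measures how much consecutive sets differ. The names are hypothetical, intervals are reported in absolute positions of T, and the constant 8 used in the accompanying check is simply the sum of the two bounds of Theorems 1 and 2 (one append followed by one deletion), not a tight bound claimed by the paper.

```python
def count_occ(text, w):
    """Occurrences of w in text (overlaps allowed); len(text) + 1 for w == ""."""
    return sum(1 for k in range(len(text) - len(w) + 1) if text.startswith(w, k))


def mus(text):
    """Inclusive intervals (s, t) of the minimal unique substrings of text."""
    n = len(text)
    return {(s, t) for s in range(n) for t in range(s, n)
            if count_occ(text, text[s:t + 1]) == 1
            and count_occ(text, text[s + 1:t + 1]) >= 2
            and count_occ(text, text[s:t]) >= 2}


def sliding_window_mus(T, d):
    """Yield (i, MUS set of T[i..i+d-1] in absolute positions) for i = 0..n-d."""
    for i in range(len(T) - d + 1):
        yield i, {(s + i, t + i) for (s, t) in mus(T[i:i + d])}


def max_change(T, d):
    """Largest symmetric difference between the MUS sets of consecutive windows."""
    sets = [m for _, m in sliding_window_mus(T, d)]
    return max((len(a ^ b) for a, b in zip(sets, sets[1:])), default=0)
```

This baseline recomputes each window independently; the point of the algorithm in Sect. 4 is to avoid exactly this recomputation.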

4 Algorithm for Computing MUSs for a Sliding Window

This section presents our algorithm for computing MUSs for a sliding window.

4.1 Updating Suffix Tree and Its Three Loci

First, we introduce some additional notions. Since we use Ukkonen’s algorithm [23] for updating the suffix tree when a new character \(T[j+1]\) is appended to the right end of the window \(T[i\ldots j]\), we maintain the locus for \(lrSuf _{i, j}\) as in [23]. Also, in order to compute the changes of MUSs, we use \(sqSuf _{i,j}\) (cf. Lemmas 3–6). Thus, we also maintain the locus for \(sqSuf _{i, j}\).

The locus for \(lrSuf _{i,j}\) (resp. \(sqSuf _{i,j}\)) in \(\mathsf {STree}(T[i\ldots j])\) is called the primary active point (resp. the secondary active point) and is denoted by \(\mathsf {pp}_{i,j}\) (resp. \(\mathsf {sp}_{i,j}\)). Additionally, in order to maintain \(\mathsf {sp}_{i,j}\) efficiently, we also maintain the locus for the longest suffix of \(T[i\ldots j]\) which occurs at least three times in \(T[i\ldots j]\). We call this locus the tertiary active point and denote it by \(\mathsf {tp}_{i,j}\). See Fig. 2 for concrete examples of these three loci in a suffix tree.

4.1.1 Appending One Character

When \(T[i\ldots j]\) is the empty string (the base case, where \(i = 0\) and \(j = -1\)), we set all three active points to \(\langle { root },{0}\rangle\). Then we increase j, and the suffix tree grows in an online manner until \(j = d - 1\) using Ukkonen’s algorithm. Then, for each \(j > d-1\), we also increase i each time j increases, so that the sliding window is shifted to the right, by using the sliding window algorithm for the suffix tree [21].

When \(T[j+1]\) is appended to the right end of \(T[i\ldots j]\), we first update the suffix tree to \(\mathsf {STree}(T[i\ldots j+1])\) and compute \(\mathsf {pp}_{i, j+1}\). Since \(\mathsf {pp}_{i, j+1}\) coincides with the active point, \(\mathsf {pp}_{i, j+1}\) can be found in amortized \(O(\log \sigma ')\) time [21].

After updating the suffix tree, we can compute \(\mathsf {tp}_{i, j+1}\) and \(\mathsf {sp}_{i, j+1}\) as follows:

  • Traverse character \(T[j+1]\) from \(\mathsf {tp}_{i,j}\), and set \(w \leftarrow str (\mathsf {tp}_{i,j})T[j+1]\).

  • While \(\# occ _{T[i\ldots j+1]}(w) < 3\), set \(w \leftarrow w[1\ldots ]\) and search for the locus \(\mathsf {p}\) for w by using suffix links in \(\mathsf {STree}(T[i\ldots j+1])\).

  • After breaking from the while-loop, obtain \(\mathsf {tp}_{i,j+1} = \mathsf {p}\).

  • \(\mathsf {sp}_{i,j+1}\) equals the locus stored in \(\mathsf {p}\) at the penultimate iteration of the while-loop.

Let us show the correctness of the above algorithm. After the first step, w is the longest suffix which possibly corresponds to \(\mathsf {tp}_{i, j+1}\). In the while loop of the second step, we search for the suffix corresponding to \(\mathsf {tp}_{i, j+1}\) by deleting the first characters from w one-by-one. After breaking from the while-loop, we store in w the longest suffix of \(T[i\ldots j+1]\) which occurs more than twice in \(T[i\ldots j+1]\), i.e., \(\mathsf {tp}_{i,j+1} = locus (w)\). Also, by the definitions of \(\mathsf {sp}\) and \(\mathsf {tp}\), \(\mathsf {sp}_{i,j+1}\) is the locus for the suffix of \(T[i\ldots j+1]\) which is one character longer than \(w = str (\mathsf {tp}_{i,j+1})\).
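The four steps above can be mimicked without a suffix tree. The sketch below is a naive stand-in rather than the algorithm itself: it replaces loci by suffix lengths, the suffix-link walk by dropping one leading character at a time, and the constant-time occurrence test by direct counting. The function names are hypothetical, and only the appending direction (fixed left end) is modelled.

```python
def count_occ(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    m = len(pattern)
    return sum(1 for k in range(len(text) - m + 1) if text[k:k + m] == pattern)


def update_tp_sp(window, prev_tp_len):
    """window plays T[i..j+1]; prev_tp_len is |str(tp_{i,j})|.

    Returns (tp_len, sp_len): the lengths of the suffixes of window that
    correspond to tp_{i,j+1} and sp_{i,j+1}.
    """
    # Step 1: extend the previous candidate by the new character T[j+1].
    length = prev_tp_len + 1
    # Step 2: drop leading characters while the candidate occurs fewer than
    # three times (this stands in for following suffix links).
    while length > 0 and count_occ(window, window[len(window) - length:]) < 3:
        length -= 1
    # Step 3: the surviving candidate is the longest suffix with >= 3 occurrences.
    tp_len = length
    # Step 4: sp corresponds to the suffix that is one character longer.
    sp_len = tp_len + 1
    return tp_len, sp_len


def longest_suffix_three_times(window):
    """Direct, non-incremental reference used only for comparison."""
    for length in range(len(window), 0, -1):
        if count_occ(window, window[len(window) - length:]) >= 3:
            return length
    return 0


T = "aabbabbab"  # the string of Fig. 7
tp_len = 0
for end in range(len(T)):
    tp_len, sp_len = update_tp_sp(T[:end + 1], tp_len)
print(tp_len, sp_len)
```

Starting from `prev_tp_len + 1` is safe because a suffix of \(T[i\ldots j+1]\) that occurs at least three times yields, after removing its last character, a suffix of \(T[i\ldots j]\) that also occurs at least three times.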

As described in the above algorithm, we can locate \(\mathsf {tp}_{i,j+1}\) using suffix links, in a similar manner to the active point \(\mathsf {pp}_{i,j+1}\). Thus, the time cost for locating \(\mathsf {tp}_{i,j+1}\) for each increasing j is amortized \(O(\log \sigma ')\), again by a similar argument to the active point \(\mathsf {pp}_{i,j+1}\). What remains is how to quickly determine, for each candidate w for \(\mathsf {tp}_{i,j+1}\), whether \(\# occ _{T[i\ldots j+1]}(w) < 3\) or not. In what follows, we show that it can be checked in O(1) time for each candidate.

Observation 4

For each suffix s of string \(T[i\ldots j+1]\), let \(locus (s) = \langle {u},{h}\rangle\).

Case 1:

If u is an internal node, s occurs at least three times in \(T[i\ldots j+1]\).

Case 2:

If u is a leaf and \(h = 0\), s occurs exactly once in \(T[i\ldots j+1]\).

Case 3:

If u is a leaf and \(h \ne 0\),

Case 3.1:

if there is a suffix \(s'\) of \(T[i\ldots j+1]\) with \(hed (s') = hed (s)\) which is longer than s, s occurs at least three times in \(T[i\ldots j+1]\) (see Fig. 7 for examples).

Case 3.2:

otherwise, s occurs exactly twice in \(T[i\ldots j+1]\).

Fig. 7

The suffix tree of string \(T = \mathtt {aabbabbab}\) as an example of the Case 3.1 in Observation 4. Black circles represent implicit suffix nodes. For two suffixes \(s = \mathtt {ab}\) and \(s' = \mathtt {abbab}\) of T, \(hed (s') = hed (s)\) and s occurs three times in T

For any suffix s of \(T[i\ldots j+1]\), if we are given \(locus (s) = \langle {u},{h}\rangle\), then we can obviously determine in constant time whether s occurs at least three times in \(T[i\ldots j+1]\) or not, except in Case 3. The next lemma allows us to determine it in constant time in Case 3 as well.

Lemma 10

Suppose the locus \(\mathsf {pp}_{i, j+1}\) in \(\mathsf {STree}(T[i\ldots j+1])\) is already computed. Given a leaf \(\ell\) of \(\mathsf {STree}(T[i\ldots j+1])\), it can be determined in O(1) time whether there is an implicit suffix node on the edge \(( parent (\ell ), \ell )\) and if so, the locus of the lowest implicit suffix node on \(( parent (\ell ), \ell )\) can be computed in O(1) time.

Proof

By Observation 4, for each leaf \(\ell\), the suffix corresponding to the lowest implicit suffix node on \(( parent (\ell ), \ell )\) occurs exactly twice in \(T[i\ldots j+1]\) if such an implicit suffix node exists. Let \(x = lrSuf _{i,j+1}\) and \(\mathsf {pp}_{i,j+1} = \langle {u},{h}\rangle\).

If u is not a leaf, there is no implicit suffix node on the edge \(( parent (\ell ), \ell )\) for any leaf \(\ell\), since every suffix of \(T[i\ldots j+1]\) which is shorter than x occurs more than twice in \(T[i\ldots j+1]\).

Fig. 8

An example for Lemma 10. In this figure, each of u and \(\ell\) is a leaf with \(t_{\ell } = start (\ell ) \ge s = start (u)\) and \(| lrSuf _{i,j+1}| - (t_{\ell } - s) > depth ( parent (\ell ))\), where \(\langle {u},{h}\rangle\) represents the primary active point. Black nodes represent implicit suffix nodes

If u is a leaf, then \(\# occ _{T[i\ldots j+1]}(x) = 2\). Let \(s = start (u)\) and \(t_{\ell } = start (\ell )\) for each leaf \(\ell\). Notice that x is a border of \(T[s\ldots j+1]\). There are two sub-cases:

  • First, we consider the case where \(t_{\ell } < s\). Suppose that there is an implicit suffix node on \(( parent (\ell ), \ell )\) for the sake of contradiction. Let w be a string corresponding to the lowest implicit suffix node on \(( parent (\ell ), \ell )\). Then, w is a proper suffix of x, and occurs exactly twice in \(T[i\ldots j+1]\). Furthermore, w occurs exactly twice in \(T[s\ldots j+1]\) since x is a border of \(T[s\ldots j+1]\). However, w is also a prefix of \(T[t_{\ell }\ldots j+1]\), hence w occurs at least three times in \(T[i\ldots j+1]\), which is a contradiction. Thus, if \(t_{\ell } < s\), there is no implicit suffix node on \(( parent (\ell ), \ell )\).

  • Second, we consider the case where \(t_{\ell } \ge s\) (see Fig. 8). In this case, \(T[t_{\ell }\ldots s+|x|-1]\) which is a prefix of \(T[t_\ell \ldots j+1]\) matches the suffix of x which is \(t_\ell -s\) characters shorter than x, i.e., \(x[t_\ell -s\ldots ]\). Thus, there is an implicit suffix node on \(( parent (\ell ), \ell )\) if and only if \(|T[t_{\ell }\ldots s+|x|-1]| = |x|-(t_\ell -s) > depth ( parent (\ell ))\). Also, if there is an implicit suffix node on \(( parent (\ell ), \ell )\), the locus of the lowest one is \(\langle {\ell },{h}\rangle\).
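As a hand-worked instance of the second sub-case (an illustration added here, assuming 0-based positions, taking the whole string of Fig. 7 as the window, and reading h in \(\langle {u},{h}\rangle\) as \(depth (u) - |x|\)), let \(T[i\ldots j+1] = \mathtt {aabbabbab}\). The longest repeating suffix is \(x = \mathtt {abbab}\), occurring at positions 1 and 4, so u is the leaf with \(s = start (u) = 1\) and \(h = depth (u) - |x| = 8 - 5 = 3\). Take the leaf \(\ell\) with \(t_{\ell } = 2\), i.e., the suffix \(\mathtt {bbabbab}\); its parent is the branching node for \(\mathtt {b}\), of depth 1. Then

$$\begin{aligned} |x| - (t_{\ell } - s) = 5 - (2 - 1) = 4 > 1 = depth ( parent (\ell )), \end{aligned}$$

so the edge \(( parent (\ell ), \ell )\) carries an implicit suffix node. Its lowest one is \(\langle {\ell },{3}\rangle\), which spells the prefix of \(T[2\ldots 8]\) of length \(depth (\ell ) - h = 7 - 3 = 4\), namely \(\mathtt {bbab} = x[1\ldots ]\); this suffix indeed occurs exactly twice, at positions 2 and 5.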

4.1.2 Deleting the Leftmost Character

When the leftmost character \(T[i-1]\) is deleted from \(T[i-1\ldots j]\), we first update the suffix tree and compute \(\mathsf {pp}_{i, j}\) by using the sliding window algorithm for the suffix tree [21]. Each pair of position pointers for the edge-labels of the suffix tree can be maintained in amortized O(1) time so that these pointers always refer to positions within the current sliding window, by a simple batch update technique (see [21] for details). After that, we compute \(\mathsf {tp}_{i,j}\) and \(\mathsf {sp}_{i, j}\) in a similar way to the case of appending a new character shown previously.

It follows from the above arguments in this subsection that we can update the suffix tree and the three active points in amortized \(O(\log \sigma ')\) time, each time the window is shifted by one character.

4.2 Computing \(sqPref _{i-1,j}\)

In order to compute the changes of MUSs when the leftmost character \(T[i-1]\) is deleted from \(T[i-1\ldots j]\), we use \(sqPref _{i-1,j}\) (cf. Lemmas 7 and 9) before updating the suffix tree. In this subsection, we present an efficient algorithm for computing \(sqPref _{i-1,j}\). First, we consider the following cases (see Fig. 9), where \(\ell\) is the leaf corresponding to \(T[i-1\ldots j]\):

Case A:

\(hed ( lrSuf _{i-1,j}) = \ell\).

Case B:

\(hed ( lrSuf _{i-1,j}) \ne \ell\) and \(subtree ( parent (\ell ))\) has more than two leaves.

Case C:

\(hed ( lrSuf _{i-1,j}) \ne \ell\) and \(subtree ( parent (\ell ))\) has exactly two leaves.

Fig. 9

Illustration for the three cases that are described in Sect. 4.2

For Case A, the next lemma holds:

Lemma 11

Suppose that \(\mathsf {STree}(T[i-1\ldots j])\) and \(\mathsf {pp}_{i-1, j}\) are given. Let \(\ell\) be the leaf corresponding to \(T[i-1\ldots j]\). If \(\mathsf {pp}_{i-1,j}\) is on the edge \(( parent (\ell ), \ell )\), the following propositions hold:

  1. (a)

    \(occ _{T[i-1\ldots j]}( sqPref _{i-1, j}) = \{i-1, j-| lrSuf _{i-1,j}|+1\}\).

  2. (b)

    If there is exactly one implicit suffix node on \(( parent (\ell ),\ell )\), \(sqPref _{i-1,j} = T[i-1\ldots i-1+ depth ( parent (\ell ))]\).

  3. (c)

    If there is more than one implicit suffix node on \(( parent (\ell ),\ell )\), then \(| lrSuf _{i-1,j}| > \lfloor (j-i+2)/2 \rfloor\) and \(sqPref _{i-1,j} = T[i-1\ldots j-2h+1]\), where \(\mathsf {pp}_{i-1,j} = \langle {\ell },{h}\rangle\).

Proof

Let \(\mathsf {pp}_{i-1,j} = \langle {\ell },{h}\rangle\) and \(m =| lrSuf _{i-1,j}|\).

  1. (a)

    Since \(\mathsf {pp}_{i-1,j}\) is on the edge \(( parent (\ell ), \ell )\), \(sqPref _{i-1, j}\) is a prefix of \(lrSuf _{i-1,j}\), and \(\# occ _{T[i-1\ldots j]}( lrSuf _{i-1,j}) = \# occ _{T[i-1\ldots j]}( sqPref _{i-1,j}) = 2\). Therefore, we obtain that \(occ _{T[i-1\ldots j]}( sqPref _{i-1, j}) = occ _{T[i-1\ldots j]}( lrSuf _{i-1,j}) = \{i-1, j-m+1\}\).

  2. (b)

    In this case, it is clear that \(sqPref _{i-1,j} = T[i-1\ldots i-1+ depth ( parent (\ell ))]\).

  3. (c)

    Let \(\langle {\ell },{h'}\rangle\) be the locus of the implicit suffix node which is the lowest on the edge \(( parent (\ell ),\ell )\) except \(\mathsf {pp}_{i-1,j}\). Also, let x be the string corresponding to the locus \(\langle {\ell },{h'}\rangle\). In this case, x occurs exactly three times in \(T[i-1\ldots j]\). Also, x is the longest border of \(lrSuf _{i-1,j}\). Assume to the contrary that \(m \le \lfloor (j-i+2)/2 \rfloor\). Then, the two occurrences of \(lrSuf _{i-1,j}\) in \(T[i-1\ldots j]\) do not overlap, and thus \(\# occ _{T[i-1\ldots j]}(x) \ge 2 \times \# occ _{T[i-1\ldots j]}( lrSuf _{i-1,j}) = 4\), which is a contradiction. Therefore, \(m > \lfloor (j-i+2)/2 \rfloor\) (see Fig. 10). Next, we consider the relation between h and \(h'\). By definition, \(h = |T[i-1\ldots j]| - m = j - i + 2 - m\). Since \(m > \lfloor (j-i+2)/2 \rfloor\), x matches the intersection of the two occurrences of \(lrSuf _{i-1,j}\), i.e., \(x = T[j-m+1\ldots i+m-2]\). Thus, \(h' = |T[i-1\ldots j]| - |x| = j-i+2-(2m-j+i-2) = 2(j-i+2-m) = 2h\). Therefore \(sqPref _{i-1,j} = T[i-1\ldots j-h'+1] = T[i-1\ldots j-2h+1]\).

Fig. 10

Illustration for the proposition (c) in Lemma 11. For the sake of simplicity, this figure shows a simple case where there are only two implicit suffix nodes on the edge \((parent(\ell ), \ell )\). However, the lemma also holds for the other cases
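A hand-worked instance of proposition (c) may help (an illustration added here; the window is chosen for convenience and positions are 0-based). Let \(T[i-1\ldots j] = \mathtt {abaabaab}\) with \(i-1 = 0\) and \(j = 7\). Then \(lrSuf _{i-1,j} = \mathtt {abaab}\) occurs at positions 0 and 3, so \(m = 5 > \lfloor 8/2 \rfloor = 4\) and \(h = 8 - 5 = 3\). The two occurrences overlap in \(x = T[3\ldots 4] = \mathtt {ab}\), which occurs three times (at positions 0, 3, and 6), and

$$\begin{aligned} h' = |T[i-1\ldots j]| - |x| = 8 - 2 = 6 = 2h, \qquad sqPref _{i-1,j} = T[0\ldots j-2h+1] = T[0\ldots 2] = \mathtt {aba}. \end{aligned}$$

Indeed, \(\mathtt {ab}\) occurs three times while \(\mathtt {aba}\) occurs exactly twice, at positions \(\{0, 3\} = \{i-1, j-m+1\}\), in agreement with proposition (a).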

In Case B, it is clear that \(sqPref _{i-1, j} = T[i-1\ldots i-1+ depth (p)]\), where \(p = parent (\ell )\), since \(str (p)\) occurs at least three times in \(T[i-1\ldots j]\) (see Fig. 9).

For Case C, the next lemma holds:

Lemma 12

Suppose that \(\mathsf {STree}(T[i-1\ldots j])\) and \(\mathsf {pp}_{i-1, j}\) have already been computed. Let \(\ell\) be the leaf corresponding to \(T[i-1\ldots j]\), \(p = parent (\ell )\), and \(q = parent (p)\). If \(subtree (p)\) has exactly two leaves and there are no implicit suffix nodes on any edges in \(subtree (p)\), then it can be determined in O(1) time whether there is an implicit suffix node on (q, p). If such an implicit node exists, then the locus of the lowest implicit suffix node on (q, p) can be computed in O(1) time.

Proof

Note that the suffix corresponding to the lowest implicit suffix node on (q, p) occurs exactly three times in \(T[i-1\ldots j]\) from the assumptions. Let \(\mathsf {pp}_{i-1,j} = \langle {u},{h}\rangle\). If \(h = 0\), the primary active point is an explicit node, and there is no implicit suffix node on any edge in \(\mathsf {STree}(T[i-1\ldots j])\). If \(h \ne 0\) and \(u = p\), the lowest implicit suffix node on (q, p) is clearly the primary active point. Thus, in the following, we consider the situation with \(u \ne p\) and \(h \ne 0\).

If u is not a leaf and the number of leaves in \(subtree (u)\) is greater than two, then the number of leaves in \(subtree ( hed (v))\) is also greater than two for each implicit suffix node v. Thus, there is no implicit suffix node on (q, p). If u is not a leaf and the number of leaves in \(subtree (u)\) is exactly two, then \(lrSuf _{i-1,j}\) occurs at least three times in \(T[i-1\ldots j]\) since \(u \ne p\). Thus, if a suffix s of \(T[i-1\ldots j]\) which is shorter than \(lrSuf _{i-1,j}\) occurs as a prefix of \(T[i-1\ldots j]\), \(\# occ _{T[i-1\ldots j]}(s) \ge 4\). Therefore, there is no implicit suffix node on (q, p).

If u is a leaf, as in the proof of Lemma 10, it can be proven that there is an implicit suffix node on (q, p) if and only if \(t \ge s\) and \(depth (p)> | lrSuf _{i-1,j}|-(t-s) > depth (q)\), where \(s = start (u)\), \(t = start (\ell ')\) with \(\ell '\) being the sibling of \(\ell\) (see Fig. 11).

Fig. 11

Illustration for Lemma 12

In addition, if there is an implicit suffix node on the edge (q, p), the length of the string x corresponding to the lowest implicit suffix node on the edge (q, p) is \(| lrSuf _{i-1,j}|-(t-s)\), and thus, the implicit suffix node is \(\langle {p},{ depth (p)-|x|}\rangle = \langle {p},{ depth (p) - | lrSuf _{i-1,j}| + t - s}\rangle\).

We can design an algorithm for computing \(sqPref _{i-1,j}\) by using the above lemmas, as follows. Let \(\ell\) be the leaf corresponding to \(T[i-1\ldots j]\), \(p = parent (\ell )\) and \(q = parent (p)\).

In Case A:

\(sqPref _{i-1, j}\) is computed by Lemma 11.

In Case B:

\(sqPref _{i-1, j} = T[i-1\ldots i-1+ depth (p)]\) and \(\# occ _{T[i-1\ldots j]}( sqPref _{i-1,j}) = 1\).

In Case C:

We divide this case into some subcases by the existence of an implicit suffix node on edges \((p, \ell ')\) and (q, p) where \(\ell '\) is the sibling of \(\ell\). We first determine the existence of an implicit suffix node on \((p, \ell ')\) (by Lemma 10).

  • If there is an implicit suffix node on \((p, \ell ')\), then \(sqPref _{i-1,j} = T[i-1\ldots i-1 + depth (p)]\) and \(\# occ _{T[i-1\ldots j]}( sqPref _{i-1,j}) = 1\).

  • If there is no implicit suffix node on either \((p, \ell )\) or \((p, \ell ')\), we can determine in constant time the existence of an implicit suffix node on (q, p) (by Lemma 12). If there is an implicit suffix node on (q, p), let \(\langle {p},{h}\rangle\) be the locus of the lowest one; then \(sqPref _{i-1,j} = T[i-1\ldots i-1+ depth (p)-h]\) and \(occ _{T[i-1\ldots j]}( sqPref _{i-1,j}) = \{i-1, start (\ell ')\}\). Otherwise, \(sqPref _{i-1,j} = T[i-1\ldots i-1+ depth (q)]\) and \(occ _{T[i-1\ldots j]}( sqPref _{i-1,j}) = \{i-1, start (\ell ')\}\).

It follows from the above arguments in this subsection that \(sqPref _{i-1, j}\) can be computed in O(1) time by using the suffix tree and the (primary) active point.

4.3 Detecting MUSs to be Added/Deleted

By using the aforementioned lemmas in this section, we can design an efficient algorithm for detecting MUSs to be added/deleted.

4.3.1 Data Structure for Maintaining MUSs

First, we introduce a data structure for managing the set of MUSs for a sliding window. Our data structure for MUSs consists of two arrays \(\mathsf {S2E}\) and \(\mathsf {E2S}\) of length d each. Note that by the definition of MUSs, no MUS can be nested in another. Thus, for any text position i, if a MUS starting (resp. ending) at i exists, then its ending (resp. starting) position is unique. From this fact, we can define \(\mathsf {S2E}\) and \(\mathsf {E2S}\) as follows:

Let \([p, p+d-1]\) be the current window. For every index i with \(p \le i \le p+d-1\),

$$\begin{aligned} \mathsf {S2E}[i\bmod d]= & {} {\left\{ \begin{array}{ll} e &{} \text {if}\,[i,e]\in \mathsf {MUS}(T[p\ldots p+d-1])\hbox { exists},\\ nil &{} \text {otherwise.} \end{array}\right. }\\ \mathsf {E2S}[i\bmod d]= & {} {\left\{ \begin{array}{ll} s &{} \text {if}\, [s,i]\in \mathsf {MUS}(T[p\ldots p+d-1])\hbox { exists,}\\ nil &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

Since MUSs cannot be nested in each other, these arrays are uniquely defined (see Fig. 12). By using these two arrays, all the following operations for MUSs can be executed in O(1) time: compute the ending/starting position of the MUS that starts/ends at a specified position, and add/remove a MUS into/from the set of MUSs. In particular, when a MUS \([s_r, e_r]\) is removed from the set of MUSs, we set \(\mathsf {S2E}[s_r \bmod d] = \mathsf {E2S}[e_r \bmod d] = nil\). Also, when a MUS \([s_a, e_a]\) is added into the set of MUSs, we set \(\mathsf {S2E}[s_a \bmod d] = e_a\) and \(\mathsf {E2S}[e_a \bmod d] = s_a\).

Fig. 12

A long string \(T = \mathtt {babbabababbbba}\cdots\) and two arrays \(\mathsf {S2E}\) and \(\mathsf {E2S}\). The current window is \(T[2\ldots 11]\) of length \(d = 10\), and the MUSs in the window are \(T[2\ldots 4], T[4\ldots 8], T[8\ldots 10]\), and \(T[9\ldots 11]\)
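The two arrays can be sketched as a small class. This is a minimal illustration under the assumption that `None` plays the role of nil; the class and method names are hypothetical, and the example loads the four MUSs listed in the caption of Fig. 12.

```python
class MusArrays:
    """Circular arrays S2E and E2S for a window of length d (None plays nil)."""

    def __init__(self, d):
        self.d = d
        self.s2e = [None] * d  # s2e[s mod d] = end of the MUS starting at s
        self.e2s = [None] * d  # e2s[e mod d] = start of the MUS ending at e

    def add(self, s, e):
        """Insert the MUS [s, e] in O(1) time."""
        self.s2e[s % self.d] = e
        self.e2s[e % self.d] = s

    def remove(self, s, e):
        """Delete the MUS [s, e] in O(1) time."""
        self.s2e[s % self.d] = None
        self.e2s[e % self.d] = None

    def end_of(self, s):
        """Ending position of the MUS starting at s, or None."""
        return self.s2e[s % self.d]

    def start_of(self, e):
        """Starting position of the MUS ending at e, or None."""
        return self.e2s[e % self.d]


# Window T[2..11] of length d = 10 from Fig. 12.
arrays = MusArrays(10)
for s, e in [(2, 4), (4, 8), (8, 10), (9, 11)]:
    arrays.add(s, e)
print(arrays.end_of(4), arrays.start_of(11))
```

Because positions are reduced modulo d, the entries for ending positions 10 and 11 land in slots 0 and 1; this is safe only while every stored position lies inside the current window, which is the invariant the text relies on.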

4.3.2 Algorithm When Appending a Character to the Right

Assume that \(\mathsf {S2E}\), \(\mathsf {E2S}\) and the suffix tree of \(T[i\ldots j]\) are computed before reading \(\gamma = T[j+1]\). Also, assume that the longest single character run \(\beta ^e\) as a suffix of \(T[i\ldots j]\) is known, where \(\beta = T[j]\) and \(e \ge 1\).

  • First, compute the length of \(lrSuf _{i, j}\).

  • Second, read \(\gamma\), and update the suffix tree and the active points. Then, compute the lengths of \(lrSuf _{i, j+1}\) and \(sqSuf _{i, j+1}\). Also, update the information about the run of the last character of \(T[i\ldots j+1]\): if \(\gamma = \beta\) then the run becomes \(\gamma ^{e+1}\), and otherwise it becomes \(\gamma ^1\). If \(| lrSuf _{i, j+1}| \le | lrSuf _{i, j}|\) or \(T[j+1-| lrSuf _{i, j+1}|\ldots j+1] = \gamma ^{| lrSuf _{i, j+1}|+1}\), add \([j+1-| lrSuf _{i, j+1}|, j+1]\) into the set of MUSs (by Lemma 4).

  • If \(| lrSuf _{i, j+1}| < | sqSuf _{i, j+1}|\), then terminate this step (by Lemma 5).

  • Otherwise, compute \(p_s\) and q of Lemma 6 by using \(\mathsf {STree}(T[i\ldots j+1])\) and \(\mathsf {sp}_{i, j+1}\). Then, remove \([p_s, q]\) from the set of MUSs (by Lemma 3).

  • Next, if \(\mathsf {E2S}[t'\bmod d] = nil\), then add \([p_s, t']\) into the set of MUSs, where \(t' = q + 1\). Also, if \(s' \ge i\) and \(\mathsf {S2E}[s'\bmod d] = nil\), then add \([s', q]\) into the set of MUSs, where \(s' = q - | lrSuf _{i, j+1}|\) (by Lemma 6).

  • Terminate this step.

4.3.3 Algorithm When Deleting the Leftmost Character

Assume that \(\mathsf {S2E}\), \(\mathsf {E2S}\) and the suffix tree of \(T[i-1\ldots j]\) are computed before deleting \(\alpha = T[i-1]\).

  • First, compute \(\# occ _{T[i-1\ldots j]}( sqPref _{i-1,j})\). If \(\# occ _{T[i-1\ldots j]}( sqPref _{i-1,j}) = 2\), compute two integers s and t with \(T[s\ldots t] = sqPref _{i-1,j}\) and \(s \ne i-1\).

  • Second, delete \(T[i-1]\) and update the suffix tree and the active points. If \(\mathsf {S2E}[(i-1)\bmod d] \ne nil\), remove the MUS starting at \(i-1\) from the set of MUSs.

  • If \(\# occ _{T[i-1\ldots j]}( sqPref _{i-1,j}) = 1\), terminate this step (by Lemma 8).

  • Otherwise, if \(\mathsf {S2E}[s\bmod d] \ne nil\), then remove the MUS starting at s from the set of MUSs. Also, if \(\mathsf {E2S}[t\bmod d] \ne nil\), then remove the MUS ending at t from the set of MUSs (by Lemma 9).

  • Finally, add [s, t] into the set of MUSs (by Lemma 7), and terminate this step.
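The deletion steps above can be exercised with a naive sketch. It is a simplified stand-in, not the algorithm of this section: \(sqPref _{i-1,j}\) is found by direct counting instead of the suffix tree, it is assumed to be the shortest prefix of the window that occurs at most twice (inferred from how it is used here), and Python dictionaries replace the circular arrays \(\mathsf {S2E}\) and \(\mathsf {E2S}\). All function names are hypothetical.

```python
def occurrences(window, pattern):
    """Start positions (relative to window) of all occurrences of pattern."""
    m = len(pattern)
    return [k for k in range(len(window) - m + 1) if window[k:k + m] == pattern]


def sq_pref(window):
    """Assumed definition: shortest prefix of window occurring at most twice.

    Returns (length, occurrence positions relative to the window).
    """
    for length in range(1, len(window) + 1):
        occ = occurrences(window, window[:length])
        if len(occ) <= 2:
            return length, occ
    raise ValueError("window must be non-empty")


def delete_leftmost(T, i, j, s2e, e2s):
    """Turn the MUSs of T[i-1..j] (dicts s2e, e2s) into those of T[i..j]."""
    # Step 1: inspect sqPref of the old window before deleting anything.
    length, occ = sq_pref(T[i - 1:j + 1])
    # Step 2: a MUS starting at the deleted position disappears.
    if i - 1 in s2e:
        e2s.pop(s2e.pop(i - 1))
    # Step 3: if sqPref is unique, nothing else changes (Lemma 8).
    if len(occ) == 1:
        return
    # Step 4: T[s..t] is the occurrence of sqPref with s != i-1 (Lemma 9).
    s = (i - 1) + occ[1]
    t = s + length - 1
    if s in s2e:
        e2s.pop(s2e.pop(s))
    if t in e2s:
        s2e.pop(e2s.pop(t))
    # Step 5: T[s..t] becomes a new MUS (Lemma 7).
    s2e[s] = t
    e2s[t] = s


def brute_force_mus(T, i, j):
    """Reference MUS(T[i..j]) by exhaustive search, for validation only."""
    window = T[i:j + 1]

    def unique(s, t):
        return s <= t and len(occurrences(window, T[s:t + 1])) == 1

    return {(s, t) for s in range(i, j + 1) for t in range(s, j + 1)
            if unique(s, t) and not unique(s + 1, t) and not unique(s, t - 1)}


T = "babbabababbbba"
s2e = dict(brute_force_mus(T, 2, 11))   # MUSs of the window T[2..11]
e2s = {e: s for s, e in s2e.items()}
delete_leftmost(T, 3, 11, s2e, e2s)     # delete T[2]
print(sorted(s2e.items()))
```

The sketch spends polynomial time per step, so it reproduces only the update logic and not the time bounds; comparing its result with `brute_force_mus(T, i, j)` on short strings is a convenient way to test an efficient implementation.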

The main result of this section is the following:

Theorem 3

We can maintain the set of MUSs for a sliding window of length d on a string T of length n in a total of \(O(n\log \sigma ')\) time and O(d) working space, where \(\sigma '\le d\) is the maximum number of distinct characters in any window.

Corollary 2

There exists an online algorithm to compute all MUSs in a string T of length n in a total of \(O(n\log \sigma )\) time with O(n) working space where \(\sigma\) is the alphabet size.

5 Conclusions and Future Work

In this paper, we studied the problem of computing MUSs for a sliding window over a given string T of length n. We first showed combinatorial properties of MUSs for a sliding window, i.e., the set of MUSs changes by at most a constant number of elements when a character is appended to the right end of the window or the first character is deleted from the window. Also, we proposed an \(O(n\log \sigma ')\)-time and O(d)-space algorithm to compute MUSs for a sliding window of size d over T, where \(\sigma '\le d\) is the maximum number of distinct characters in any window.

As future work, we are interested in developing a data structure for the SUS problems for a sliding window. As we described in the introduction, MUSs are heavily utilized for solving the SUS problems. Our sliding window MUS algorithm could be used as a basis for an efficient SUS query data structure for a sliding window. Also, it would be interesting to extend or generalize MUSs for a sliding window, e.g., to computing MUSs with k-mismatches for a sliding window. A substring of T is said to be unique with k-mismatches in T, if it is unique in T even when substituting arbitrary k characters of the substring. To the best of our knowledge, the only known deterministic algorithm to compute unique substrings with k-mismatches is that of [10], which runs in \(O(n^2)\) time for any \(k \ge 1\) in an offline manner. An interesting open question is: Can we design an online deterministic algorithm which computes MUSs with k-mismatches in sub-quadratic time?