Computing Minimal Unique Substrings for a Sliding Window

A substring u of a string T is called a minimal unique substring (MUS) of T if u occurs exactly once in T and any proper substring of u occurs at least twice in T. In this paper, we study the problem of computing MUSs for a sliding window over a given string T. We first show how the set of MUSs can change when the window slides over T. We then present an O(n log σ′)-time and O(d)-space algorithm to compute MUSs for a sliding window of size d over the input string T of length n, where σ′ ≤ d is the maximum number of distinct characters in every window.


Minimal Unique Substrings and Shortest Unique Substrings
A unique substring of string T is a substring of T which appears exactly once in T. Finding unique substrings of DNA sequences has gained attention in bioinformatics [8,9,15,24]. For example, it can be applied in PCR primer design [24] and alignment-free genome comparison [9].
In the last decade, problems that relate to computing unique substrings in a given string have been studied in the field of string algorithmics. A unique substring u of T is said to be a minimal unique substring (MUS) of T if any proper substring of u is not a unique substring. Ilie and Smyth [13] formalized MUSs and proposed a linear time algorithm to compute all MUSs of a given string T.
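To make the definition of a MUS concrete, the following is a minimal brute-force sketch in Python. It is not the linear-time algorithm of Ilie and Smyth [13]; the function names `count_occurrences` and `minimal_unique_substrings` are hypothetical and chosen only for illustration. The sketch relies on the fact that a unique substring is minimal exactly when the two substrings obtained by dropping its first or its last character both occur at least twice.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        # Convention assumed here: the empty string occurs |T| + 1 times.
        return len(text) + 1
    return sum(
        1
        for start in range(len(text) - len(pattern) + 1)
        if text.startswith(pattern, start)
    )


def minimal_unique_substrings(text: str) -> list:
    """Return all MUSs of text as inclusive (start, end) position pairs."""
    result = []
    n = len(text)
    for start in range(n):
        for end in range(start, n):
            if count_occurrences(text, text[start:end + 1]) != 1:
                continue  # still repeating, try a longer substring
            # text[start..end] is the shortest unique substring starting here.
            # It is a MUS iff both maximal proper substrings are repeating.
            without_first = text[start + 1:end + 1]
            without_last = text[start:end]
            if (count_occurrences(text, without_first) >= 2
                    and count_occurrences(text, without_last) >= 2):
                result.append((start, end))
            break  # any longer substring contains this unique one
    return result


print(minimal_unique_substrings("abab"))
```

The sketch takes roughly cubic time and is meant only as a reference against which faster methods can be checked on small inputs.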
MUSs have been heavily utilized for solving the shortest unique substring (SUS) problems: A unique substring v = T[s … t] of T is said to be a shortest unique substring (SUS) of T for a text position p if v contains the position p (i.e., p ∈ [s, t]) and any proper substring of v which contains p is not a unique substring. The single-SUS problem is to preprocess a given string T of length n so that for any subsequent query position p, a SUS for p can be answered quickly. Pei et al. [20] introduced the single-SUS problem and gave an O(n²)-time preprocessing scheme which can answer single-SUS queries in constant time. Tsuruta et al. [22], Ileri et al. [12], and Hu et al. [11] independently showed O(n)-time preprocessing schemes which can answer single-SUS queries in constant time. Also, Hon et al. [10] proposed an in-place algorithm for computing SUSs for all positions in linear time. The all-SUS problem is a generalization of the single-SUS problem which requires outputting all SUSs for a given position p. The methods of Tsuruta et al. [22] and Hu et al. [11] can answer all-SUS queries in O(occ) time, where occ is the number of SUSs to output. Note that the SUS problem studied by Hu et al. [11] is more general; that is, they find SUSs covering a given interval in the string, instead of a text position. Moreover, Mieno et al. [16] considered the all-SUS problem on a run-length encoded string, and proposed an O(r)-space data structure which can answer all-SUS (interval) queries in O(√(log r / log log r) + occ) time, where r is the size of a given run-length encoded string. Although not mentioned explicitly in [16], the size of their data structure (except for the input string) and query time can be respectively written as O(m) space and O(√(log m / log log m) + occ) time with respect to the number m of MUSs of the input string T.
Note that all the above algorithms for the SUS problems compute all MUSs of the given string (or some data structure which is essentially equivalent to MUSs) in the preprocessing. We also refer to [1,3,7,17] for related results on the SUS problems.

Sliding Window Model
In this paper, we tackle the problem of computing MUSs in the sliding window model. In the sliding window model, the input string is given in an online fashion, one character at a time from left to right, and the memory usage is limited to some pre-determined space. The task of the sliding window model is to process all substrings T[i … i + d − 1] of pre-fixed length d in a string T of length n in an incremental fashion, for increasing i = 0, … , n − d. Usually the window size d is set to be much smaller than the string length n, and thus the challenge here is to design efficient algorithms that process all such substrings using only O(d) working space. A typical application of the sliding window model is data compression; examples are the famous Lempel-Ziv 77 (the original version) [25] and PPM [4]. Recently, Crochemore et al. [5] introduced the problem of computing Minimal Absent Words for a sliding window, and proposed an O(nσ)-time and O(dσ)-space algorithm using suffix trees for a sliding window, where σ is the alphabet size. This paper deals with the problem of computing MUSs in a sliding window. This problem can be directly applied to compute the uniqueness score of oligonucleotides for designing tiling arrays [8].
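The access pattern of the sliding window model can be illustrated with a short Python sketch. It only shows how the windows T[i … i + d − 1] are exposed one at a time while holding at most d characters in memory; the generator name `sliding_windows` is hypothetical, and materializing each window as a string is a simplification made here for readability.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def sliding_windows(stream: Iterable[str], d: int) -> Iterator[Tuple[int, str]]:
    """Yield (i, T[i .. i+d-1]) for i = 0, ..., n-d, reading T online."""
    window = deque(maxlen=d)  # the oldest character is dropped automatically
    for j, ch in enumerate(stream):
        window.append(ch)
        if len(window) == d:
            # j is the right end of the window, so its left end is j - d + 1.
            yield j - d + 1, "".join(window)


for i, w in sliding_windows("abcde", 3):
    print(i, w)
```

A bounded `deque` is used so that the buffer never exceeds d characters, which mirrors the O(d) working-space requirement of the model.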

Our Contributions
We begin with combinatorial results on MUSs for a sliding window. Namely, we show that the number of MUSs that are added or deleted by one slide of the window is always constant (Sect. 3). We then present the first efficient algorithm that maintains the set of MUSs for a sliding window of length d over a string of length n in a total of O(n log σ′) time and O(d) working space, where σ′ ≤ d is the maximum number of distinct characters in every window (Sect. 4). Our main algorithmic tool is the suffix tree for a sliding window, which requires O(d) space and can be maintained in O(n log σ′) time [6,14,21]. Our algorithm for computing MUSs for a sliding window is built on our combinatorial results, and it keeps track of three different loci over the suffix tree, all of which can be maintained in O(log σ′) amortized time per sliding step.
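The claim that only a constant number of MUSs change per slide can be explored empirically with a brute-force sketch such as the one below. It recomputes the MUS set of every window from scratch, so it is far slower than the algorithm of this paper; the helper names `occ`, `mus_set`, and `slide_changes` are hypothetical.

```python
def occ(text: str, pattern: str) -> int:
    """Number of (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return len(text) + 1  # convention assumed for the empty string
    return sum(text.startswith(pattern, k)
               for k in range(len(text) - len(pattern) + 1))


def mus_set(text: str, lo: int, hi: int) -> set:
    """MUSs of the window text[lo..hi] as absolute inclusive (start, end) pairs."""
    window = text[lo:hi + 1]
    found = set()
    for s in range(len(window)):
        for t in range(s, len(window)):
            if occ(window, window[s:t + 1]) == 1:
                # Shortest unique substring starting at s; test minimality.
                if (occ(window, window[s + 1:t + 1]) >= 2
                        and occ(window, window[s:t]) >= 2):
                    found.add((lo + s, lo + t))
                break
    return found


def slide_changes(text: str, d: int) -> list:
    """For each one-step slide, return the pair (added MUSs, deleted MUSs)."""
    changes = []
    previous = mus_set(text, 0, d - 1)
    for i in range(1, len(text) - d + 1):
        current = mus_set(text, i, i + d - 1)
        changes.append((current - previous, previous - current))
        previous = current
    return changes


print(slide_changes("abab", 3))
```

Here one slide combines appending a character and deleting the leftmost one, so each reported pair aggregates the effect of both operations.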
A part of the results reported in this article appeared in a preliminary version of this paper, [18]. The preliminary paper [18] consists of two parts: (1) efficient computation and combinatorial properties of MUSs for a sliding window, and (2) combinatorial properties of minimal absent words (MAWs) [19] for a sliding window. This current article is a full version of the former part (1) which contains complete proofs and supplemental figures which were omitted in the preliminary version [18]. We remark that an extended version of the latter part (2) can be found as an independent article [2].

Strings
Let Σ be an alphabet of size σ. An element of Σ is called a character. An element of Σ* is called a string. The length of a string T is denoted by |T|. The empty string ε is the string of length 0. If T = xyz, then x, y, and z are called a prefix, substring, and suffix of T, respectively. They are called a proper prefix, proper substring, and proper suffix of T if x ≠ T, y ≠ T, and z ≠ T, respectively. If a string b is a proper prefix of T and is a proper suffix of T, b is called a border of T.
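For any 0 ≤ i ≤ j ≤ |T| − 1, T[i … j] denotes the substring of T starting at position i and ending at position j, i.e., T[i … j] = T[i] ⋯ T[j]. For a non-empty string w, the set of beginning positions of occurrences of w in T is denoted by occ_T(w), and #occ_T(w) = |occ_T(w)| denotes the number of occurrences of w in T. For convenience, let #occ_T(ε) = |T| + 1, and thus, ε is always repeating in any non-empty string. For any 0 ≤ i ≤ j ≤ n − 1, lrSuf_{i,j} denotes the longest repeating suffix of T[i … j], sqSuf_{i,j} denotes the shortest suffix of T[i … j] that occurs at most twice in T[i … j], and sqPref_{i,j} denotes the shortest prefix of T[i … j] that occurs at most twice in T[i … j]. In what follows, we consider an arbitrarily fixed string T of length n ≥ 1 over an alphabet Σ of size σ ≥ 2.
Fig. 1 A string T of length 14 and its substrings lrSuf_{2,11}, sqSuf_{2,11}, and sqPref_{2,11} for the current window T[2 … 11]
This paper deals with the problem of computing MUSs for a sliding window of fixed length d over a given string T, formalized as follows: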

Input
String T of length n and positive integer d (< n).

Suffix Trees
The suffix tree of string T, denoted STree(T), is a compacted trie that represents all suffixes of T. We consider a version of suffix trees a.k.a. Ukkonen trees [23]: Namely, STree(T) is a rooted tree such that
1. each edge is labeled by a non-empty substring of T,
2. each internal node has at least two children,
3. the out-going edges of each node begin with mutually distinct characters, and
4. the suffixes of T that are unique in T are represented by paths from the root to the leaves, and the other suffixes of T that are repeating in T are represented by paths from the root that end either on internal nodes or on edges.
To simplify the description of our algorithm, we assume that there is an auxiliary node ⟂ which is the parent of only the root node. The out-going edge of ⟂ is labeled with Σ; this means that we can go down from ⟂ by reading any character in Σ. See Fig. 2 for an example of STree(T). For each node v in STree(T), parent(v) denotes the parent of v, str(v) denotes the path string from the root to v, depth(v) denotes the string depth of v (i.e., depth(v) = |str(v)|), and subtree(v) denotes the subtree of STree(T) rooted at v. For each leaf ℓ in STree(T), start(ℓ) denotes the starting position of str(ℓ) in T. For each non-empty substring w of T, hed(w) = v denotes the highest explicit descendant, where w is a prefix of str(v) and depth(parent(v)) < |w| ≤ depth(v). For each substring w of T, locus(w) = ⟨u, h⟩ represents the locus in STree(T) where the path that spells out w from the root terminates, such that u = hed(w) and h = depth(u) − |w| ≥ 0. Namely, h is the offset length from the child u of the locus for w when w is on an edge, and h = 0 when w is on a node (namely u). We say that a substring w of T with locus(w) = ⟨u, h⟩ is represented by an explicit node if h = 0, and by an implicit node if h ≥ 1. We remark that in the Ukkonen tree STree(T) of a string T, some repeating suffixes may be represented by implicit nodes.
An implicit node which represents a suffix of T is called an implicit suffix node. For any internal node v except for the root, the suffix link of v is a reversed edge from v to the explicit node that represents str(v)[1 …]. The suffix link of the root, which represents ε, points to ⟂.

Combinatorial Results on MUSs for a Sliding Window
Throughout this section, we consider positions i and j with 0 ≤ i ≤ j ≤ n − 1 such that T[i … j] denotes the sliding window for the ith position over the input string T. The following arguments hold for any values of i and j, and hence, they will be useful for sliding windows of any length d. The next lemmas are useful for analyzing combinatorial properties of MUSs and for designing an efficient algorithm for computing MUSs for a sliding window.

Lemma 1
The following three statements are equivalent: (1) lrSuf_{i,j} is at least as long as sqSuf_{i,j}, i.e., |lrSuf_{i,j}| ≥ |sqSuf_{i,j}|. Figure 1 shows a concrete example where (1) of Lemma 1 holds (and hence both (2) and (3) also hold).
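The quantities compared in Lemma 1 can be computed naively for small windows, which may help when checking examples by hand. The sketch below assumes the following readings, which are inferred from how these strings are used later in the paper: lrSuf is the longest suffix of the window occurring at least twice, and sqSuf (resp. sqPref) is the shortest non-empty suffix (resp. prefix) occurring at most twice. The function names are hypothetical.

```python
def occ(text: str, pattern: str) -> int:
    """Number of (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return len(text) + 1  # convention: the empty string occurs |T| + 1 times
    return sum(text.startswith(pattern, k)
               for k in range(len(text) - len(pattern) + 1))


def lr_suf(window: str) -> str:
    """Longest suffix of window that occurs at least twice (may be empty)."""
    n = len(window)
    for length in range(n - 1, -1, -1):
        suffix = window[n - length:]  # n - length avoids the window[-0:] pitfall
        if occ(window, suffix) >= 2:
            return suffix
    return ""


def sq_suf(window: str) -> str:
    """Shortest non-empty suffix of window that occurs at most twice."""
    n = len(window)
    for length in range(1, n + 1):
        suffix = window[n - length:]
        if occ(window, suffix) <= 2:
            return suffix
    return window


def sq_pref(window: str) -> str:
    """Shortest non-empty prefix of window that occurs at most twice."""
    for length in range(1, len(window) + 1):
        if occ(window, window[:length]) <= 2:
            return window[:length]
    return window


w = "abaab"
print(lr_suf(w), sq_suf(w), sq_pref(w), len(lr_suf(w)) >= len(sq_suf(w)))
```

For the window "abaab" the sketch reports lrSuf = "ab" and sqSuf = "b", so statement (1) of Lemma 1 holds for this window.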

Changes to MUSs When Appending a Character to the Right
In this subsection, we consider an operation that slides the right end of the current window T[i … j] by one character, by appending the next character T[j + 1]. We use the following observation.

Observation 1 For any non-empty substring s of T[i … j], #occ_{T[i … j+1]}(s) ≤ #occ_{T[i … j]}(s) + 1.
Also, the equality holds if and only if s is a suffix of T[i … j + 1].
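Observation 1 can be sanity-checked on small examples with the sketch below. It assumes the observation says that appending one character increases the number of occurrences of s by at most one, and by exactly one precisely when s is a suffix of the extended window; the function names are hypothetical.

```python
def occ(text: str, pattern: str) -> int:
    """Number of (possibly overlapping) occurrences of a non-empty pattern."""
    return sum(text.startswith(pattern, k)
               for k in range(len(text) - len(pattern) + 1))


def observation_holds(window: str, next_char: str) -> bool:
    """Check the observation for every non-empty substring of window."""
    extended = window + next_char
    substrings = {window[a:b]
                  for a in range(len(window))
                  for b in range(a + 1, len(window) + 1)}
    for s in substrings:
        gained = occ(extended, s) - occ(window, s)
        expected = 1 if extended.endswith(s) else 0
        if gained != expected:
            return False
    return True


print(observation_holds("abaab", "a"))
```

The only occurrence that can be created by appending a character is the one ending at the new last position, which is why the check compares the gain against a suffix test.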

MUSs to be Deleted When Appending a Character to the Right
Due to Observation 1, we obtain Lemma 3 which describes MUSs to be deleted when a new character T[j + 1] is appended to the current window T[i … j].

Lemma 3 For any
By Lemma 3, at most one MUS can be deleted when appending T[j + 1] to the current window T[i … j], and such a deleted MUS must be sqSuf_{i,j+1}.

MUSs to be Added When Appending a Character to the Right
First, we consider a MUS to be added when appending T[j + 1]. The next observation follows from the definition of lrSuf_{i,j}:
Proof (⇒) Assume on the contrary that T[j + 1 − … j + 1] ≠ +1 and > |lrSuf_{i,j}|. By the assumptions and Lemma 2, |lrSuf_{i,j}| = − 1, and thus, (see Fig. 4).
Next, we consider MUSs to be added when appending T[j + 1]. Then, it follows from Lemma 3 that u = sqSuf_{i,j+1} and u occurs exactly twice in T[i … j + 1].
Namely, a MUS which is not a suffix is added by appending one character only if there is a MUS to be deleted by the same operation. Moreover, such added MUSs must contain the deleted MUS.
Also, the following propositions hold: #occ_{T[i … j+1]}(lrSuf_{i,j+1}) = 2 and sqSuf_{i,j+1} is a suffix of lrSuf_{i,j+1}. Hence, the ending positions of the
From the discussion at the beginning of the proof, the starting positions of the occurrences of lrSuf_{i,j+1} are p_l and j + 2 − |lrSuf_{i,j+1}| (see also , the starting position of v must be p_l − 1. This contradicts the assumption that there is no MUS of T[i … j] starting at p_l − 1. Thus,
Now we have the main result of this subsection: . Also, we show that the upper bound is tight if σ ≥ 3. For an integer k ≥ 2, we consider two strings u and u′ such that u = k of length k + 3 ≥ 5 and u′ = u = k of length k + 4 ≥ 6. Then,

Changes to MUSs When Deleting the Leftmost Character
In this subsection, we consider an operation that deletes the leftmost character T[i − 1] from the current window T[i − 1 … j]. Basically, we can use arguments symmetric to those of the previous subsection, where we considered appending a character to the right of the window.

MUSs to be Deleted When Deleting the Leftmost Character
Next, we consider MUSs to be deleted by removing T[i − 1]. Then, we consider MUSs to be deleted which are not prefixes of T[i − 1 … j].

j is a proper substring of T[s … t].
Proof Symmetric to the proof of Lemma 5.
Namely, when deleting the leftmost character, a MUS which is not a prefix is deleted only if an added MUS exists. Moreover, such deleted MUSs must contain the added MUS.
where T[s … t] = sqPref_{i−1,j} and s ≠ i − 1.
Proof Symmetric to the proof of Lemma 6. See also Fig. 6 for illustration.
The main result of this subsection is the following: Furthermore, these bounds are tight for any σ, i, j with σ ≥ 3, 0 < i ≤ j ≤ n − 1, and j − i + 1 ≥ 5.
Proof Symmetric to the proof of Theorem 1.
The next corollary is immediate from Theorems 1 and 2.

Algorithm for Computing MUSs for a Sliding Window
This section presents our algorithm for computing MUSs for a sliding window.

Updating Suffix Tree and Its Three Loci
First, we introduce some additional notions. Since we use Ukkonen's algorithm [23] for updating the suffix tree when a new character T[j + 1] is appended to the right end of the window T[i … j], we maintain the locus for lrSuf_{i,j} as in [23]. Also, in order to compute the changes of MUSs, we use sqSuf_{i,j} (cf. Lemmas 3 and 6). Thus, we also maintain the locus for sqSuf_{i,j}.
The locus for lrSuf_{i,j} (resp. sqSuf_{i,j}) in STree(T[i … j]) is called the primary active point (resp. the secondary active point) and is denoted by pap_{i,j} (resp. sap_{i,j}). Additionally, in order to maintain sap_{i,j} efficiently, we also maintain the locus for the longest suffix of T[i … j] which occurs at least three times in T[i … j]. We call this locus the tertiary active point, denoted by tap_{i,j}. See Fig. 2 for concrete examples of these three loci in a suffix tree.

Appending One Character
When T[i … j] is the empty string (the base case, where i = 0 and j = −1), we set all three active points to ⟨root, 0⟩. Then we increase j, and the suffix tree grows in an online manner until j = d − 1 using Ukkonen's algorithm. Then, for each j > d − 1, we also increase i each time j increases, so that the sliding window is shifted to the right, by using the sliding window algorithm for the suffix tree [21].
When T[j + 1] is appended to the right end of T[i … j], we first update the suffix tree to STree(T[i … j + 1]) and compute the primary active point pap_{i,j+1}. Since pap_{i,j+1} coincides with the active point, pap_{i,j+1} can be found in amortized O(log σ′) time [21].
• The secondary active point sap_{i,j+1} equals the locus of the suffix stored in w at the penultimate iteration of the while-loop.
Let us show the correctness of the above algorithm. After the first step, w is the longest suffix which possibly corresponds to the tertiary active point tap_{i,j+1}. In the while-loop of the second step, we search for the suffix corresponding to tap_{i,j+1} by deleting the first characters from w one by one. After breaking from the while-loop, w stores the longest suffix of T[i … j + 1] which occurs more than twice in T[i … j + 1], i.e., tap_{i,j+1} = locus(w). Also, by the definitions of the secondary and tertiary active points, the secondary active point sap_{i,j+1} is the locus for the suffix of T[i … j + 1] which is one character longer than w = str(tap_{i,j+1}).
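The relationship between the two loci found by the while-loop can be mimicked on plain strings. The sketch below is a simplification under stated assumptions: it starts from the whole window instead of the (faster) starting candidate used by the algorithm, it counts occurrences naively instead of using the suffix tree, and dropping the first character stands in for following a suffix link. The function name `secondary_and_tertiary` is hypothetical.

```python
def occ(text: str, pattern: str) -> int:
    """Number of (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return len(text) + 1  # convention: the empty string occurs |T| + 1 times
    return sum(text.startswith(pattern, k)
               for k in range(len(text) - len(pattern) + 1))


def secondary_and_tertiary(window: str):
    """Return (shortest suffix occurring at most twice,
               longest suffix occurring at least three times)."""
    if len(window) < 2:
        # With fewer than two characters even the empty suffix occurs fewer
        # than three times, so the loop below would never terminate.
        raise ValueError("window must contain at least two characters")
    w = window          # simplification: start from the whole window
    previous = None
    while occ(window, w) < 3:
        previous = w    # candidate seen at the iteration before the last one
        w = w[1:]       # drop the first character (a suffix-link step)
    return previous, w


print(secondary_and_tertiary("ababab"))
```

Because the number of occurrences never decreases when a suffix gets shorter, the string held just before the loop stops is exactly one character longer than the result of the loop, which matches the correctness argument above.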
As is described in the above algorithm, we can locate the tertiary active point tap_{i,j+1} using suffix links, in a similar manner to the primary active point pap_{i,j+1}. Thus, the time cost for locating tap_{i,j+1} for each increasing j is amortized O(log σ′), again by a similar argument to that for the primary active point. What remains is how to quickly determine, for each candidate w for tap_{i,j+1}, whether #occ_{T[i … j+1]}(w) < 3 or not. In what follows, we show that it can be checked in O(1) time for each candidate.
If u is not a leaf, there is no implicit suffix node on the edge (parent(ℓ), ℓ) for any leaf ℓ, since every suffix of T[i … j + 1] which is shorter than |x| occurs more than twice.
If u is a leaf, then #occ_{T[i … j+1]}(x) = 2. Let s = start(u) and t = start(ℓ) for each leaf ℓ. Notice that x is a border of T[s … j + 1]. There are two sub-cases:
• First, we consider the case where t < s. Suppose, for the sake of contradiction, that there is an implicit suffix node on (parent(ℓ), ℓ). Let w be the string corresponding to the lowest implicit suffix node on (parent(ℓ), ℓ). Then, w is a proper suffix of x, and occurs exactly twice in T[i … j + 1]. Furthermore, w occurs exactly twice in T[s … j + 1] since x is a border of T[s … j + 1]. However, w is also a prefix of T[t … j + 1], and hence w occurs at least three times in T[i … j + 1], which is a contradiction. Thus, if t < s, there is no implicit suffix node on (parent(ℓ), ℓ).
• Second, we consider the case where t ≥ s (see Fig. 8). In this case, Thus, there is an implicit suffix node on (parent(ℓ), ℓ) if and only if . Also, if there is an implicit suffix node on (parent(ℓ), ℓ), the locus of the lowest one is ⟨ℓ, h⟩.

Deleting the Leftmost Character
When the leftmost character T[i − 1] is deleted from T[i − 1 … j], we first update the suffix tree and compute the primary active point pap_{i,j} by using the sliding window algorithm for the suffix tree [21]. Each pair of position pointers for the edge-labels of the suffix tree can be maintained in amortized O(1) time so that these pointers always refer to positions within the current sliding window, by a simple batch update technique (see [21] for details). After that, we compute the tertiary active point tap_{i,j} and the secondary active point sap_{i,j} in a similar way to the case of appending a new character shown previously.
It follows from the above arguments in this subsection that we can update the suffix tree and the three active points in amortized O(log σ′) time, each time the window is shifted by one character.

Computing sqPref i−1,j
In order to compute the changes of MUSs when the leftmost character T[i − 1] is deleted, we use sqPref_{i−1,j} (cf. Lemmas 7 and 9) before updating the suffix tree. In this subsection, we present an efficient algorithm for computing sqPref_{i−1,j}. First, we consider the following cases (see Fig. 9), where ℓ is the leaf corresponding to T[i − 1 … j]:
Case A hed(lrSuf_{i−1,j}) = ℓ.
Case B hed(lrSuf_{i−1,j}) ≠ ℓ and subtree(parent(ℓ)) has more than two leaves.
Case C hed(lrSuf_{i−1,j}) ≠ ℓ and subtree(parent(ℓ)) has exactly two leaves.
For Case A, the next lemma holds:
In Case B, it is clear that since str(p) occurs at least three times in T[i − 1 … j] (see Fig. 9).
For Case C, the next lemma holds:
Proof Note that the suffix corresponding to the lowest implicit suffix node on (q, p) occurs exactly three times in T[i − 1 … j] from the assumptions. Let ⟨u, h⟩ be the primary active point pap_{i−1,j}. If h = 0, the primary active point is an explicit node, and there is no implicit suffix node on any edge in STree(T[i − 1 … j]). If h ≠ 0 and u = p, the lowest implicit suffix node on (q, p) is clearly the primary active point. Thus, in the following, we consider the situation with u ≠ p and h ≠ 0.
If u is not a leaf and the number of leaves in subtree(u) is greater than two, then the number of leaves in subtree(hed(v)) is also greater than two for each implicit suffix node v. Thus, there is no implicit suffix node on (q, p). If u is not a leaf and If u is a leaf, as in the proof of Lemma 10, it can be proven that there is an implicit suffix node on (q, p) if and only if t ≥ s and depth(p) > |lrSuf_{i−1,j}| − (t − s) > depth(q), where s = start(u) and t = start(ℓ′), with ℓ′ being the sibling of ℓ (see Fig. 11).
In addition, if there is an implicit suffix node on the edge (q, p), the length of the string x corresponding to the lowest implicit suffix node on the edge (q, p) is |lrSuf_{i−1,j}| − (t − s), and thus, the implicit suffix node is ⟨p, depth(p) − |x|⟩ = ⟨p, depth(p) − |lrSuf_{i−1,j}| + t − s⟩.
We can design an algorithm for computing sqPref_{i−1,j} by using the above lemmas, as follows. Let ℓ be the leaf corresponding to T[i − 1 … j], p = parent(ℓ), and q = parent(p).
In Case A. sqPref_{i−1,j} is computed by Lemma 11.
and #occ_{T[i−1 … j]}(sqPref_{i−1,j}) = 1.
In Case C. We divide this case into subcases by the existence of an implicit suffix node on the edges (p, ℓ′) and (q, p), where ℓ′ is the sibling of ℓ. We first determine the existence of an implicit suffix node on (p, ℓ′) (by Lemma 10).
If there is an implicit suffix node on (p, ℓ′),
• If there is no implicit suffix node on either (p, ℓ) or (p, ℓ′), we can determine in constant time the existence of an implicit suffix node on (q, p) (by Lemma 12). If there is an implicit suffix node on (q, p),
It follows from the above arguments in this subsection that sqPref_{i−1,j} can be computed in O(1) time by using the suffix tree and the (primary) active point.

Detecting MUSs to be Added/Deleted
By using the aforementioned lemmas in this section, we can design an efficient algorithm for detecting MUSs to be added or deleted.

Data Structure for Maintaining MUSs
First, we introduce a data structure for managing the set of MUSs for a sliding window. Our data structure for MUSs consists of two arrays of length d each. Note that, by the definition of MUSs, MUSs cannot be nested in each other. Thus, for any text position i, if a MUS starting (resp. ending) at i exists, then its ending (resp. starting) position is unique. From this fact, we can define the two arrays as follows: Let [p, p + d − 1] be the current window. For every index i with p ≤ i ≤ p + d − 1, the entry of the first array for i stores the ending position of the MUS starting at i, if such a MUS exists, and the entry of the second array for i stores the starting position of the MUS ending at i, if such a MUS exists.
Fig. 12 A long string T and the two arrays. The current window is T[2 … 11] of length d = 10, and the MUSs in the window are T[2 … 4], T[4 … 8], T[8 … 10], and T[9 … 11]
Since MUSs cannot be nested in each other, these arrays are uniquely defined (see Fig. 12). By using these two arrays, all the following operations for MUSs can be executed in O(1) time: compute the ending/starting position of the MUS that starts/ends at a specified position, and add/remove a MUS into/from the set of MUSs.
The main result of this section is the following: the set of MUSs for a sliding window of length d over a string of length n can be maintained in a total of O(n log σ′) time and O(d) working space, where σ′ ≤ d is the maximum number of distinct characters in every window.
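The two-array structure for MUSs described in this subsection can be sketched in Python as follows. The class name `MusTable`, its method names, the use of `None` for an empty entry, and the indexing of both arrays by text position modulo d are assumptions made for illustration; the sketch also assumes that the caller removes MUSs that leave the window before positions with the same residue modulo d are reused.

```python
class MusTable:
    """Two arrays of length d recording the MUSs of the current window."""

    def __init__(self, d: int) -> None:
        self.d = d
        self.end_by_start = [None] * d  # entry for i: end of the MUS starting at i
        self.start_by_end = [None] * d  # entry for i: start of the MUS ending at i

    def add(self, start: int, end: int) -> None:
        """Record the MUS T[start .. end] in O(1) time."""
        self.end_by_start[start % self.d] = end
        self.start_by_end[end % self.d] = start

    def remove(self, start: int, end: int) -> None:
        """Forget the MUS T[start .. end] in O(1) time."""
        self.end_by_start[start % self.d] = None
        self.start_by_end[end % self.d] = None

    def end_of_mus_starting_at(self, start: int):
        """Ending position of the MUS starting at start, or None."""
        return self.end_by_start[start % self.d]

    def start_of_mus_ending_at(self, end: int):
        """Starting position of the MUS ending at end, or None."""
        return self.start_by_end[end % self.d]


# The window T[2 .. 11] of length d = 10 with the four MUSs of Fig. 12.
table = MusTable(10)
for s, t in [(2, 4), (4, 8), (8, 10), (9, 11)]:
    table.add(s, t)
print(table.end_of_mus_starting_at(4), table.start_of_mus_ending_at(11))
```

Since MUSs cannot be nested, at most one MUS starts and at most one MUS ends at any position, so a single entry per position is enough in each array.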

Conclusions and Future Work
In this paper, we studied the problem of computing MUSs for a sliding window over a given string T of length n. We first showed combinatorial properties of MUSs for a sliding window, i.e., the number of changes to the set of MUSs is at most constant when appending a character to the right end of the window or deleting the first character from the window. Also, we proposed an O(n log σ′)-time and O(d)-space algorithm to compute MUSs for a sliding window of size d over T, where σ′ ≤ d is the maximum number of distinct characters in every window. As future work, we are interested in developing a data structure for the SUS problems for a sliding window. As we described in the introduction, MUSs are heavily utilized for solving the SUS problems. Our sliding window MUS algorithm could be used as a basis for an efficient SUS query data structure for a sliding window. Also, it would be interesting to extend or generalize MUSs for a sliding window, e.g., to computing MUSs with k-mismatches for a sliding window. A substring of T is said to be unique with k-mismatches in T if it is unique in T even when substituting arbitrary k characters of the substring. To the best of our knowledge, only one deterministic algorithm to compute unique substrings with k-mismatches is known [10], and it runs in O(n²) time for any k ≥ 1 in an offline manner. An interesting open question is: Can we design an online deterministic algorithm which computes MUSs with k-mismatches in sub-quadratic time?