Efficient enumeration of nonequivalent squares in partial words with few holes
Abstract
Keywords
Partial word · Square in a word · Approximate period · Lyndon word
1 Introduction
A word is a sequence of letters from a given alphabet \(\varSigma \). By \(\varSigma ^*\) we denote the set of all words over \(\varSigma \). A word of the form \(U^2=UU\), for some word U, is called a square. For a word W, a factor is a subword composed of some number of consecutive letters and a square factor is a factor of W which is a square. Enumeration of square factors in words is a well-studied topic, both from a combinatorial and from an algorithmic perspective. A word W of length n may contain \(\varTheta (n^2)\) occurrences of square factors (e.g. \(W=a^n\)); however, it is known that such a word contains only \(\mathcal {O}(n)\) distinct square factors (Fraenkel and Simpson 1998; Ilie 2005). Currently the best known upper bound is \(\frac{11}{6}n\) (Deza et al. 2015).
Moreover, all distinct square factors of a word over an integer alphabet can be listed in \(\mathcal {O}(n)\) time using the suffix tree (Gusfield and Stoye 2004; Bannai et al. 2017) or the suffix array and the structure of runs (maximal repetitions) in the word (Crochemore et al. 2014).
A partial word is a sequence of letters from \(\varSigma \cup \{\lozenge \}\), where \(\lozenge \) denotes a hole, that is, a don’t care symbol. Two symbols \(a,b \in \varSigma \cup \{\lozenge \}\) are said to match (denoted as \(a \approx b\)) if they are equal or one of them is a hole; note that this relation is not transitive. The relation of matching is extended in a natural way to partial words of the same length.
A partial word UV is called a psquare if \(U \approx V\). As in the case of words, a psquare factor of a partial word T is a factor of T that is a psquare; see Blanchet-Sadri et al. (2014b, 2015).
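The matching relation and the psquare test can be sketched in a few lines of Python. This is only an illustration: the hole is written as `?` (an assumption of the sketch, not the paper's notation), and the function names are hypothetical.

```python
HOLE = "?"  # stand-in for the hole symbol; an assumption of this sketch


def symbols_match(a, b):
    """Two symbols match if they are equal or at least one of them is a hole."""
    return a == b or a == HOLE or b == HOLE


def words_match(u, v):
    """Equal-length partial words match if they match position by position."""
    return len(u) == len(v) and all(symbols_match(a, b) for a, b in zip(u, v))


def is_psquare(x):
    """A partial word UV with |U| = |V| is a psquare if U matches V."""
    half, odd = divmod(len(x), 2)
    return odd == 0 and words_match(x[:half], x[half:])
```

Non-transitivity shows up directly: `a` matches `?` and `?` matches `b`, yet `a` does not match `b`.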
Example 1.1
Example 1.2

\( repr (a\lozenge b\ a\lozenge \lozenge )=(a\lozenge b)^2\) and the psquare is ambiguous.

\( repr (a\lozenge \lozenge \ \lozenge a b)=(aab)^2\) and the psquare is unambiguous.
The set of nonequivalent psquare factors in a partial word T is denoted by \( psquares (T)\). Thus, \( psquares (T)\) corresponds to the set of different representatives of psquare factors of T.
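The two representatives above can be reproduced with a small sketch. It assumes that the representative of a psquare UV is obtained by combining U and V position by position, a hole being replaced by the symbol it is matched with; the names `combine`, `repr_root` and `is_ambiguous` are hypothetical, and `?` stands for the hole.

```python
HOLE = "?"  # stand-in for the hole symbol; an assumption of this sketch


def combine(a, b):
    """Combine two symbols: a hole yields the other symbol; None on a mismatch."""
    if a == HOLE:
        return b
    if b == HOLE or a == b:
        return a
    return None


def repr_root(x):
    """Return R such that the representative of x is R^2, or None if x is not a psquare."""
    if len(x) % 2:
        return None
    d = len(x) // 2
    root = [combine(a, b) for a, b in zip(x[:d], x[d:])]
    return None if None in root else "".join(root)


def is_ambiguous(x):
    """A psquare is ambiguous if its representative still contains a hole."""
    root = repr_root(x)
    return root is not None and HOLE in root
```

Under these assumptions, `repr_root("a?ba??")` gives `a?b` (ambiguous) and `repr_root("a???ab")` gives `aab` (unambiguous).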
Example 1.3
Let \(T\,=\,ab \lozenge \lozenge ba \lozenge aaba\lozenge b\). The psquare factors of T of length 4 fall into four equivalence classes:
1. \(a\lozenge aa\) with representative \((aa)^2\),
2. \(ab\lozenge \lozenge \equiv \lozenge ba\lozenge \equiv aba\lozenge \) with representative \((ab)^2\),
3. \(\lozenge \lozenge ba \equiv ba\lozenge a\) with representative \((ba)^2\), and
4. \(b \lozenge \lozenge b\) with representative \((bb)^2\).
Our work is devoted to enumeration of nonequivalent psquare factors in a partial word with a given number \(k>0\) of holes.
Previous results Following Blanchet-Sadri et al. (2009, 2014b, 2015), we define a solid square as a square of a word and a square subword of a partial word T as a solid square that matches a factor of T.
Previous studies on squares in partial words were mostly focused on combinatorics. They started with the case of \(k=1\) (Blanchet-Sadri et al. 2009), in which case distinct square subwords correspond to nonequivalent psquare factors. It was shown that a partial word with one hole contains at most \(\frac{7}{2}n\) distinct square subwords (Blanchet-Sadri and Mercaş 2009) (3n for binary partial words; Halava et al. 2010). Also a generalization of the three squares lemma (see Crochemore and Rytter 1995) was proposed for partial words (Blanchet-Sadri and Mercaş 2012). As for a larger number of holes, the existing literature is devoted mainly to counting the number of distinct square subwords of a partial word (Blanchet-Sadri et al. 2009, 2015) or all occurrences of psquare factors (Blanchet-Sadri et al. 2014a, 2015). On the algorithmic side, Manea and Tiseanu (2010) proved that the problem of counting distinct square subwords of a partial word is #P-complete, and Diaconu et al. (2009), Manea et al. (2014), and Blanchet-Sadri et al. (2014b) showed quadratic-time and nearly-quadratic-time algorithms for finding all occurrences of psquare factors and primitively-rooted psquare factors of a partial word, respectively.
Our algorithmic results We present an algorithm that reports all elements of the set \( psquares (T)\) for a partial word T of length n with k holes in \(\mathcal {O}(nk^3)\) time. In particular, our algorithm runs in linear time for \(k=\mathcal {O}(1)\) and its time complexity nearly matches the maximum number of nonequivalent psquare factors. We also show an \(\mathcal {O}(n)\)-time algorithm that reports all nonequivalent psquares of a given length. The algorithms assume an integer alphabet \(\varSigma \subseteq \{1,\ldots ,n^{\mathcal {O}(1)}\}\). We use advanced data structures recently introduced by Kociumaka (2016).
Comparison with the conference version The paper is an extended version of Charalampopoulos et al. (2017). Let \(\mathrm {PSQUARES}_k(n)\), \(\mathrm {ASQUARES}_k(n)\), and \(\mathrm {USQUARES}_k(n)\) be the maximum number of nonequivalent psquares, ambiguous psquares, and unambiguous psquares, respectively, in T over all partial words T of length n with at most k holes. As far as combinatorics of psquares is concerned, the conference version of the paper derived the bound \(\mathrm {PSQUARES}_k(n) = \varTheta (\min (n^2,nk^2))\). The bound was proved by showing that \(\mathrm {ASQUARES}_k(n) = \varTheta (\min (n^2,nk^2))\) and that \(\mathrm {USQUARES}_k(n) = \mathcal {O}(nk^2)\). As a new contribution here, we present a tight estimation \(\mathrm {USQUARES}_k(n) = \varTheta (nk)\). This lets us identify ambiguous psquares as the ones that attain the bound on \(\mathrm {PSQUARES}_k(n)\). On the algorithmic side, Charalampopoulos et al. (2017) presented an algorithm computing the set \( psquares (T)\) in \(\mathcal {O}(nk^3)\) time. Here the readability of the algorithm has been considerably improved; we also show a linear-time algorithm that reports all nonequivalent psquares of a specified length.
Structure of the paper After the Preliminaries comes the algorithmic part of the paper, which is followed by the combinatorial part. In Sect. 3 we show an \(\mathcal {O}(n)\)-time algorithm that reports all nonequivalent psquares of a specified length and, as an immediate corollary, \(\mathcal {O}(nk^2)\)-time computation of all nonequivalent ambiguous psquares. Then in Sect. 4 we give an \(\mathcal {O}(nk^3)\)-time algorithm for computing all nonequivalent unambiguous psquares. Asymptotic bounds for ambiguous psquares and unambiguous psquares are presented in Sects. 5 and 6, respectively.
2 Preliminaries
For a word \(W \in \varSigma ^*\), by \(|W|=n\) we denote the length of W, and by \(W_i\), for \(i=1,\ldots ,n\), the i-th letter of W. For \(1 \le i \le j \le n\), by [i..j] and (i..j] we denote integer intervals \(\{i,\ldots ,j\}\) and \(\{i+1,\ldots ,j\}\), respectively. W[i..j] denotes the factor of W equal to \(W_i \cdots W_j\); we also use the notation W[I], where I is an integer interval. A factor of the form W[1..j] is called a prefix, a factor of the form W[i..n] is called a suffix.
For a partial word T we use the same notation as for words: \(|T|=n\) for its length, \(T_{i}\) for the i-th letter, T[i..j] for a factor. If T does not contain holes, then it is called solid. The relation \(\approx \) of matching on \(\varSigma \cup \{\lozenge \}\) is defined as: \(a \approx a\), \(\lozenge \approx a\), and \(a \approx \lozenge \) for all \(a \in \varSigma \cup \{\lozenge \}\).
If \(U \approx T[i..i+|U|-1]\) for a partial word U, then we say that U occurs in T at position i.
Two equal-length partial words U and V are called cyclic shifts if there are partial words X, Y such that \(U=XY\) and \(V=YX\). We denote this as \( rot (U,|X|)=V\), where \(|X|\) is the shift value.
For a partial word X, by \(\#_{\lozenge }(X)\) we denote the number of holes in X. For \(1 \le i \le n\) and \(0 \le q \le \log n\), we denote \(T_{i,q}=T[i..\min (n,i+2^q-1)]\). We say that \(T_{i,q}\) is a q-basic factor of the partial word T. In other words, q-basic factors are factors of T of length \(2^q\) and suffixes of T of length at most \(2^q\). By \({\mathcal B}(T)\) we denote the set of all basic factors of T.
Lemma 2.1
Proof
We say that a psquare is an unambiguous psquare (usquare) if its representative is solid and an ambiguous psquare (asquare) otherwise. By \( asquares (T)\) and \( usquares (T)\) we denote the sets of nonequivalent factors of T being asquares and usquares, respectively. Obviously:
Observation 2.2
\( psquares (T)= asquares (T) \cup usquares (T)\).
2.1 Periods in solid and partial words
A positive integer q is called a period of a word W of length n if \(W_i=W_{i+q}\) for all \(i=1,\ldots ,n-q\). In this case, W[1..q] is called a string period of W. A word W is called periodic if it has a period q such that \(2q \le |W|\).
A quantum period of a partial word T is a positive integer q such that \(T_{i} \approx T_{i+q}\) for all \(i=1,\ldots ,n-q\). A deterministic period of T is an integer q such that there exists a word W such that \(W \approx T\) and W has a period q.
The partial word T is called quantum (deterministically) periodic if it has a quantum (deterministic) period q such that \(2q \le n\).
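The difference between the two notions of a period can be made concrete with a short sketch. It relies on the observation that q is a deterministic period exactly when, in every residue class of positions modulo q, all non-hole letters agree (the holes of that class can then be filled with the common letter). The function names are hypothetical and `?` stands for the hole.

```python
HOLE = "?"  # stand-in for the hole symbol; an assumption of this sketch


def is_quantum_period(t, q):
    """q is a quantum period if T_i matches T_{i+q} for every valid i."""
    return q > 0 and all(
        t[i] == t[i + q] or HOLE in (t[i], t[i + q]) for i in range(len(t) - q)
    )


def is_deterministic_period(t, q):
    """q is a deterministic period if some solid word matching t has period q."""
    if q <= 0:
        return False
    for r in range(q):
        letters = {c for c in t[r::q] if c != HOLE}
        if len(letters) > 1:  # two different letters in one residue class
            return False
    return True
```

For example, `a?b` has quantum period 1 but no deterministic period 1, because the single hole would have to be both `a` and `b`.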
Lemma 2.3
(a) If \(i \in \mathsf {Mis}_\delta (U)\), then \(i \in \mathsf {Mis}_\delta (V)\) or \(i \in \mathsf {Holes}(V)\) or \(i-\delta \in \mathsf {Holes}(V)\).
(b) \(|\mathsf {Mis}_\delta (U)| \le |\mathsf {Mis}_\delta (V)| + 2|\mathsf {Holes}(V)|\).
(c) If \(\delta \ge \frac{1}{2}|U|\), then \(|\mathsf {Mis}_\delta (U)| \le |\mathsf {Mis}_\delta (V)| + |\mathsf {Holes}(V)|\).
Proof
(a): We have \(V_{i-\delta } \approx U_{i-\delta } \not \approx U_i \approx V_i\). This means that \(U_{i-\delta },U_i \in \varSigma \). Hence, if \(i \not \in \mathsf {Holes}(V)\) and \(i-\delta \not \in \mathsf {Holes}(V)\), then \(V_{i-\delta } = U_{i-\delta }\) and \(V_i = U_i\), so \(i \in \mathsf {Mis}_\delta (V)\).
Point (b) follows from point (a). Also point (c) follows from point (a). Indeed, if \(i \in \mathsf {Mis}_\delta (U)\) in this case, then for each of the positions i, \(i-\delta \) in V, if it contains a hole, then it is counted only for the index i. \(\square \)
3 Computing all psquares of specified length and nonequivalent ambiguous psquares
In this section we develop an \(\mathcal {O}(n)\)-time algorithm that enumerates all nonequivalent psquares of half length d in a partial word T of length n. As a corollary, we obtain a simple computation of all nonequivalent ambiguous psquares in optimal time.
For a partial word T, we denote by \(T'\) a partial word of length \(n-d\) such that \(T'[i]=T[i] \odot T[i+d]\) for each \(i=1,\ldots ,n-d\). If \(T[i] \odot T[i+d]\) is undefined (since \(T[i] \not \approx T[i+d]\)), we set the value to a symbol \(\# \not \in \varSigma \).
Observation 3.1
(a) \(T[i..i+2d-1]\) is a psquare if and only if \(T'[i..i+d-1]\) does not contain the symbol \(\#\).
(b) If \(T[i..i+2d-1]\) is a psquare, then \( repr (T[i..i+2d-1]) = (T'[i..i+d-1])^2\).
Proof
(a) If \(T'[i..i+d-1]\) contains the symbol \(\#\), this means that \(T[j] \not \approx T[j+d]\) for some \(j \in [i..i+d-1]\). Hence, \(T[i..i+2d-1]\) is not a psquare. Otherwise, \(T[j] \approx T[j+d]\) for all \(j \in [i..i+d-1]\). Hence, \(T[i..i+2d-1]\) indeed is a psquare.
(b) If \(T[i..i+2d-1]\) is a psquare, then \( repr (T[i..i+2d-1]) = (T[i..i+d-1] \odot T[i+d..i+2d-1])^2 = (T'[i..i+d-1])^2\) by the definition of \(T'\).
Example 3.2
Theorem 3.3
All nonequivalent psquares of half length d in a partial word of length n can be reported (as factors of the partial word) in \(\mathcal {O}(n)\) time.
Proof
Let T be a partial word of length n. In \(\mathcal {O}(n)\) time we compute \(T'\). Let \(S_1,\ldots ,S_q\) be a partition of \(T'\) into maximal factors that do not contain the symbol \(\#\). By Observation 3.1, our task is equivalent to reporting all distinct factors of length d of the partial words \(S_j\). This can be performed by listing all nodes (implicit and explicit) at depth d in the generalized suffix tree \(\mathcal {T}\) of \(S_1,\ldots ,S_q\), that is, in the suffix tree of \(S_1 \#_1 \ldots S_q \#_q\), where \(\#_1,\ldots ,\#_q \not \in \varSigma \) are distinct symbols. For details, see Gusfield (1997). As the suffix tree of a word of length n can be constructed in \(\mathcal {O}(n)\) time (Farach 1997), the whole algorithm works in \(\mathcal {O}(n)\) time. \(\square \)
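The reduction in the proof can be illustrated with a simplified sketch. It builds \(T'\) exactly as defined above, but then collects the distinct \(\#\)-free windows of length d with a hash set instead of a generalized suffix tree, so it runs in \(\mathcal {O}(nd)\) expected time rather than \(\mathcal {O}(n)\). The names are hypothetical, `?` stands for the hole, and `#` is assumed not to occur in the alphabet.

```python
HOLE = "?"  # stand-in for the hole symbol; an assumption of this sketch
BAD = "#"   # marks positions where T[i] and T[i+d] do not match


def combine(a, b):
    """Combine two symbols; BAD if they do not match."""
    if a == HOLE:
        return b
    if b == HOLE or a == b:
        return a
    return BAD


def t_prime(t, d):
    """The auxiliary word T' of length n - d."""
    return "".join(combine(t[i], t[i + d]) for i in range(len(t) - d))


def psquares_of_half_length(t, d):
    """Map each root R of a representative R^2 to a 0-based occurrence in t."""
    tp = t_prime(t, d)
    found = {}
    for i in range(len(tp) - d + 1):
        root = tp[i:i + d]
        if BAD not in root:
            found.setdefault(root, i)  # keep the first occurrence only
    return found
```

On the partial word of Example 1.3 with d = 2, the sketch returns the four roots `ab`, `bb`, `ba` and `aa`, in agreement with the four classes listed there.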
As a corollary we obtain efficient computation of nonequivalent asquares.
Theorem 3.4
For a partial word T of length n with k holes, all elements of the set \( asquares (T)\) can be reported in \(\mathcal {O}(nk^2)\) time.
Proof
There are at most \(k^2\) possible lengths of ambiguous psquares. For each length we use the algorithm of Theorem 3.3 to report all nonequivalent psquares. This takes \(\mathcal {O}(nk^2)\) time. In the end, for each length we need to filter out unambiguous psquares. For a specified half length d, it suffices to check, for each psquare \(T[i..i+2d-1]\) found, if \(T'[i..i+d-1]\) contains a hole. This condition can be checked in \(\mathcal {O}(1)\) time if the prefix sums of the sequence \(a_i=[T'[i]={\lozenge }]\) are stored. \(\square \)
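The filtering step can be sketched as follows. Candidate half lengths are taken to be the distances between pairs of holes (a hole in \(T'\) arises only from two holes of T at distance d), and prefix sums over \(T'\) answer "contains \(\#\)" and "contains a hole" for any window in constant time. As in the previous sketch, a hash set replaces the suffix tree, all names are hypothetical, and `?` stands for the hole.

```python
from itertools import accumulate, combinations

HOLE = "?"  # stand-in for the hole symbol; an assumption of this sketch
BAD = "#"   # marks mismatching positions in T'


def combine(a, b):
    if a == HOLE:
        return b
    if b == HOLE or a == b:
        return a
    return BAD


def asquare_roots(t):
    """Roots R (containing a hole) of representatives R^2 of ambiguous psquares."""
    holes = [i for i, c in enumerate(t) if c == HOLE]
    half_lengths = sorted({j - i for i, j in combinations(holes, 2)})
    roots = set()
    for d in half_lengths:
        tp = [combine(t[i], t[i + d]) for i in range(len(t) - d)]
        # Prefix sums: bad[i] / hol[i] count BAD / HOLE symbols in tp[:i].
        bad = [0] + list(accumulate(int(c == BAD) for c in tp))
        hol = [0] + list(accumulate(int(c == HOLE) for c in tp))
        for i in range(len(tp) - d + 1):
            if bad[i + d] == bad[i] and hol[i + d] > hol[i]:
                roots.add("".join(tp[i:i + d]))
    return roots
```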
4 Computing all nonequivalent unambiguous psquares
We start the description of the algorithm by an abstract lemma that lets us efficiently generate all distinct squares induced by a special family of (solid) words.
4.1 Computing squares induced by a family of words
For a word S, we define its primitive root U as the shortest word such that \(U^k = S\) for some integer \(k \ge 1\). The Lyndon root \(\lambda \) of a word U is the minimal cyclic shift of the shortest string period of U. The notion of a Lyndon root was introduced in the context of runs by Crochemore et al. (2014).
Example 4.1
The Lyndon root of \(U=abaababaababa\) is aabab. The word U is periodic and its shortest period is 5.
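The definitions of the primitive root and the Lyndon root can be checked on this example with a brute-force sketch. It is quadratic and meant for illustration only; the function names are hypothetical and the input is assumed to be a nonempty solid word.

```python
def shortest_period(w):
    """Smallest q >= 1 with w[i] == w[i+q] for all valid i (w nonempty)."""
    n = len(w)
    return next(
        q for q in range(1, n + 1) if all(w[i] == w[i + q] for i in range(n - q))
    )


def minimal_cyclic_shift(u):
    """Lexicographically smallest rotation of u (u nonempty)."""
    return min(u[i:] + u[:i] for i in range(len(u)))


def primitive_root(s):
    """Shortest word u with u^k == s for some k >= 1 (s nonempty)."""
    n = len(s)
    return next(
        s[:q] for q in range(1, n + 1) if n % q == 0 and s[:q] * (n // q) == s
    )


def lyndon_root(w):
    """Minimal cyclic shift of the shortest string period of w."""
    return minimal_cyclic_shift(w[:shortest_period(w)])
```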
For a word W and its period q, by \( squares (W,q)\) we denote the set of square factors of W of length 2q. We say that \( squares (W,q)\) is the set of squares induced by the word W with the period q. Each square factor in \( squares (W,q)\) can be represented in \(\mathcal {O}(1)\) space by specifying its occurrence in W.
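A direct, non-optimized sketch of \( squares (W,q)\) follows. Since q is a period of W, every factor of length 2q is a square, so it suffices to slide a window; each square is stored with one occurrence, mirroring the \(\mathcal {O}(1)\)-space representation. The function name is hypothetical.

```python
def squares_induced(w, q):
    """Map each square factor of length 2q of w to a 0-based occurrence."""
    n = len(w)
    if not (q > 0 and all(w[i] == w[i + q] for i in range(n - q))):
        raise ValueError("q is not a period of w")
    occurrences = {}
    for i in range(n - 2 * q + 1):
        occurrences.setdefault(w[i:i + 2 * q], i)
    return occurrences
```

For the word of Example 4.1 with q = 5, the sketch yields four distinct squares, one for each of four cyclic shifts of the string period.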
Lemma 4.2
(1) \(n_i\le n\) is the length of \(W_i\) and \(2q_i \le n_i\);
(2) all the words \(W_i\) for which \(2q_i = n_i\) (so-called short words) are distinct;
(3) for a given \(q_i\), the number of words \(W_i\) for which \(2q_i < n_i\) (so-called long words) is at most k;
(4) \( first _i\) is the starting position of the first occurrence of the Lyndon root \(\lambda _i\) of \(W_i\) in \(W_i\) and \(\ell _i\) is its length;
(5) any two Lyndon roots \(\lambda _i\), \(\lambda _j\) can be compared in \(\mathcal {O}(k)\) time.
Proof
Let us start with the following observation; see also Fig. 2. The same type of observation was used by Crochemore et al. (2014).
Observation 4.3
We make two transformations of the set of intervals \(I_i\) so that, in the end, each square from the set SQ is induced by exactly one word \(W_i\) with period \(q_i\). If any of the intervals is made empty, this corresponds to removing the word as unnecessary. The first transformation deals with the long words \(W_i\); by definition, at most k of them share the same period \(q_i\).
First transformation For every pair \(W_i,q_i\) and \(W_j,q_j\) of long words such that \(i \ne j\) and \(q_i=q_j\), we check if \(\lambda _i=\lambda _j\). If so, we proceed as follows. If \(I_i \subseteq I_j\), we dispose of \(W_i\). Likewise, if \(I_j \subseteq I_i\), we remove \(W_j\). If neither of the two cases holds and still \(I_i \cap I_j \ne \emptyset \), we trim \(I_j\) to make it disjoint from \(I_i\).
Complexity All long words can be sorted by their periods in \(\mathcal {O}(N+n)\) time by bucket sort. There are n/2 buckets and each bucket contains at most k words. For each of the \(k(k-1)/2\) pairs of long words in a bucket, we check equality of their Lyndon roots, which takes \(\mathcal {O}(k)\) time per pair and \(\mathcal {O}(nk^3)\) time overall. The time complexity of trimming of cyclic intervals is dominated by this step.
Second transformation For every short word \(W_i\) with period \(q_i\) and long word \(W_j\) with period \(q_j=q_i\), we check if \(\lambda _i=\lambda _j\). If so and \(I_i \subseteq I_j\), we remove \(W_i\). Note that \(I_i\) is a singleton.
Complexity All words can be sorted by their periods in \(\mathcal {O}(N+n)\) time by bucket sort. For each short word \(W_i\), we need to inspect at most k long words and check if their Lyndon roots are equal. This takes \(\mathcal {O}(k^2)\) time per short word, \(\mathcal {O}(Nk^2)\) time overall. Checking inclusion of elements in cyclic intervals is dominated by this step.
The two transformations take \(\mathcal {O}(Nk^2+nk^3)\) time in total. Afterwards each square is induced by exactly one interval \(I_i\) for a word \(W_i\) and period \(q_i\), so we can list all the distinct squares in \(\mathcal {O}(|SQ|)\) time. \(\square \)
For a partial word T, by \( ssquares (T)\) we denote the set of distinct solid factors of T being squares. The following fact was already mentioned in Sect. 1.
Fact 4.4
(Bannai et al. 2017; Crochemore et al. 2014; Gusfield and Stoye 2004) All distinct squares in a word of length n can be computed in \(\mathcal {O}(n)\) time.
By substituting all holes in a partial word with distinct symbols \(\#_1,\ldots ,\#_k\), we obtain the following corollary.
Corollary 4.5
For a partial word T of length n, the set \( ssquares (T)\) can be computed in \(\mathcal {O}(n)\) time.
The algorithm of Crochemore et al. (2014) actually computes the set \( ssquares (T)\) together with all the data required by the assumptions of Lemma 4.2. These are the short words in the construction.
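The substitution trick behind Corollary 4.5 can be illustrated with a brute-force sketch: every hole becomes a unique token, so no square of the resulting solid word can contain a hole. The enumeration below is cubic and only stands in for the linear-time algorithms cited in Fact 4.4; the function name is hypothetical and `?` stands for the hole.

```python
HOLE = "?"  # stand-in for the hole symbol; an assumption of this sketch


def ssquares(t):
    """Distinct solid square factors of the partial word t (brute force)."""
    # Replace the hole at position i by the unique token ("#", i).
    solid = [("#", i) if c == HOLE else c for i, c in enumerate(t)]
    n = len(solid)
    result = set()
    for i in range(n):
        for d in range(1, (n - i) // 2 + 1):
            if solid[i:i + d] == solid[i + d:i + 2 * d]:
                result.add(t[i:i + 2 * d])  # hole-free by construction
    return result
```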
In the following section we construct a family \({\mathcal F}\) of words (called sealed fragments) that represent the usquares that contain a hole and compute for them the data required in Lemma 4.2. These are the long words in the construction. Afterwards we list all distinct representatives of usquares using Lemma 4.2. Then nonequivalent usquares are extracted from their representatives.
4.2 Computing a special family of sealed fragments
If T is a partial word, then U is a sealed fragment of T if U is a factor of T with holes substituted by solid symbols. By \( unseal (U)\) we denote the original factor of the partial word.
A sealed fragment is always solid. Obviously, a sealed fragment can be represented in space proportional to the number of holes that were substituted. For example, if \(T[i..i+2q1]\) is a usquare, then \( repr (T[i..i+2q1])\) is a sealed fragment.
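One possible representation of a sealed fragment, given here only as a sketch, stores the boundaries of the factor together with the letters substituted for its holes, which takes space proportional to the number of sealed holes. The class and field names are hypothetical, positions are 0-based, and `?` stands for the hole.

```python
from dataclasses import dataclass

HOLE = "?"  # stand-in for the hole symbol; an assumption of this sketch


@dataclass(frozen=True)
class SealedFragment:
    start: int    # 0-based, inclusive
    end: int      # 0-based, exclusive
    fills: tuple  # pairs (position in T, letter) for every hole in the factor

    def unseal(self, t):
        """The original factor of the partial word t."""
        return t[self.start:self.end]

    def solid(self, t):
        """The solid word obtained by substituting the stored letters."""
        chars = list(t[self.start:self.end])
        for pos, letter in self.fills:
            if t[pos] != HOLE:
                raise ValueError("only holes can be sealed")
            chars[pos - self.start] = letter
        if HOLE in chars:
            raise ValueError("every hole of the factor must be sealed")
        return "".join(chars)
```

For the partial word of Example 1.3, sealing the factor \(\lozenge ba\lozenge \) with the letters a and b yields its representative \((ab)^2\).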
If W is a (solid) word, then by a dfragment we mean a concatenation of d factors \(W[i_1..j_1] \dots W[i_d..j_d]\). A dfragment can be represented in \(\mathcal {O}(d)\) space. Kociumaka (2016) showed that several types of operations on dfragments can be performed in \(\mathcal {O}(d)\) or \(\mathcal {O}(d^2)\) time after \(\mathcal {O}(n)\)-time preprocessing. We notice here that a sealed fragment of a partial word T with k holes corresponds to a dfragment with \(d=\mathcal {O}(k)\) in a word that corresponds to T where \({\lozenge }\) is treated as an alphabet symbol. Thus the following simple fact is a consequence of Observation 18 from Kociumaka (2016) that was stated in terms of dfragments.
Fact 4.6
(Kociumaka 2016) For a partial word of length n with k holes, after \(\mathcal {O}(n)\)-time preprocessing, the length of the longest common prefix (or suffix) of any two sealed fragments can be computed in \(\mathcal {O}(k)\) time. In particular, equality of sealed fragments can be checked within the same time complexity.
Definition 4.7
(a) For every i, \(q_i\) is a period of \(W_i\) and \(|W_i| \ge 2q_i\).
(b) For every i, there are no two holes in \( unseal (W_i)\) at distance \(q_i\).
(c) For every \(q=1,\ldots ,n\), there are \(\mathcal {O}(k)\) sealed fragments with \(q_i=q\).
(d) If X is a nonsolid usquare in T, then X is a factor of \( unseal (W_i)\) for some \(W_i\) with \(q_i =\frac{1}{2}|X|\).
The size of an Sfamily follows from point (c).
Observation 4.8
An Sfamily contains \(\mathcal {O}(nk)\) elements and thus can be represented in \(\mathcal {O}(nk^2)\) space.
In the following lemma we provide an algorithm for constructing an Sfamily. Our approach resembles computing anchored squares in the Main-Lorentz algorithm (Main and Lorentz 1984).
Lemma 4.9
For a partial word T of length n with k holes, an Sfamily can be computed in \(\mathcal {O}(nk^2)\) time.
Proof
Each nonsolid usquare X contains a hole in the first half or in the second half. Below, we construct an Sfamily for usquares containing a hole in the second half. A symmetric procedure deals with the usquares containing a hole in the first half.
For a hole h and integer q, we define the family \(\mathcal {S}(q,h)\) of usquares of length 2q, which contain h as the leftmost hole in the second half. For each nonempty set \(\mathcal {S}(q,h)\), we shall construct a sealed fragment W with period q so that each usquare \(X\in \mathcal {S}(q,h)\) is a factor of \( unseal (W)\).
The fragment W is unique and it can be retrieved in \(\mathcal {O}(k)\) time using Fact 4.6. Indeed, it suffices to compute the longest common prefix P of \(\overline{T}[h-q..n]\) and \(\overline{T}[h..n]\), the longest common suffix S of \(\overline{T}[1..h-q]\) and \(\overline{T}[1..h]\), and take the possibly trimmed fragment \(S\,\overline{T}[h-q+1..h-1]\,P\); see Fig. 4. We may need to trim S so that its length exceeds neither \(q-1\) (so that the hole at position h is contained in the right half of the square) nor \(z-1\) (so that h is the leftmost hole in the right half). Similarly, we may need to trim P to the length \(q-1\). In total, the construction takes \(\mathcal {O}(nk^2)\) time.
Let us verify that this construction indeed satisfies the conditions of Definition 4.7. For each q and each hole h we construct at most one sealed fragment, so condition (c) is satisfied. Clearly, W has period q and \(|W| \ge 2q\), which yields point (a). Moreover, if \(X=T[i..j]\in \mathcal {S}(q,h)\), then \( repr (X)=\overline{T}[i..j]\), so (by maximality) \( repr (X)\) is contained in W, and X is contained in \( unseal (W)\). This gives point (d). Finally, we shall prove that \( unseal (W)\) does not contain two holes at distance q (condition (b)). Suppose that the holes are at positions i and \(i+q\). Observe that one of the holes is sealed with a unique marker, which contradicts \(\overline{T}[i]=\overline{T}[i+q]\). This completes the proof. \(\square \)
Example 4.10
Henceforth we denote by \({\mathcal F}\) the Sfamily constructed in Lemma 4.9. In order to transform it into an instance of Lemma 4.2, we need to compute the Lyndon roots of the sealed fragments \(W_i\) (that is, the values \( first _i\) and \(\ell _i\)).
4.3 Lyndon roots of sealed fragments
We will show how to compute Lyndon roots \(\lambda _i\) of sealed fragments \((W_i,q_i) \in {\mathcal F}\). Obviously, a Lyndon root of a sealed fragment can be represented in the same space complexity as the sealed fragment itself.
Let us start with the following fact that encapsulates Theorems 20 and 23 from Kociumaka (2016).
Fact 4.11
 (a)
the length of the lexicographically minimal suffix of a dfragment can be computed in \(\mathcal {O}(d^2)\) time;
 (b)
the shift value of the minimal cyclic shift of a dfragment can be computed in \(\mathcal {O}(d^2)\) time.
As a consequence of Fact 4.11(a) we obtain:
Observation 4.12
For a word of length n, after \(\mathcal {O}(n)\)-time preprocessing, the length of the lexicographically maximal suffix of a dfragment can be computed in \(\mathcal {O}(d^2)\) time.
Proof
To compute the maximal suffix instead of the minimal suffix, we reverse the lexicographic order on the alphabet and append to the dfragment in question a letter that is greater than all the letters from \(\varSigma \). \(\square \)
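The reduction can be checked by brute force on plain words. The sketch below makes one interpretive assumption that should be kept in mind: the appended sentinel is taken to be the greatest letter with respect to the reversed order, which makes a longer suffix smaller than any of its proper prefixes. All names are hypothetical.

```python
from itertools import product


def maximal_suffix_start(w):
    """0-based start of the lexicographically maximal suffix of nonempty w."""
    return max(range(len(w)), key=lambda i: w[i:])


def maximal_suffix_via_minimal(w):
    """Same value, obtained as a minimal suffix under the reversed order."""
    # Reversed order: letter c gets code -ord(c) <= 0; the sentinel gets code 1,
    # so it is greater than every letter in the reversed order (an assumption).
    codes = [-ord(c) for c in w] + [1]
    return min(range(len(codes)), key=lambda i: codes[i:])


def agree_on_all_words(alphabet, max_len):
    """Exhaustively compare both functions on all nonempty words up to max_len."""
    return all(
        maximal_suffix_start("".join(p)) == maximal_suffix_via_minimal("".join(p))
        for n in range(1, max_len + 1)
        for p in product(alphabet, repeat=n)
    )
```

Without the sentinel the two answers can differ, because under the usual convention a proper prefix is smaller than the word it prefixes in both orders.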
Fact 4.11(a) and Observation 4.12 provide us with the following toolbox for sealed fragments.
Lemma 4.13
 (a)
the length of the lexicographically maximal suffix of a sealed fragment can be computed in \(\mathcal {O}(k^2)\) time.
 (b)
the shift value of the minimal cyclic shift of a sealed fragment can be computed in \(\mathcal {O}(k^2)\) time.
Lemma 4.14
If W is a periodic sealed fragment and q is its period (not necessarily shortest) such that \(2q \le |W|\), then the length of the Lyndon root of W and its first occurrence in W can be computed in \(\mathcal {O}(k^2)\) time after \(\mathcal {O}(n)\)-time preprocessing.
Proof
We then use Lemma 4.13(a) to find the starting position \(i'\) of the maximal suffix of \(W[i+1..|W|]\) with the reversed lexicographic order of the alphabet. If \(W[i'..|W|]\) is a prefix of \(W[i..|W|]\), then \(\ell =i'-i\), and otherwise \(\ell =q\). We check this condition in \(\mathcal {O}(k)\) time using Fact 4.6. Finally, we return \(s \bmod \ell \) and \(\ell \); see Fig. 5. \(\square \)
By point (a) of the definition of an Sfamily we immediately obtain:
Corollary 4.15
The Lyndon roots of all sealed fragments \((W_i,q_i) \in {\mathcal F}\) can be computed in \(\mathcal {O}(nk^3)\) time after \(\mathcal {O}(n)\)-time preprocessing.
With this missing piece of the puzzle we are ready to conclude the algorithm for reporting all unambiguous psquare factors of a partial word.
Theorem 4.16
For a partial word T of length n with k holes, all elements of the set \( usquares (T)\) can be reported in \(\mathcal {O}(nk^3)\) time.
Proof
We construct a family of sealed fragments that consists of the solid psquares \( ssquares (T)\) and an Sfamily \({\mathcal F}\). By Corollary 4.5 and Lemma 4.9, this family can be constructed in \(\mathcal {O}(nk^2)\) time. We compute Lyndon roots of all the sealed fragments in \(\mathcal {O}(nk^3)\) time using Corollary 4.15. For each solid psquare we may compute its Lyndon root in \(\mathcal {O}(k^2)\) time using Lemma 4.14; we can also use the Lyndon roots as computed in Crochemore et al. (2014).
The constructed family satisfies the assumption of Lemma 4.2 with \(N=\mathcal {O}(nk)\). (Actually, if for any sealed fragment \((W_i,q_i)\) of the Sfamily \({\mathcal F}\) we have \(|W_i|=2q_i\), we need to check if it equals any of the solid squares of the same length and, if so, remove it, so that no two short words repeat.) This lemma lets us report all the distinct representatives of usquares in \(\mathcal {O}(nk^3+|SQ|)\) time. The total number of usquares that will be generated is \(\mathcal {O}(nk)\) due to Theorem 6.6. This gives the final complexity of the algorithm. \(\square \)
5 Combinatorial bounds for ambiguous psquares
Let T be a partial word of length n with k holes. The upper bound in the case of asquares is straightforward.
Theorem 5.1
If T is a partial word of length n with k holes, then \(| asquares (T)|=\mathcal {O}(nk^2)\).
Proof
The number of possible lengths of asquares is at most \(\binom{k}{2}\), since we have \(\binom{k}{2}\) possible distances between the k holes. Consequently, the number of psquares with such lengths is at most \(nk^2\). \(\square \)
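Spelled out, the counting argument reads as follows, where D (a symbol introduced here only for illustration) denotes the set of distinct distances between pairs of holes of T. An asquare of half length d needs two holes at distance d, and for a fixed length there are at most n starting positions:

```latex
\[
  |\mathit{asquares}(T)|
  \;\le\; |D| \cdot n
  \;\le\; \binom{k}{2}\, n
  \;=\; \frac{k(k-1)}{2}\, n
  \;\le\; n k^2 .
\]
```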
(1) For each \(d\ge m\), A contains at most one pair of elements with difference d;
(2) \(|\{\,j-i\ge m\;:\; i,j \in A\,\}|\,\ge \, t\).
Example 5.2

\(\{1,2,3,6,9,12\}\) is a (3, 9)cover.

\(\{1,2,3,11,14,17\}\) is a (8, 9)cover.
For a set \(A\subseteq [1..n]\) we denote by \(W_{A,n}\) the partial word of length n over the alphabet \(\varSigma \) such that \(W_{A,n}[i]=\lozenge \Leftrightarrow i\in A\), and \(W_{A,n}[i]=a\) otherwise.
Lemma 5.3
Proof
Each even-length factor of \(a^{n-2}\cdot W_{A,n}\cdot a^{n-2}\) is a psquare. Let \(\mathcal {Z}\) be the set of these factors X which contain two positions \(i<j\) containing holes with \(j-i\ge m\) and \(|X|=2(j-i)\). As A is an (m, t)cover, i and j are determined uniquely by \(d=j-i\). Then all elements of \(\mathcal {Z}\) are pairwise nonequivalent asquares. The size of \(\mathcal {Z}\) is \(\varOmega (mt)\) which is \(\varOmega (n\cdot k^2)\). \(\square \)
Theorem 5.4
For every positive integer n and \(k\le \sqrt{2n}\), there is a partial word of length n with k holes that contains \(\varOmega (nk^2)\) nonequivalent asquare factors.
Proof
Example 5.5
6 Combinatorial bounds for unambiguous psquares
The following theorem shows a lower bound construction. Afterwards we design an upper bound that asymptotically matches this lower bound.
Theorem 6.1
For all positive integers n and k with \(k \le \frac{1}{3} n\), there is a partial word of length n with k holes that contains \(\varOmega (nk)\) nonequivalent usquare factors.
Proof
If X is a partial word, then by \( LONG (X)\) we denote the set of all psquares of length at least \(\frac{1}{2}|X|\) which occur in X as a prefix.
Lemma 6.2
(Three pSquares Lemma) Let X be a partial word with k holes. Assume that the set \( LONG (X)\) contains at least three elements. Then \(\delta =\mathrm {mingap}( LONG (X))/2\) is a 12k-approximate quantum period of the longest psquare in \( LONG (X)\).
Proof
Let \(B,C \in LONG (X)\) be psquares such that \(|B|-|C|=2\delta \). Also let A and D be the longest and the shortest element of \( LONG (X)\), respectively. Let \(|A|=2a\), \(|B|=2b\), \(|C|=2c\), \(|D|=2d\). We aim to show that \(|\mathsf {Mis}_{\delta }(A)| \le 12k\). We consider two cases, depending on whether \(B \ne A\) or \(B = A\).
(I) \(m_1 \le k\):
Assume that \(i \in \mathsf {Mis}_\delta (A) \cap I_1\). Note that \(X_i\approx X_{i+c}\approx X_{i+c-b} = X_{i-\delta }\) due to psquares B and C, respectively. Hence, \(X_i\not \approx X_{i-\delta }\) may hold only if \(X_{i+c}=\lozenge \).
(II) \(m_3 \le k\):
Assume that \(i \in \mathsf {Mis}_\delta (A) \cap I_3\). Note that \(b < i \le b+c\). Hence, \(X_i\approx X_{i-b}\approx X_{i-b+c} = X_{i-\delta }\) due to psquares B and C, respectively. Consequently, \(X_i\not \approx X_{i-\delta }\) may hold only if \(X_{i-b}=\lozenge \).
(III) \(m_4 \le m_1+k\):
Assume that \(i \in \mathsf {Mis}_\delta (A) \cap I_4\). Note that \(a< i-\delta < i \le 2a\). Let \(J=(2c-a..2b-a]\). Note that \(X[I_4] \approx X[J]\) due to psquare A and that \(J \subseteq I_1\). We apply Lemma 2.3(c) to \(X[I_4]\) and X[J] to conclude.
(IV) \(m_2 \le m_4+k\):
Assume that \(i \in \mathsf {Mis}_\delta (A) \cap I_2\). Note that \(c-\delta< i-\delta < i \le b\). Note that \(X[I_2] \approx X[I_4]\) due to psquare B. We apply Lemma 2.3(c) to \(X[I_2]\) and \(X[I_4]\) to conclude.
(V) \(m_4+m_5 \le m_1+m_2+m_3+2k\):
Assume that \(i \in \mathsf {Mis}_\delta (A) \cap (I_4\cup I_5)\). Note that \(a< i-\delta < i \le 2a\). Let \(J=(2c-a..a]\). Note that \(X[I_4 \cup I_5] \approx X[J]\) due to psquare A and that \(J \subseteq I_1\cup I_2 \cup I_3\). We apply Lemma 2.3(b) to \(X[I_4 \cup I_5]\) and X[J] to conclude.
(I) \(m'_1 \le k\):
Assume that \(i \in \mathsf {Mis}_\delta (A) \cap I'_1\). Note that \(X_i\approx X_{i+c}\approx X_{i+c-b}=X_{i-\delta }\) due to psquares B and C, respectively. Hence, \(X_i\not \approx X_{i-\delta }\) may hold only if \(X_{i+c}=\lozenge \).
(II) \(m'_3 \le k\):
Assume that \(i \in \mathsf {Mis}_\delta (A) \cap I'_3\). Note that \(b< i < b+c\). Hence, \(X_i\approx X_{i-b}\approx X_{i-b+c}=X_{i-\delta }\) due to psquares B and C, respectively. Consequently, \(X_i\not \approx X_{i-\delta }\) may hold only if \(X_{i-b}=\lozenge \).
(III) \(m'_2 \le m'_1+k\):
Assume that \(i \in \mathsf {Mis}_\delta (A) \cap I'_2\). Note that \(d< c-\delta< i-\delta < i \le b \le 2d\). Let \(J=(c-\delta -d..b-d]\). Note that \(X[I'_2] \approx X[J]\) due to psquare D and that \(J \subseteq I'_1\). We apply Lemma 2.3(c) to \(X[I'_2]\) and X[J] to conclude.
(IV) \(m'_4 \le m'_2 +k\):
Assume that \(i \in \mathsf {Mis}_\delta (A) \cap I'_4\). Note that \(X[I'_4] \approx X[I'_2]\) due to psquare B. We apply Lemma 2.3(c) to \(X[I'_4]\) and \(X[I'_2]\) to conclude.
Recall that a deterministic period of a partial word X is an integer q such that there exists a (solid) word W such that \(W \approx X\) and W has a period q. In the following lemma we show that if the set \( LONG (X)\) is large enough, then the majority of its elements have strong periodic properties.
Lemma 6.3
Let X be a partial word with k holes. Assume that the set \( LONG (X)\) contains at least \(16k+3\) elements. Then \(\delta =\mathrm {mingap}( LONG (X))/2\) is a deterministic period of all psquares from \( LONG (X)\) excluding possibly the \(2k+1\) longest ones.
Proof
Let \( LONG '(X)\) be the set \( LONG (X)\) without the \(2k+1\) longest elements, A be the longest p-square in \( LONG (X)\), and B be the longest p-square in \( LONG '(X)\). We start by proving a weaker property. In the proof we will use the fact that \(|\mathsf {Mis}_\delta (A)| \le 12k\) (Lemma 6.2).
Claim
\(\delta \) is a quantum period of B.
Proof
Assume to the contrary that \(\delta \) is not a quantum period of B, and let i be the minimal index such that \(X_i \not \approx X_{i-\delta }\), so that \(\delta < i \le |B|\). We bound the number of p-squares in \( LONG (X)\) by considering the following cases.
 1.
Let us count the p-squares from \( LONG (X)\) that contain the position i in the first half. Let \(C \in LONG (X)\), \(|C|=2c\), be such a p-square. Then \(X_{i+c} \approx X_i \not \approx X_{i-\delta } \approx X_{i+c-\delta }\). Hence, either at least one of the positions \(X_{i+c}\) and \(X_{i+c-\delta }\) contains a hole (2k possibilities), or \(X_{i+c} \not \approx X_{i+c-\delta }\), which means that \(i+c \in \mathsf {Mis}_\delta (A)\) (12k possibilities due to Lemma 6.2). Therefore, there can be at most 14k such p-squares.
 2.
Let us count the p-squares from \( LONG (X)\) that contain \(i-\delta \) in the first half and i in the second half. There can be at most one such p-square. Otherwise there would be two p-squares in \( LONG (X)\) whose halves’ lengths differ by less than \(\delta \), contradicting the definition of \(\delta \).
 3.
Let us count the p-squares from \( LONG (X)\) that contain both positions \(i-\delta \) and i in the second half. Let \(C \in LONG (X)\), \(|C|=2c\), be such a p-square. Then \(X_{i-c} \approx X_i \not \approx X_{i-\delta } \approx X_{i-c-\delta }\). Hence, at least one of the positions \(X_{i-c}\) and \(X_{i-c-\delta }\) contains a hole (they cannot form a mismatch, as i was selected as the minimal index). This gives 2k possibilities for such a p-square.
 4.
We will show that there are no p-squares from \( LONG (X)\) that do not contain the position i. If such a p-square existed, then we would have \(|X|/2< i-\delta < i \le |B|\), so \(i-\delta \) and i would be contained in right halves of all p-squares that are at least as long as B. There are \(2k+1\) of them, which contradicts point 3.
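As a quick check of the arithmetic, and assuming that the four cases above are meant to cover every p-square in \( LONG (X)\), the bounds from points 1 to 4 add up as follows:

\[ | LONG (X)| \le 14k + 1 + 2k + 0 = 16k+1 < 16k+3. \]

This would contradict the assumption of Lemma 6.3 that \( LONG (X)\) contains at least \(16k+3\) elements.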
Now we strengthen the previous claim and prove that \(\delta \) is a deterministic period of B. This will conclude the proof since all the p-squares in \( LONG '(X)\) are prefixes of B.
 1.
Let us count the p-squares \(C \in LONG (X)\), \(|C|=2c\), that contain i in the first half. Let \(j=i+c\). If \(j>|B|\), then \(C \in LONG (X) \setminus LONG '(X)\) and there are \(2k+1\) such p-squares. Otherwise, there can be at most 3k p-squares \(C \in LONG '(X)\) for which any of the positions \(j-d\delta \), \(j-\delta \), j contains a hole. Suppose that none of these positions contains a hole. Then \(B_{j-d\delta } = B_{i-d\delta } \not \approx B_i = B_j\) and \(B_{j-\delta } \ne {\lozenge }\). Hence, \(B_{j-\delta } \not \approx B_{j-d\delta }\) or \(B_{j-\delta } \not \approx B_j\), either of which contradicts the way d was selected. In total, there can be at most \(5k+1\) of the considered p-squares.
 2.
Let us count the p-squares from \( LONG (X)\) that contain \(i-d\delta \) in the first half and i in the second half. There can be at most d of them, as otherwise there would be two p-squares in \( LONG (X)\) whose halves’ lengths differ by less than \(\delta \), a contradiction. Hence, the number of such p-squares is at most \(k+1\).
 3.
Let us count the p-squares \(C \in LONG (X)\), \(|C|=2c\), that contain both positions \(i-d\delta \) and i in the second half. Let \(j=i-c\). There can be at most 2k such p-squares C for which any of the positions \(j-d\delta \), j contains a hole. Suppose that neither of these positions contains a hole. Then \(B_{j-d\delta } \not \approx B_j\), which contradicts the definition of i.
 4.
Let us count the p-squares that contain the position \(i-d\delta \) in the second half and do not contain the position i. Using the same argument as in point 2, we see that there are at most \(k+1\) of them.
 5.
Finally, we will show that there are no p-squares in \( LONG (X)\) that do not contain the position \(i-d\delta \). If such a p-square existed, then both positions \(i-d\delta \) and i would be contained in right halves of all p-squares from \( LONG (X) \setminus LONG '(X)\). There are \(2k+1\) of them, which contradicts point 3.
By \(\mathcal {U}\text{-} Pref (X)\) we denote the set of unambiguous p-squares in \( LONG (X)\) that occur in X only as a prefix.
Lemma 6.4
Let X be a partial word with k holes. Then \(|\mathcal {U}\text{-} Pref (X)| < 16k+3\).
Proof
Assume to the contrary that \(|\mathcal {U}\text{-} Pref (X)| \ge 16k+3\). Recall that \(\mathcal {U}\text{-} Pref (X) \subseteq LONG (X)\), so the assumptions of Lemma 6.3 are satisfied.
Let \(\mathcal {U}\text{-} Pref '(X)\) be the set \(\mathcal {U}\text{-} Pref (X)\) without the \(2k+1\) longest elements. By Lemma 6.3, each p-square in \(\mathcal {U}\text{-} Pref '(X)\) has a deterministic period \(\delta = \mathrm {mingap}( LONG (X))/2\).
Let us assume that \(B=X[1..2a] \in \mathcal {U}\text{-} Pref '(X)\) and let \(W^2\) be its (solid) representative. Then \(C=X[1+\delta ..2a+\delta ]\) is a p-square, as it matches \(W^2\) due to the deterministic period \(\delta \). If \(X[2a+1..2a+\delta ]\) did not contain a hole, then C would be another occurrence of a u-square with representative \(W^2\). This would contradict the assumption that \(B \in \mathcal {U}\text{-} Pref (X)\). Hence, \(X[2a+1..2a+\delta ]\) must contain a hole.
Note that the fragments of the form \(X[2a+1..2a+\delta ]\) for \(X[1..2a] \in \mathcal {U}\text{-} Pref '(X)\) are pairwise disjoint due to the definition of \(\delta \). Since X has only k holes, it follows that \(|\mathcal {U}\text{-} Pref '(X)| \le k\) and \(|\mathcal {U}\text{-} Pref (X)| \le 3k+1\), a contradiction. \(\square \)
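The proof above works with the solid representative \(W^2\) of an unambiguous p-square. The sketch below computes a representative in the way suggested by Example 1.2, where each position of W takes the letter from the first half, or from the second half if the first has a hole. It treats a p-square as unambiguous when W contains no hole. Both readings are inferred from the examples rather than quoted definitions, and the helper names are hypothetical.

```python
HOLE = "?"  # assumption: ASCII stand-in for the hole symbol


def matches(a, b):
    """Two symbols match if they are equal or one of them is a hole."""
    return a == b or a == HOLE or b == HOLE


def representative(p):
    """Return W such that repr(p) = W^2, or None if p is not a p-square."""
    if len(p) % 2 != 0:
        return None
    half = len(p) // 2
    w = []
    for a, b in zip(p[:half], p[half:]):
        if not matches(a, b):
            return None  # the two halves do not match
        w.append(b if a == HOLE else a)  # prefer a letter over a hole
    return "".join(w)


def is_unambiguous(p):
    """Assumed criterion: the representative exists and has no hole."""
    w = representative(p)
    return w is not None and HOLE not in w


# The two p-squares of Example 1.2, written with "?" for the hole:
ambiguous_w = representative("a?ba??")    # "a?b", so the p-square is ambiguous
unambiguous_w = representative("a???ab")  # "aab", so it is unambiguous
```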
We say that a solid square \(W^2\) has a solid occurrence in T if T contains a factor equal to \(W^2\). By the following fact, there are at most 2n non-equivalent p-square factors of T with solid occurrences.
Fact 6.5
(Fraenkel and Simpson 1998; Ilie 2005; Deza et al. 2015) Every position of a (solid) word contains at most two rightmost occurrences of squares.
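Fact 6.5 can be checked on small solid words by brute force. The sketch below is only an illustration with cubic running time, not one of the linear-time algorithms cited in the introduction. It lists the distinct square factors of a word and counts, for each starting position, how many distinct squares have their rightmost occurrence there. The function names are hypothetical.

```python
from collections import Counter


def distinct_squares(w):
    """Return the set of distinct square factors of the solid word w."""
    n = len(w)
    found = set()
    for i in range(n):
        for h in range(1, (n - i) // 2 + 1):
            if w[i:i + h] == w[i + h:i + 2 * h]:
                found.add(w[i:i + 2 * h])
    return found


def rightmost_start_counts(w):
    """Map each start position to the number of distinct squares whose
    rightmost occurrence in w begins there (str.rfind gives that position)."""
    return Counter(w.rfind(s) for s in distinct_squares(w))


# distinct_squares("aaaa") is {"aa", "aaaa"}: there are three square
# occurrences of "aa" but only two distinct squares.
squares = distinct_squares("aaaa")
counts = rightmost_start_counts("aaaa")
```

By Fact 6.5, no value in `rightmost_start_counts(w)` should exceed 2, which gives the bound of 2n on the number of distinct squares.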
In the proof of the upper bound on the number of u-squares we separately count u-squares that have a solid occurrence and those that do not. In the latter case, we use Lemma 6.4, which lets us bound \(|\mathcal {U}\text{-} Pref (X)|\) by 19k when \(k>0\).
Theorem 6.6
If T is a partial word of length n with k holes, then \( usquares (T)=\mathcal {O}(nk)\).
Proof
Notes
Acknowledgements
Tomasz Kociumaka is supported by Polish budget funds for science in 2013–2017 as a research project under the ‘Diamond Grant’ program, Grant No. DI2012 017942. Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń are supported by the Polish National Science Center, Grant No. 2014/13/B/ST6/00770.
References
 Bannai H, Inenaga S, Köppl D (2017) Computing all distinct squares in linear time for integer alphabets. In: Kärkkäinen J, Radoszewski J, Rytter W (eds) 28th Annual symposium on combinatorial pattern matching, CPM 2017, LIPIcs, vol 78. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, pp 22:1–22:18. https://doi.org/10.4230/LIPIcs.CPM.2017.22
 Blanchet-Sadri F, Mercaş R (2009) A note on the number of squares in a partial word with one hole. Inform Théor Appl 43(4):767–774. https://doi.org/10.1051/ita/2009019
 Blanchet-Sadri F, Mercaş R (2012) The three-squares lemma for partial words with one hole. Theor Comput Sci 428:1–9. https://doi.org/10.1016/j.tcs.2012.01.012
 Blanchet-Sadri F, Mercaş R, Scott G (2009) Counting distinct squares in partial words. Acta Cybern 19(2):465–477
 Blanchet-Sadri F, Jiao Y, Machacek JM, Quigley J, Zhang X (2014a) Squares in partial words. Theor Comput Sci 530:42–57. https://doi.org/10.1007/978-3-642-31653-1_36
 Blanchet-Sadri F, Nikkel J, Quigley JD, Zhang X (2014b) Computing primitively-rooted squares and runs in partial words. In: Kratochvíl J, Miller M, Froncek D (eds) Combinatorial algorithms, IWOCA 2014. Lecture notes in computer science, vol 8986. Springer, pp 86–97. https://doi.org/10.1007/978-3-319-19315-1_8
 Blanchet-Sadri F, Bodnar M, Nikkel J, Quigley JD, Zhang X (2015) Squares and primitivity in partial words. Discrete Appl Math 185:26–37. https://doi.org/10.1016/j.dam.2014.12.003
 Charalampopoulos P, Crochemore M, Iliopoulos CS, Kociumaka T, Pissis SP, Radoszewski J, Rytter W, Waleń T (2017) Efficient enumeration of non-equivalent squares in partial words with few holes. In: Cao Y, Chen Y (eds) Proceedings of the 23rd international conference on computing and combinatorics, COCOON 2017. Lecture notes in computer science, vol 10392. Springer, pp 99–111. https://doi.org/10.1007/978-3-319-62389-4_9
 Crochemore M, Rytter W (1995) Squares, cubes, and time-space efficient string searching. Algorithmica 13(5):405–425. https://doi.org/10.1007/BF01190846
 Crochemore M, Iliopoulos CS, Kubica M, Radoszewski J, Rytter W, Waleń T (2014) Extracting powers and periods in a word from its runs structure. Theor Comput Sci 521:29–41. https://doi.org/10.1016/j.tcs.2013.11.018
 Deza A, Franek F, Thierry A (2015) How many double squares can a string contain? Discrete Appl Math 180:52–69. https://doi.org/10.1016/j.dam.2014.08.016
 Diaconu A, Manea F, Tiseanu C (2009) Combinatorial queries and updates on partial words. In: Kutyłowski M, Charatonik W, Gȩbala M (eds) Fundamentals of computation theory, FCT 2009. Lecture notes in computer science, vol 5699. Springer, pp 96–108. https://doi.org/10.1007/978-3-642-03409-1_10
 Farach M (1997) Optimal suffix tree construction with large alphabets. In: FOCS. IEEE Computer Society, pp 137–143
 Fraenkel AS, Simpson J (1998) How many squares can a string contain? J Comb Theory Ser A 82(1):112–120. https://doi.org/10.1006/jcta.1997.2843
 Gusfield D (1997) Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, Cambridge
 Gusfield D, Stoye J (2004) Linear time algorithms for finding and representing all the tandem repeats in a string. J Comput Syst Sci 69(4):525–546. https://doi.org/10.1016/j.jcss.2004.03.004
 Halava V, Harju T, Kärki T (2010) On the number of squares in partial words. RAIRO - Theor Inform Appl 44(1):125–138. https://doi.org/10.1051/ita/2010008
 Ilie L (2005) A simple proof that a word of length \(n\) has at most \(2n\) distinct squares. J Comb Theory Ser A 112(1):163–164. https://doi.org/10.1016/j.jcta.2005.01.006
 Kociumaka T (2016) Minimal suffix and rotation of a substring in optimal time. In: Grossi R, Lewenstein M (eds) Combinatorial pattern matching, CPM 2016. LIPIcs, vol 54. Schloss Dagstuhl, pp 28:1–28:12. https://doi.org/10.4230/LIPIcs.CPM.2016.28
 Main MG, Lorentz RJ (1984) An \(O(n \log n)\) algorithm for finding all repetitions in a string. J Algorithms 5(3):422–432. https://doi.org/10.1016/0196-6774(84)90021-X
 Manea F, Tiseanu C (2010) Hard counting problems for partial words. In: Dediu A, Fernau H, Martín-Vide C (eds) Language and automata theory and applications, LATA 2010. Lecture notes in computer science, vol 6031. Springer, pp 426–438. https://doi.org/10.1007/978-3-642-13089-2_36
 Manea F, Mercaş R, Tiseanu C (2014) An algorithmic toolbox for periodic partial words. Discrete Appl Math 179:174–192. https://doi.org/10.1016/j.dam.2014.07.017
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.