Tighter Bounds and Optimal Algorithms for All Maximal α-gapped Repeats and Palindromes
Abstract
An α-gapped repeat (α ≥ 1) in a word w is a factor uvu of w such that |uv| ≤ α|u|; the two occurrences of u are called the arms of this α-gapped repeat. An α-gapped repeat is called maximal if its arms can be extended simultaneously with the same character neither to the right nor to the left. We show that the number of all maximal α-gapped repeats occurring in words of length n is upper bounded by 18αn. In the case of α-gapped palindromes, i.e., factors \(uv{{u}^{\intercal }}\) with |uv| ≤ α|u|, we show that the number of all maximal α-gapped palindromes occurring in words of length n is upper bounded by 28αn + 7n. Both upper bounds allow us to construct algorithms finding all maximal α-gapped repeats and/or all maximal α-gapped palindromes of a word of length n on an integer alphabet of size \(n^{\mathcal {O}(1)}\) in \({\mathcal {O}(\alpha n)}\) time. The presented running times are optimal since there are words that have Θ(αn) maximal α-gapped repeats/palindromes.
Keywords
Combinatorics on words · Counting algorithms
1 Introduction
Gapped repeats and palindromes are repetitive structures occurring in words that were investigated extensively within theoretical computer science (see, e.g., [3, 5, 6, 7, 8, 10, 14, 17, 18, 19, 22] and the references therein) with motivation coming especially from the analysis of DNA and RNA structures, modelling different types of tandem and interspersed repeats as well as hairpin structures; such structures are important in analyzing the structural and functional information of the genetic sequences (see, e.g., [3, 14, 18]).
Let \({{w}^{{\intercal }}}\) denote the reversed word of a word w. An α-gapped repeat (respectively, an α-gapped palindrome) is a factor of the form uvu (respectively, \(uv{{u}^{{\intercal }}}\)) with |uv| ≤ α|u|, for a real number α ≥ 1. These are natural generalizations of classical repetitive and palindromic structures in a word: 1-gapped repeats and 1-gapped palindromes are respectively equivalent to squares and even palindromes, which are well-known and well-studied structures. Also, 2-gapped repeats and 2-gapped palindromes are respectively called long-armed pairs and long-armed palindromes.

Closing the gap between the upper bound \(\mathcal {O}(\alpha ^{2} n)\) and the lower bound Ω(αn) for the number of maximal α-gapped repeats, and

Developing a more efficient algorithm.

The number of all maximal α-gapped repeats in a word of length n is at most 18αn (Theorem 11).

The number of all maximal α-gapped palindromes in a word of length n is at most 28αn + 7n (Theorem 14).

We can compute the set of all maximal α-gapped repeats in \({\mathcal {O}(\alpha n)}\) time for integer alphabets (Theorem 28).

This algorithm can be adapted to find the number of all maximal α-gapped palindromes in \({\mathcal {O}(\alpha n)}\) time (Corollary 29).
Example 1 ([7, Thm. 2])
The word \(w_k := (abba)^k\) with \(k\in \mathbb {N}\) contains Θ(αn) maximal α-gapped repeats whose arms are of length one, where n denotes the length of \(w_k\). Since an α-gapped repeat whose arms have length one is an α-gapped palindrome, we get that the number of maximal α-gapped repeats and the number of maximal α-gapped palindromes in the word \(w_k\) is Ω(αn).
In this sense, we cannot hope for algorithms finding all α-gapped repeats or palindromes faster in the worst case. The results above improve those of [19] (as well as those existing in the literature before [19]). Our algorithms require a deeper analysis than the one developed in [10] for finding the longest α-gapped repeats. Moreover, they use essentially different techniques and data structures than the ones described in [7, 18, 22].
A related problem is the computation of all factors with an exponent less than 2 that are maximal with respect to their exponents. This problem was recently investigated in [1].
2 Combinatorics on Words
Let Σ be a finite alphabet; an element of Σ is called a character. Σ^{∗} denotes the set of all finite words over Σ. The length of a word w ∈ Σ^{∗} is denoted by |w|. For v = xuy with x,u,y ∈ Σ^{∗}, we call x, u and y a prefix, factor, and suffix of v, respectively. We denote by w[i] the character occurring at position i in w, and by w[i,j] the factor of w starting at position i and ending at position j, consisting of the catenation of the characters w[i],…,w[j], where 1 ≤ i ≤ j ≤ |w|; w[i,j] is the empty word if i > j. By \({{{w}^{{\intercal }}}}\) we denote the mirror image of w.
By \(\mathcal{I}=[b,e]\) we represent the set of consecutive integers from b to e, for b ≤ e, and call \(\mathcal{I}\) an interval. For an interval \(\mathcal{I}\), we use the notations \(\mathsf{b}(\mathcal{I})\) and \(\mathsf{e}(\mathcal{I})\) to denote the beginning and end of \(\mathcal{I}\); i.e., \(\mathcal{I} = [\mathsf{b}(\mathcal{I}),\mathsf{e}(\mathcal{I})]\). We write \(\left|\mathcal{I}\right|\) to denote the length of \(\mathcal{I}\); i.e., \(\left|\mathcal{I}\right| = \mathsf{e}(\mathcal{I}) - \mathsf{b}(\mathcal{I}) + 1\).
A subword w[b,e] of a word w is the occurrence of a factor f equal to w[b,e] in w; we say that f occurs at position b in w. While a factor is identified only by a sequence of characters, a subword is also identified by its position in the word. So subwords are always unique, while a word may contain multiple occurrences of the same factor. We use the same notation for defining factors and subwords of a word. For two subwords u and \(\overline{u}\) of a word w, we write \(u = \overline{u}\) if they start at the same position in w and have the same length. We write \(u \equiv \overline{u}\) if the factors identifying these subwords are the same (hence \(u = \overline{u} \Rightarrow u \equiv \overline{u}\)). We implicitly use subwords both as factors of w and as intervals contained in [1, |w|]; e.g., we write \(u \subseteq \overline{u}\) if two subwords \(u := w[b, e]\), \(\overline{u} := w[\overline{b}, \overline{e}]\) of w satisfy \([b, e] \subseteq [\overline{b}, \overline{e}]\), i.e., \(\mathsf{b}(\overline{u}) \le \mathsf{b}(u) \le \mathsf{e}(u) \le \mathsf{e}(\overline{u})\). Two subwords u and \(\overline{u}\) of the same word w are called consecutive iff \(\mathsf{e}(u)+1 = \mathsf{b}(\overline{u})\). Two occurrences u and \(\overline{u}\) with \(\mathsf{b}(u) < \mathsf{b}(\overline{u})\) of the same factor v in a word w are called subsequent if there is no occurrence of v starting between \(\mathsf{b}(u) + 1\) and \(\mathsf{b}(\overline{u}) - 1\).
A period of a word w over Σ is a positive integer p < |w| such that w[i] = w[j] for all i and j with 1 ≤ i,j ≤ |w| and i ≡ j (mod p); a word that has period p is also called p-periodic. A word w whose smallest period is at most ⌊|w|/2⌋ is called periodic; otherwise, w is called aperiodic.
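The definition of periods can be made concrete with a small brute-force sketch. The function names `smallest_period` and `is_periodic` are illustrative only, positions are 0-based here (the text uses 1-based positions), and the quadratic running time is chosen for clarity rather than efficiency.

```python
def smallest_period(w):
    """Return the smallest period p < len(w) of w, or None if w has no such period."""
    n = len(w)
    for p in range(1, n):
        # p is a period iff every character equals the one p positions later
        if all(w[k] == w[k + p] for k in range(n - p)):
            return p
    return None


def is_periodic(w):
    """A word is periodic iff its smallest period is at most floor(len(w) / 2)."""
    p = smallest_period(w)
    return p is not None and p <= len(w) // 2


print(smallest_period("abaab"), is_periodic("abaab"))  # 3 False
print(smallest_period("abab"), is_periodic("abab"))    # 2 True
```

Checking w[k] = w[k + p] for all k is equivalent to the congruence condition of the definition, since positions that are congruent modulo p are linked by repeated steps of size p.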
A repetition in a word w is a periodic factor; a run is a maximal repetition; the exponent of a run is the (rational) number of times the period fits in that run. Let E(w) denote the sum of the exponents of runs in the word w. The exponent of a run r is denoted by exp(r). We use the following results from the literature:
Lemma 1
The length of the overlap between two subsequent occurrences of an aperiodic factor u in a word w is upper bounded by ⌊|u|/2⌋.
Lemma 2 ([2])
For a word w, E(w) < 3|w|, and the number of runs is less than |w|.
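As an illustration of Lemma 2, the following sketch enumerates the runs of a short word by brute force and sums their exponents. It is not the method of the cited work; the names `runs` and `smallest_period` are hypothetical, positions are 0-based, and the running time is polynomial but far from linear.

```python
from fractions import Fraction


def smallest_period(s):
    """Smallest period p < len(s) of s, or None if there is none."""
    n = len(s)
    for p in range(1, n):
        if all(s[k] == s[k + p] for k in range(n - p)):
            return p
    return None


def runs(w):
    """Return all runs of w as triples (begin, end, period), 0-based and inclusive."""
    n = len(w)
    found = []
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if w[k] != w[k + p]:
                k += 1
                continue
            start = k
            # extend the stretch of positions k with w[k] == w[k + p] as far as possible
            while k + p < n and w[k] == w[k + p]:
                k += 1
            end = k + p  # exclusive end of the p-periodic factor
            # keep it only if it is a repetition whose smallest period is p
            if end - start >= 2 * p and smallest_period(w[start:end]) == p:
                found.append((start, end - 1, p))
    return found


w = "aabaabaa"
rs = runs(w)
exponent_sum = sum(Fraction(e - b + 1, p) for b, e, p in rs)
print(rs, exponent_sum)
```

For this word the sketch reports four runs, whose exponents sum to 26/3, in line with the bounds E(w) < 3|w| and fewer than |w| runs.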
Corollary 3 ([7, Conclusions])
The number of maximal 1-gapped repeats is less than |w|.
Lemma 4 ([19])
Two distinct runs with the same minimal period p cannot have an overlap of length greater than or equal to p.
Observation 5
The mirror image of a gapped repeat (resp. palindrome) is a gapped repeat (resp. palindrome) with the same period. Hence, there exist the bijections \({\mathcal {G }_{\alpha }(w)}\sim {\mathcal {G }_{\alpha }({{w}^{{\intercal }}})}\) and \({{\mathcal {G}}_{\alpha }^{\intercal }(w)}\sim {{\mathcal {G}}_{\alpha }^{\intercal }({{w}^{{\intercal }}})}\) .
2.1 Point Analysis
A pair of positive integers is called a point. We use points to bound the cardinality of a subset of gapped repeats and gapped palindromes by injectively mapping a gapped repeat (resp. palindrome) to a point as stated above. To this end, we show that a certain vicinity of any point generated by a member of this subset does not contain any point that is generated by another member. This vicinity is given by
Definition 6
Given a real number γ with γ ∈ (0,1], we say that a point (x,y) γ-covers a point (x′,y′) iff x − γy ≤ x′ ≤ x and y − γy ≤ y′ ≤ y.
It is crucial that the factor γ is always multiplied with the y-coordinate. In other words, the number of points that are γ-covered by a point (⋅,y) correlates with γ and the value y. The main property of this definition is given by
Lemma 7
Given a real number γ with γ ∈ (0,1], let \(S \subset {[1,n]}^{2} \subset \mathbb {N}^{2}\) be a set of points such that no two distinct points in S γ-cover the same point. Then |S| < 3n/γ.
Proof
We estimate the maximal number of points that can be placed in \({[1,n]}^{2} \subset \mathbb {N}^{2}\) such that their covered points are disjoint. First, the number of points (⋅,y) ∈ [1,n]^{2} with y < 1/γ is less than n/γ. Second, if a point (⋅,y) satisfies \(2^{\ell}/\gamma \le y < 2^{\ell+1}/\gamma\) for an integer ℓ ≥ 0, the point (⋅,y) γ-covers at least \(2^{\ell} \times 2^{\ell}\) points, or to put it differently, this point γ-covers at least \(2^{\ell}\) points (⋅,y′) with \(y - 2^{\ell} \le y' \le y\). In other words, there are at most \(n/(2^{\ell}\gamma)\) points in S with \(2^{\ell}/\gamma \le y < 2^{\ell+1}/\gamma\). Hence, \(\left|S\right| < n/\gamma + {\sum }_{\ell =0}^{\infty } n/(2^{\ell}\gamma) = 3n/\gamma\). □
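To get a feeling for Definition 6 and the bound of Lemma 7, one can experiment with a small sketch. It lists the points γ-covered by a given point and greedily builds a set S in which no two points γ-cover a common point. The names `covered` and `greedy_packing` are hypothetical, and the greedy construction is only one way to obtain such a set; it is not claimed to be largest possible.

```python
from fractions import Fraction
from math import ceil


def covered(x, y, gamma):
    """All points (pairs of positive integers) that the point (x, y) gamma-covers."""
    lo_x = max(1, ceil(x - gamma * y))
    lo_y = max(1, ceil(y - gamma * y))
    return {(a, b) for a in range(lo_x, x + 1) for b in range(lo_y, y + 1)}


def greedy_packing(n, gamma):
    """Greedily pick points of [1, n]^2 whose gamma-covered sets are pairwise disjoint."""
    taken, chosen = set(), []
    for y in range(1, n + 1):
        for x in range(1, n + 1):
            c = covered(x, y, gamma)
            if taken.isdisjoint(c):
                chosen.append((x, y))
                taken |= c
    return chosen


gamma = Fraction(1, 2)
n = 8
S = greedy_packing(n, gamma)
print(len(S), "<", 3 * n / gamma)
```

Exact rational arithmetic via `Fraction` avoids rounding issues when evaluating x − γy and y − γy.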

those whose arms are contained in one or two runs,

those whose arms contain a periodic prefix or suffix larger than half of the size of the arms, and

those belonging to neither of the two subsets.
They showed that the first two subsets contain at most \({\mathcal {O}(\alpha n)}\) elements. The point analysis is used as a tool for studying the last subset. By mapping a gapped repeat to a point consisting of the end position of its left arm and its period, they showed that the points created by two different maximal α-gapped repeats cannot \(\frac {1}{4\alpha }\)-cover the same point. By this property, they bounded the size of the last subset by \(\mathcal {O}{(\alpha ^{2} n)}\). Lemma 7 immediately improves this bound of \({\mathcal {O}(\alpha ^{2} n)}\) to \({\mathcal {O}(\alpha n)}\). Consequently, it shows that the number of maximal α-gapped repeats of a word of length n is \({\mathcal {O}(\alpha n)}\).
2.2 Upper Bound on the Number of Periodic Maximal α-gapped Repeats and Palindromes
Unlike [7, 18, 19], we partition the maximal α-gapped repeats (resp. palindromes) differently. We categorize a gapped repeat (resp. palindrome) depending on whether its left arm contains a periodic prefix or not. The two subsets are treated differently. For the ones having a periodic prefix, we consider the number of runs covering this prefix. The other category is analyzed by using the results of Section 2.1. We begin with a formal definition of both subsets and analyze the former subset.
Let β be a real number with 0 < β < 1. A gapped repeat (resp. palindrome) \(\sigma = u_\lambda, v, u_\rho\) belongs to \({\beta \mathcal {P }_{\alpha }(w)}\) (resp. \({{\beta \mathcal {P}}_{\alpha }^{\intercal }(w)}\)) iff \(u_\lambda\) contains a periodic prefix of length at least \(\beta |u_\lambda|\). We call σ periodic. Otherwise \({\sigma } \in {\overline {\beta \mathcal {P}}_{\alpha }(w)}\) (resp. \({\sigma } \in {{\overline {\beta \mathcal {P}}}_{\alpha }^{\intercal }(w)}\)), where \({\overline {\beta \mathcal {P }}_{\alpha }(w)}:= {\mathcal {G }_{\alpha }(w)}\setminus {\beta \mathcal {P }_{\alpha }(w)}\) and \({{\overline {\beta \mathcal {P}}}_{\alpha }^{\intercal }(w)}:= {{\mathcal {G}}_{\alpha }^{\intercal }(w)}\setminus {{\beta \mathcal {P}}_{\alpha }^{\intercal }(w)}\); we call σ aperiodic.
Lemma 8
 (a)
\(\left|{\beta \mathcal {P }_{\alpha }(w)}\right|\) is at most 2αE(w)/β, and
 (b)
\(\left|{{\beta \mathcal {P}}_{\alpha }^{\intercal }(w)}\right|\) is at most 2(α + 1)E(w)/β.
Proof
(a) Gapped Repeats

r _{ λ } in case b(u _{ λ }) = b(r _{ λ }), or

r _{ ρ } in case b(u _{ ρ }) = b(r _{ ρ }).
 SubClaim

Given two different gapped repeats \(\sigma_1\) and \(\sigma_2\) with respective periods \(q_1\) and \(q_2\) such that the left arms of both are generated by \(r_\lambda\), the difference δ between \(q_1\) and \(q_2\) must be at least p.
 SubProof

We consider two cases:

If \(\mathsf{e}(u_\lambda) = \mathsf{e}(r_\lambda)\), then \(u_\lambda = s_\lambda = r_\lambda\), so \(u_\lambda\) is p-periodic. Since both right arms are p-periodic, too, δ is a multiple of p.

Otherwise, both gapped repeats are generated by two different repeats with period p. So by Lemma 4, δ must be at least p.
Since \(|u_\lambda| \le |s_\lambda|/\beta\) and σ is α-gapped, \(1 \le q \le |s_\lambda|\alpha/\beta \le |r_\lambda|\alpha/\beta\). Then the number of possible periods q is bounded by \(|r_\lambda|\alpha/(\beta p) = \exp(r_\lambda)\alpha/\beta\). Therefore the number of maximal α-gapped repeats is bounded by αE(w)/β for the case \(\mathsf{b}(u_\lambda) = \mathsf{b}(r_\lambda)\). Since the case \(\mathsf{b}(u_\rho) = \mathsf{b}(r_\rho)\) is symmetric, we get the bound 2αE(w)/β in total.
(b) Gapped Palindromes

r _{ λ } in case b(u _{ λ }) = b(r _{ λ }), or

r _{ ρ } in case e(u _{ ρ }) = e(r _{ ρ }).
2.3 Upper Bound on the Number of Maximal α-gapped Repeats
We optimize the proof technique from Kolpakov et al. [19] and improve the upper bound on the number of maximal α-gapped repeats in a word of length n from \({\mathcal {O}(\alpha n)}\) to 18αn. Remembering the results of Section 2.1, we map gapped repeats to their respective points. By using the period of a gapped repeat as the y-coordinate, we can show the following lemma:
Lemma 9
Given a word w and two real numbers α,β with α > 1 and 2/3 ≤ β < 1, the points mapped by two different maximal gapped repeats in \({\overline {\beta \mathcal {P}}_{\alpha }(w)}\) cannot \(\frac {1-\beta }{\alpha }\)-cover the same point.
Proof
Let \(\sigma = u_\lambda, v, u_\rho\) and \(\overline{\sigma} = \overline{u_\lambda}, \overline{v}, \overline{u_\rho}\) be two different maximal gapped repeats in \({\overline {\beta \mathcal {P}}_{\alpha }(w)}\). Set \(u := |u_\lambda| = |u_\rho|\), \(\overline{u} := \left|\overline{u_\lambda}\right| = \left|\overline{u_\rho}\right|\), \(q := |u_\lambda v|\) and \(\overline{q} := \left|\overline{u_\lambda}\,\overline{v}\right|\). We map the maximal gapped repeats σ and \(\overline{\sigma}\) to the points \((\mathsf{e}(u_\lambda), q)\) and \((\mathsf{e}(\overline{u_\lambda}), \overline{q})\), respectively. Assume, for the sake of contradiction, that both points \(\frac{1-\beta}{\alpha}\)-cover the same point (x,y).
Let \(z := \left|\mathsf{e}(u_\lambda) - \mathsf{e}(\overline{u_\lambda})\right|\) be the difference of the endings of both left arms, and \(s_\lambda := w[[\mathsf{b}(u_\lambda), \mathsf{e}(u_\lambda)] \cap [\mathsf{b}(\overline{u_\lambda}), \mathsf{e}(\overline{u_\lambda})]]\) be the overlap of \(u_\lambda\) and \(\overline{u_\lambda}\). Let \(s := |s_\lambda|\), and let \(s_\rho\) (resp. \(\overline{s_\rho}\)) be the right copy of \(s_\lambda\) based on σ (resp. \(\overline{\sigma}\)).
 SubClaim

The overlap \(s_\lambda\) is not empty, and \(s_\rho \neq \overline{s_\rho}\).
 SubProof

Assume for this subproof that \(\mathsf{e}(u_\lambda) < \mathsf{e}(\overline{u_\lambda})\) (otherwise exchange σ with \(\overline{\sigma}\), or yield the contradiction \(\sigma = \overline{\sigma}\)). The latter contradiction (\(\sigma = \overline{\sigma}\)) is yielded by the following consideration: Since \(\mathsf{e}(u_\lambda) = \mathsf{e}(\overline{u_\lambda})\), \(s_\lambda\) cannot be empty (it is the intersection of both left arms). Further, both right copies are defined as the right translation of \(s_\lambda\) by q and \(\overline{q}\), respectively. So if both right copies are identical, then \(q = \overline{q}\), which contradicts the fact that the mapping of a maximal gapped repeat to the point consisting of its end point and its period is injective.
Having \(\mathsf{e}(u_\lambda) < \mathsf{e}(\overline{u_\lambda})\), we can combine the (1 − β)/α-cover property with the fact that \(\overline{\sigma}\) is α-gapped, and obtain \(\mathsf{e}(\overline{u_\lambda}) - \overline{u} \le \mathsf{e}(\overline{u_\lambda}) - \overline{q}/\alpha < \mathsf{e}(\overline{u_\lambda}) - \overline{q}(1-\beta)/\alpha \le x \le \mathsf{e}(u_\lambda) < \mathsf{e}(\overline{u_\lambda})\). Hence, the subword \(w[\mathsf{e}(u_\lambda)]\) is contained in \(\overline{u_\lambda}\). If \(s_\rho = \overline{s_\rho}\), then we get a contradiction to the maximality of σ: By the above inequality, \(w[\mathsf{e}(u_\lambda) + 1]\) is contained in \(\overline{u_\lambda}\), too. Since \(\overline{\sigma}\) is a gapped repeat, the character \(w[\mathsf{e}(u_\lambda) + 1]\) occurs in \(\overline{u_\rho}\), exactly at \(w[\mathsf{e}(u_\rho) + 1]\).
 1. Case \(\mathsf{e}(u_\lambda) \le \mathsf{e}(\overline{u_\lambda})\). Since \(\mathsf{e}(\overline{u_\lambda}) - \overline{q}(1-\beta)/\alpha \le x \le \mathsf{e}(u_\lambda) \le \mathsf{e}(\overline{u_\lambda})\),
$$ z = \mathsf{e}(\overline{u_\lambda}) - \mathsf{e}(u_\lambda) \le \overline{q}(1-\beta)/\alpha \le \overline{u}(1-\beta). $$ (4)
 1a. SubCase \(\mathsf{b}(u_\lambda) \le \mathsf{b}(\overline{u_\lambda})\), see Fig. 4. By (4), we get \(s = \overline{u} - z \ge \overline{u}\beta\). It follows from (2) and 2/3 ≤ β < 1 that \(s/\delta \ge \overline{u}\beta / (\overline{u}(1-\beta)) = \beta/(1-\beta) \ge 2\), which means that \(s_\rho\) and \(\overline{s_\rho}\) overlap at least half of their common length, so \(s_\lambda\) is periodic. Since \(s_\lambda\) is a prefix of \(\overline{u_\lambda}\) of length \(s \ge \overline{u}\beta\), \(\overline{\sigma}\) is in \({\beta \mathcal {P }_{\alpha }(w)}\), a contradiction.
 1b. SubCase \(\mathsf{b}(u_\lambda) > \mathsf{b}(\overline{u_\lambda})\), see Fig. 5. We conclude that \(s_\lambda = u_\lambda\). It follows from (2) and (3) and 2/3 ≤ β < 1 that \(s/\delta \ge \overline{q}\alpha\beta / (\overline{q}\alpha(1-\beta)) = \beta/(1-\beta) \ge 2\), which means that \(s_\lambda = u_\lambda\) is periodic. Hence σ is in \({\beta \mathcal {P }_{\alpha }(w)}\), a contradiction.
 2. Case \(\mathsf{e}(u_\lambda) > \mathsf{e}(\overline{u_\lambda})\). Since \(\mathsf{e}(u_\lambda) - q(1-\beta)/\alpha \le x \le \mathsf{e}(\overline{u_\lambda}) \le \mathsf{e}(u_\lambda)\),
$$ z = \mathsf{e}(u_\lambda) - \mathsf{e}(\overline{u_\lambda}) \le q(1-\beta)/\alpha \le \overline{q}(1-\beta)/\alpha \le \overline{u}(1-\beta). $$ (5)
 2a. SubCase \(\mathsf{b}(u_\lambda) \le \mathsf{b}(\overline{u_\lambda})\), see Fig. 6. We conclude that \(s_\lambda = \overline{u_\lambda}\). It follows from (2) and 2/3 ≤ β < 1 that \(s/\delta \ge \overline{u} / (\overline{u}(1-\beta)) = 1/(1-\beta) \ge 3 > 2\), which means that \(s_\lambda = \overline{u_\lambda}\) is periodic. Hence \(\overline{\sigma}\) is in \({\beta \mathcal {P }_{\alpha }(w)}\), a contradiction.
 2b. SubCase \(\mathsf{b}(u_\lambda) > \mathsf{b}(\overline{u_\lambda})\), see Fig. 7. By (5) we get z ≤ q(1 − β)/α ≤ u(1 − β) and hence s = u − z ≥ uβ. If δ ≤ s/2, \(s_\rho\) and \(\overline{s_\rho}\) overlap at least half of their common length, which leads to the contradiction that \(u_\lambda\) has a periodic prefix \(s_\lambda\) of length at least uβ. Otherwise, let us assume that s/2 < δ. By (2) and (3) we get \(u/\delta \ge \overline{q}\alpha\beta / (\overline{q}\alpha(1-\beta)) = \beta/(1-\beta) \ge 2\) with 2/3 ≤ β < 1. Hence, δ is upper bounded by u/2; so \(u_\rho\) has a periodic prefix of length at least 2δ (since 2δ > s ≥ uβ), a contradiction.
The next lemma follows immediately from Lemmas 7 and 9.
Lemma 10
Given two real numbers α,β with α > 1 and 2/3 ≤ β < 1, the number of all aperiodic maximal α-gapped repeats \(\left|{\overline {\beta \mathcal {P }}_{\alpha }(w)}\right|\) of a word w of length n is less than 3αn/(1 − β).
Theorem 11
Given a word w of length n and a real number α > 1, the number of all maximal α-gapped repeats \(\left|{\mathcal {G }_{\alpha }(w)}\right|\) is less than 18αn.
Proof
Combining the results of Lemma 8(a) and Lemma 10, \(\left|{\mathcal {G }_{\alpha }(w)}\right| = \left|{\beta \mathcal {P }_{\alpha }(w)}\right| + \left|{\overline {\beta \mathcal {P }}_{\alpha }(w)}\right| < 2\alpha E(w)/\beta + 3\alpha n/(1 - \beta)\) for 2/3 ≤ β < 1. Applying Lemma 2, the term is upper bounded by 6αn/β + 3αn/(1 − β). This expression is minimal for β = 2/3, where both summands equal 9αn, yielding the bound 18αn. □
With Corollary 3 we obtain the result of Theorem 11 for α ≥ 1.
2.4 Upper Bound on the Number of Maximal α-gapped Palindromes
By similar proofs, we can bound the number of maximal α-gapped palindromes by 28αn + 7n. This bound solves an open problem in [18], where Kolpakov and Kucherov conjectured that the number of α-gapped palindromes with α ≥ 2 in a word is linear. We briefly explain the main differences and similarities needed to understand the relationship between gapped repeats and palindromes. Let σ be a maximal α-gapped repeat (or α-gapped palindrome). If σ has a periodic prefix \(s_\lambda\) generated by a run, then its right arm has a periodic prefix (suffix) \(s_\rho\) generated by a run of the same period. Since σ is maximal, both runs have to obey constraints that are similar in both cases, considering whether σ is a gapped repeat or a gapped palindrome (this is reflected by the fact that large parts of the proofs of Lemma 8(a) and Lemma 8(b) are identical). As with aperiodic gapped repeats, we can apply the point analysis to the aperiodic α-gapped palindromes, too. Our main idea is to map a gapped palindrome \(u_\lambda, v, u_\rho\) injectively to the pair of integers \((\mathsf{e}(u_\lambda), |v|)\), exchanging the period with the size of the gap.
In what follows, we focus on maximal α-gapped palindromes with α > 1, because the case α = 1 is already solved in the literature. To see this, we observe that 1-gapped palindromes are (plain) palindromes. A palindrome u is a subword with \({{{u}}^{{\intercal }}} = {u}\). Its center is the value (e(u) + b(u))/2. A palindrome is called maximal if there is no longer palindrome with the same center. So a maximal palindrome is uniquely defined by its center. Hence, the number of maximal palindromes in a word of length n is at most 2n − 1. We conclude our observation with the fact that maximal 1-gapped palindromes are maximal palindromes. For the algorithmic part, the algorithm of [20] can be used to find all the maximal palindromes in linear time.
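The counting argument over centers can be illustrated by a naive sketch that expands around each of the 2n − 1 possible centers. This is a quadratic-time illustration only, not the linear-time algorithm of [20]; the name `maximal_palindromes` is hypothetical, positions are 0-based, and empty palindromes are omitted from the output.

```python
def maximal_palindromes(w):
    """Return the non-empty maximal palindromes of w as (begin, end) pairs, 0-based inclusive."""
    n = len(w)
    result = []
    for c in range(2 * n - 1):
        # even c: center on character c // 2; odd c: center between two characters
        lo, hi = c // 2, (c + 1) // 2
        while lo >= 0 and hi < n and w[lo] == w[hi]:
            lo -= 1
            hi += 1
        lo, hi = lo + 1, hi - 1  # step back to the last matching pair
        if lo <= hi:
            result.append((lo, hi))
    return result


print(maximal_palindromes("abba"))
# [(0, 0), (1, 1), (0, 3), (2, 2), (3, 3)]
```

Each center contributes at most one maximal palindrome, which is why the output never has more than 2n − 1 entries.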
Lemma 12
Given a word w and two real numbers α > 1 and 6/7 ≤ β < 1, the points mapped by two different maximal gapped palindromes in \({{\overline {\beta \mathcal {P}}}_{\alpha }^{\intercal }(w)}\) cannot \(\frac {1-\beta }{\alpha }\)-cover the same point.
Proof
Let \(\sigma = u_\lambda, v, u_\rho\) and \(\overline{\sigma} = \overline{u_\lambda}, \overline{v}, \overline{u_\rho}\) be two different maximal gapped palindromes in \({{\overline {\beta \mathcal {P}}}_{\alpha }^{\intercal }(w)}\). Set \(u := |u_\lambda| = |u_\rho|\), \(\overline{u} := \left|\overline{u_\lambda}\right| = \left|\overline{u_\rho}\right|\), \(g := |v|\) and \(\overline{g} := \left|\overline{v}\right|\). We map the maximal gapped palindromes σ and \(\overline{\sigma}\) to the points \((\mathsf{e}(u_\lambda), g)\) and \((\mathsf{e}(\overline{u_\lambda}), \overline{g})\), respectively. Assume, for the sake of contradiction, that both points \(\frac{1-\beta}{\alpha}\)-cover the same point (x,y). This means for the point \((\mathsf{e}(u_\lambda), g)\) that \(\mathsf{e}(u_\lambda) - (1-\beta)g/\alpha \le x \le \mathsf{e}(u_\lambda)\) and \(g - (1-\beta)g/\alpha \le y \le g\) hold. The same inequalities hold when exchanging \((\mathsf{e}(u_\lambda), g)\) with \((\mathsf{e}(\overline{u_\lambda}), \overline{g})\).
Let \(z := \left|\mathsf{e}(u_\lambda) - \mathsf{e}(\overline{u_\lambda})\right|\) be the difference of the endings of both left arms, and \(s_\lambda := w[[\mathsf{b}(u_\lambda), \mathsf{e}(u_\lambda)] \cap [\mathsf{b}(\overline{u_\lambda}), \mathsf{e}(\overline{u_\lambda})]]\) be the overlap of \(u_\lambda\) and \(\overline{u_\lambda}\). Let \(s := |s_\lambda|\), and let \(s_\rho\) (resp. \(\overline{s_\rho}\)) be the reversed copy of \(s_\lambda\) based on σ (resp. \(\overline{\sigma}\)).
 SubClaim

The overlap \(s_\lambda\) is not empty, and \(s_\rho \neq \overline{s_\rho}\).
 SubProof

Assume for this subproof that \(\mathsf{e}(u_\lambda) < \mathsf{e}(\overline{u_\lambda})\) (otherwise exchange σ with \(\overline{\sigma}\), or yield the contradiction \(\sigma = \overline{\sigma}\)). The latter contradiction (\(\sigma = \overline{\sigma}\)) is yielded by the following consideration: Since \(\mathsf{e}(u_\lambda) = \mathsf{e}(\overline{u_\lambda})\), \(s_\lambda\) cannot be empty (it is the intersection of both left arms). In particular, it is the longest common suffix of \(u_\lambda\) and \(\overline{u_\lambda}\). Consequently, both reversed copies \(s_\rho\) and \(\overline{s_\rho}\) of \(s_\lambda\) are prefixes of \(u_\rho\) and \(\overline{u_\rho}\), respectively. The gap between \(s_\rho\) and \(\overline{s_\rho}\) to \(s_\lambda\) is g and \(\overline{g}\), respectively. In other words, if \(s_\rho = \overline{s_\rho}\), then \(g = \overline{g}\), a contradiction to the fact that the mapping of a maximal gapped palindrome to the point consisting of its end point and its gap is injective.
By combining the (1 − β)/α-cover property with the fact that \(\overline{\sigma}\) is α-gapped, we obtain \(\mathsf{e}(\overline{u_\lambda}) - \overline{u} \le \mathsf{e}(\overline{u_\lambda}) - (\overline{u} + \overline{g})/\alpha < \mathsf{e}(\overline{u_\lambda}) - \overline{g}(1-\beta)/\alpha \le x \le \mathsf{e}(u_\lambda) < \mathsf{e}(\overline{u_\lambda})\). So the subword \(w[\mathsf{e}(u_\lambda)]\) is contained in \(\overline{u_\lambda}\). If \(s_\rho = \overline{s_\rho}\), then we get a contradiction to the maximality of σ: By the above inequality, \(w[\mathsf{e}(u_\lambda) + 1]\) is contained in \(\overline{u_\lambda}\), too. Since \(\overline{\sigma}\) is a gapped palindrome, the character \(w[\mathsf{e}(u_\lambda) + 1]\) occurs in \(\overline{u_\rho}\), exactly at \(w[\mathsf{b}(u_\rho) - 1]\).
 1. Case \(\mathsf{e}(u_\lambda) \le \mathsf{e}(\overline{u_\lambda})\). Since \(\mathsf{e}(\overline{u_\lambda}) - \overline{g}(1-\beta)/\alpha \le x \le \mathsf{e}(u_\lambda) \le \mathsf{e}(\overline{u_\lambda})\),
$$ z = \mathsf{e}(\overline{u_\lambda}) - \mathsf{e}(u_\lambda) \le \overline{g}(1-\beta)/\alpha \le \overline{u}(1-\beta). $$ (8)
Since \(s_\lambda\) is a suffix of \(u_\lambda\), the reverse copy \(s_\rho\) is a prefix of \(u_\rho\). The starting positions of both right copies \(\overline{s_\rho}\) and \(s_\rho\) differ by \(\mathsf{b}(\overline{s_\rho}) - \mathsf{b}(s_\rho) = 2z + \delta > 0\). The inequality 2z + δ > 0 holds, since \(\mathsf{e}(u_\lambda) \neq \mathsf{e}(\overline{u_\lambda})\) or \(g \neq \overline{g}\). By (7) and (8), we get
$$ 2z + \delta \le 3\overline{g}(1-\beta)/\alpha \le 3\overline{u}(1-\beta). $$ (9)
 1a. SubCase \(\mathsf{b}(u_\lambda) \le \mathsf{b}(\overline{u_\lambda})\), see Fig. 8. By (8), we get \(s = \overline{u} - z \ge \overline{u}\beta\). It follows from 6/7 ≤ β < 1 and (9) that \(s/(2z + \delta) \ge \overline{u}\beta / (3\overline{u}(1-\beta)) = \beta/(3(1-\beta)) \ge 2\), which means that \(s_\rho\) and \(\overline{s_\rho}\) overlap by at least half of their common length, and \(s_\lambda\) is periodic. Since \(s_\lambda\) is a prefix of \(\overline{u_\lambda}\) of length \(s \ge \overline{u}\beta\), \(\overline{\sigma}\) is in \({{\beta \mathcal {P}}_{\alpha }^{\intercal }(w)}\), a contradiction.
 1b. SubCase \(\mathsf{b}(u_\lambda) > \mathsf{b}(\overline{u_\lambda})\), see Fig. 9. We conclude that \(s_\lambda = u_\lambda\). By (6) and β < 1,
$$ u \ge g/\alpha \ge \frac{\overline{g}}{\alpha}\left(1 - \frac{1-\beta}{\alpha}\right) \ge \overline{g}\beta/\alpha. $$ (10)
It follows from 6/7 ≤ β < 1 and (9) that \(s/(2z + \delta) \ge \overline{g}\alpha\beta / (3\alpha\overline{g}(1-\beta)) = \beta/(3(1-\beta)) \ge 2\), which means that \(s_\lambda = u_\lambda\) is periodic. Hence σ is in \({{\beta \mathcal {P}}_{\alpha }^{\intercal }(w)}\), a contradiction.
 2. Case \(\mathsf{e}(u_\lambda) > \mathsf{e}(\overline{u_\lambda})\). Since \(\mathsf{e}(u_\lambda) - g(1-\beta)/\alpha \le x \le \mathsf{e}(\overline{u_\lambda}) \le \mathsf{e}(u_\lambda)\),
$$ z = \mathsf{e}(u_\lambda) - \mathsf{e}(\overline{u_\lambda}) \le g(1-\beta)/\alpha \le \overline{g}(1-\beta)/\alpha \le \overline{u}(1-\beta). $$ (11)
The starting positions of both right copies differ by \(\left|\mathsf{b}(s_\rho) - \mathsf{b}(\overline{s_\rho})\right| = \left|2z - \delta\right|\) because \(\mathsf{b}(s_\rho) = \mathsf{e}(s_\lambda) + 2z + g\) and \(\mathsf{b}(\overline{s_\rho}) = \mathsf{e}(s_\lambda) + \overline{g}\). Since \(|2z - \delta| \le \max(\delta, 2z)\), we get \(\left|2z - \delta\right| \le 2\overline{g}(1-\beta)/\alpha \le 2\overline{u}(1-\beta)\) by (7) and (11).
 2a. SubCase \(\mathsf{b}(u_\lambda) \le \mathsf{b}(\overline{u_\lambda})\), see Fig. 10. We conclude that \(s_\lambda = \overline{u_\lambda}\). It follows from 6/7 ≤ β < 1 that \(s/\left|2z - \delta\right| \ge \overline{u} / (2\overline{u}(1-\beta)) = 1/(2(1-\beta)) \ge 7/2 > 2\), which means that \(s_\lambda = \overline{u_\lambda}\) is periodic. Hence \(\overline{\sigma}\) is in \({{\beta \mathcal {P}}_{\alpha }^{\intercal }(w)}\), a contradiction.
 2b. SubCase \(\mathsf{b}(u_\lambda) > \mathsf{b}(\overline{u_\lambda})\), see Fig. 11. By z ≤ g(1 − β)/α of (11), we get z ≤ u(1 − β) and thus s = u − z ≥ βu. It follows from (10) and (7), and \(2(\sqrt{2}-1) < 6/7 \le \beta < 1\) that \(s/\left|2z - \delta\right| \ge \beta u / (2\overline{g}(1-\beta)/\alpha) \ge \overline{g}\beta^{2} / (2\overline{g}(1-\beta)) = \beta^{2}/(2(1-\beta)) > 2\), which means that \(s_\lambda\) is periodic. Since \(s_\lambda\) is a prefix of \(u_\lambda\) of length s ≥ uβ, σ is in \({{\beta \mathcal {P }}_{\alpha }^{\intercal }(w)}\), a contradiction.
The next lemma follows immediately from Lemmas 7 and 12.
Lemma 13
Given two real numbers α,β with α > 1 and 6/7 ≤ β < 1, the number of all aperiodic α-gapped palindromes \(\left|{{\overline {\beta \mathcal {P}}}_{\alpha }^{\intercal }(w)}\right|\) of a word w of length n is less than 3αn/(1 − β).
Theorem 14
Given a word w of length n and a real number α > 1, the number of all maximal α-gapped palindromes \(\left|{{\mathcal {G }}_{\alpha }^{\intercal }(w)}\right|\) is less than 28αn + 7n.
Proof
By Lemmas 8 and 13, \(\left|{{\mathcal {G}}_{\alpha }^{\intercal }(w)}\right| = \left|{{\beta \mathcal {P}}_{\alpha }^{\intercal }(w)}\right| + \left|{{\overline {\beta \mathcal {P}}}_{\alpha }^{\intercal }(w)}\right| < 2(\alpha + 1)E(w)/\beta + 3\alpha n/(1 - \beta)\) for every 6/7 ≤ β < 1. Applying Lemma 2, the term is upper bounded by 6(α + 1)n/β + 3αn/(1 − β). This number is minimal when β = 6/7, yielding the bound 28αn + 7n. □
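The arithmetic behind the constants in Theorems 11 and 14 can be double-checked with exact rational numbers. The sketch below evaluates the two β-dependent expressions from the proofs; the function names and the sample values of α and n are arbitrary choices made for illustration.

```python
from fractions import Fraction


def repeat_bound(alpha, n, beta):
    """6*alpha*n/beta + 3*alpha*n/(1 - beta), as in the proof of Theorem 11."""
    return 6 * alpha * n / beta + 3 * alpha * n / (1 - beta)


def palindrome_bound(alpha, n, beta):
    """6*(alpha + 1)*n/beta + 3*alpha*n/(1 - beta), as in the proof of Theorem 14."""
    return 6 * (alpha + 1) * n / beta + 3 * alpha * n / (1 - beta)


alpha, n = Fraction(3), 100  # sample values, chosen arbitrarily

# The smallest admissible beta reproduces the stated constants.
print(repeat_bound(alpha, n, Fraction(2, 3)) == 18 * alpha * n)                 # True
print(palindrome_bound(alpha, n, Fraction(6, 7)) == 28 * alpha * n + 7 * n)     # True

# Larger admissible values of beta only make the bounds worse.
for k in range(1, 10):
    beta_r = Fraction(2, 3) + Fraction(k, 30)
    beta_p = Fraction(6, 7) + Fraction(k, 70)
    assert repeat_bound(alpha, n, beta_r) > 18 * alpha * n
    assert palindrome_bound(alpha, n, beta_p) > 28 * alpha * n + 7 * n
```

For β = 6/7 the two summands are 7(α + 1)n and 21αn, which add up to 28αn + 7n.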
3 Finding All Maximal α-gapped Repeats
For the upcoming algorithmic problems, we fix a word w of length n on an alphabet of size \(n^{\mathcal {O}(1)}\). Our computational model is the word RAM model with word size Ω(lg n) (where the function lg is the logarithm to base two). Consequently, each character of w fits into a constant number of memory words.
 (1)
an arm of one character,
 (2)
an arm of length between two and γ lg n characters, and
 (3)
arms longer than γ lg n characters.
As a starter, we can find (1) very easily in our target time of \({\mathcal {O}(\alpha n)}\):
Lemma 15
We can compute all maximal α gapped repeats with an arm of one character in a word w of length n in \({\mathcal {O}(\alpha n)}\) time.
Proof
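For each position i with 1 ≤ i ≤ n, we check whether the characters w[i] and w[i + j] form the arms of a maximal α-gapped repeat, for 1 ≤ j ≤ α. They form an α-gapped repeat if w[i] = w[i + j]. In that case, we try to prolong the arms w[i] and w[i + j] by one character to the left and to the right to check whether the found α-gapped repeat is maximal. Since each of the at most αn pairs (i, j) is tested in constant time, the total running time is \({\mathcal {O}(\alpha n)}\). □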
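The proof above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: positions are 0-based, the function name `one_char_maximal_gapped_repeats` is hypothetical, and maximality is tested by comparing the characters directly next to both arms (a character beyond the border of w counts as a mismatch).

```python
def one_char_maximal_gapped_repeats(w, alpha):
    """Return pairs (i, r) such that w[i] and w[r] are the arms of a
    maximal alpha-gapped repeat with arms of length one (0-based)."""
    n = len(w)
    found = []
    for i in range(n):
        # j = |uv| is the distance between the two arms; |uv| <= alpha * |u| = alpha
        for j in range(1, int(alpha) + 1):
            r = i + j
            if r >= n:
                break
            if w[i] != w[r]:
                continue
            # Assumption: an arm pair is extendable if both neighbours exist and match.
            left_extendable = i > 0 and w[i - 1] == w[r - 1]
            right_extendable = r + 1 < n and w[i + 1] == w[r + 1]
            if not left_extendable and not right_extendable:
                found.append((i, r))
    return found
```

The two nested loops perform at most αn constant-time tests, matching the bound of Lemma 15.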
The main ingredient to our algorithms dealing with (2) and (3) is a data structure for finding maximal equal subwords of a word that start or end at some particular positions.
Lemma 16
There is a data structure on w with the following properties:
(a) it can be built in \({\mathcal {O}}(n)\) time,
(b) it can compute the longest common prefix of two suffixes of w in constant time, and
(c) it can compute the longest common suffix of two prefixes of w in constant time.
Proof
We build the suffix array of w and the longest common prefix (LCP) array of w in \({\mathcal {O}(n)}\) time [15]. Subsequently, we invert the suffix array in \({\mathcal {O}(n)}\) time. Having a range minimum data structure [9] on the LCP array, we can solve (b), since the longest common prefix of two suffixes w[s..] and w[t..] with SA^{−1}[s] < SA^{−1}[t] can be answered with a range minimum query on the LCP array with the range [SA^{−1}[s] + 1..SA^{−1}[t]] in constant time, where SA^{−1} denotes the inverse suffix array. By building the same data structure on the mirror image of w, we solve (c). □
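The construction in this proof can be illustrated as follows. The sketch is not the linear-time construction cited above: as a simplification it sorts the suffixes naively and answers range minimum queries with a sparse table (so preprocessing is slower than \({\mathcal {O}(n)}\)), and the class names `_Lcp` and `LCE` are hypothetical. Positions are 0-based and w is assumed to be non-empty.

```python
class _Lcp:
    """Longest common prefix of two suffixes via suffix array, LCP array and RMQ."""

    def __init__(self, w):
        n = self.n = len(w)
        sa = sorted(range(n), key=lambda i: w[i:])   # naive suffix sorting
        self.rank = rank = [0] * n                   # inverse suffix array
        for r, s in enumerate(sa):
            rank[s] = r
        lcp = [0] * n        # lcp[r] = LCP of the suffixes sa[r - 1] and sa[r]
        h = 0
        for i in range(n):   # Kasai-style computation of the LCP array
            if rank[i] > 0:
                j = sa[rank[i] - 1]
                while i + h < n and j + h < n and w[i + h] == w[j + h]:
                    h += 1
                lcp[rank[i]] = h
                if h > 0:
                    h -= 1
            else:
                h = 0
        self.table = [lcp]   # sparse table: table[L][i] = min(lcp[i .. i + 2^L - 1])
        k = 1
        while 2 * k <= n:
            prev = self.table[-1]
            self.table.append([min(prev[i], prev[i + k]) for i in range(n - 2 * k + 1)])
            k *= 2

    def query(self, s, t):
        if s == t:
            return self.n - s
        a, b = sorted((self.rank[s], self.rank[t]))
        lo, hi = a + 1, b    # range [SA^-1[s] + 1 .. SA^-1[t]] of the LCP array
        level = (hi - lo + 1).bit_length() - 1
        row = self.table[level]
        return min(row[lo], row[hi - (1 << level) + 1])


class LCE:
    """Queries in both directions: prefixes of suffixes and suffixes of prefixes."""

    def __init__(self, w):
        self.n = len(w)
        self.fwd = _Lcp(w)
        self.bwd = _Lcp(w[::-1])   # same structure on the mirror image of w

    def lcp(self, s, t):
        return self.fwd.query(s, t)

    def lcs(self, s, t):
        # longest common suffix of the prefixes w[0..s] and w[0..t]
        return self.bwd.query(self.n - 1 - s, self.n - 1 - t)
```

For example, on w = "abracadabra" the call `LCE(w).lcp(0, 7)` compares the suffixes "abracadabra" and "abra".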
We call the data structure of Lemma 16 an LCE^{⇔} data structure.^{1} Subsequently, we provide some tools that show the usefulness of an LCE^{⇔} data structure for our problem. We start with a lemma that uses an LCE^{⇔} data structure:
Lemma 17
Given a word w of length n, we can preprocess it in \({\mathcal {O}(n)}\) time such that we can return the longest factor with the period p starting at position i in w (for 1 ≤ i,p ≤ n arbitrary), in constant time.
Proof
Once we have produced the LCE^{⇔} data structure of w, we just have to compute the longest common prefix of w[i,n] and w[i + p,n]. If this prefix is w[i + p,ℓ], then w[i,ℓ] is the longest p-periodic factor starting at position i. □
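A small sketch of this reduction follows. It assumes 0-based positions, and the helper `lcp` is a naive character-by-character stand-in for the constant-time query of Lemma 16; both function names are hypothetical.

```python
def lcp(w, s, t):
    """Naive longest common prefix of the suffixes w[s:] and w[t:]."""
    k = 0
    while s + k < len(w) and t + k < len(w) and w[s + k] == w[t + k]:
        k += 1
    return k


def longest_periodic_factor(w, i, p):
    """Longest factor of w starting at position i that has period p."""
    if i + p >= len(w):
        return w[i:]                     # shorter than p + 1: trivially p-periodic
    return w[i:i + p + lcp(w, i, i + p)]
```

The factor ends exactly where the common prefix of w[i:] and w[i + p:] ends, shifted by p, which mirrors the statement "if this prefix is w[i + p,ℓ], then w[i,ℓ] is the answer".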
• On the one hand, we have the so-called single occurrences. If y is aperiodic, then all its occurrences in z are single occurrences; there are \({\mathcal {O}(\ell )}\) such occurrences [16]. If y is periodic with smallest period p, then the subword z[i,i + |y| − 1] is a single occurrence if y occurs neither at position i − p nor at position i + p in z.
• On the other hand, we say that an occurrence of y is within a run if the occurrence is contained in a run whose period is equal to the smallest period of y. Let z[i,i + |y| − 1] be an occurrence of y. If there is another occurrence of y that shares at least |y|/2 positions with z[i,i + |y| − 1], then both occurrences are occurrences of y within a run. Conversely, given a run r with period p and an occurrence z[i,i + |y| − 1] of y, then z[i,i + |y| − 1] is an occurrence within the run r if y occurs either at i − p or at i + p. Additionally, we say that z[i,i + |y| − 1] is the first occurrence of y within a run of period p if y does not occur at i − p but occurs at i + p. By Lemma 4, there are \({\mathcal {O}(\ell )}\) runs containing occurrences of y in z, i.e., \({\mathcal {O}(\ell )}\) first occurrences of y in a run in z.
Corollary 18
Given a subword y of w and a subword z of w with length ℓ|y|, the occurrences of y in z can be represented succinctly in \({\mathcal {O}(\ell )}\) words.
Proof
We only store the starting position of the single and first occurrences, and the period of y. This is sufficient, since we can reconstruct the missing information in constant time due to the LCE^{⇔} data structure.
Having the starting position of the first occurrence of y in a run r we can compute all further occurrences within the run r by an arithmetic progression with the difference equal to the period of y. The number of occurrences within this run can be determined in constant time due to Lemma 17. □
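The representation of Corollary 18 can be illustrated with the following sketch. It is a simplification: occurrences are found naively instead of via the LCE^{⇔} data structure, the length of each progression is counted explicitly instead of being derived from Lemma 17, and the progression rule is applied regardless of whether y is periodic. All function names are hypothetical and positions are 0-based.

```python
def smallest_period(y):
    """Smallest period of a non-empty word y, via the border (failure) array."""
    fail = [0] * len(y)
    k = 0
    for i in range(1, len(y)):
        while k and y[i] != y[k]:
            k = fail[k - 1]
        if y[i] == y[k]:
            k += 1
        fail[i] = k
    return len(y) - fail[-1]


def occurrences(y, z):
    """All starting positions of y in z (naive scan)."""
    return [i for i in range(len(z) - len(y) + 1) if z.startswith(y, i)]


def compress(y, z):
    """Store only single and first occurrences, each with the number of
    occurrences in its arithmetic progression of difference p."""
    p = smallest_period(y)
    occ = set(occurrences(y, z))
    rep = []
    for i in sorted(occ):
        if i - p in occ:
            continue                 # not the first occurrence within its run
        count = 1
        while i + count * p in occ:
            count += 1
        rep.append((i, count))       # count == 1 marks a single occurrence
    return p, rep


def expand(p, rep):
    """Reconstruct all starting positions from the succinct representation."""
    return [start + t * p for start, count in rep for t in range(count)]
```

For y = "ab" and z = "abababxab", `compress` keeps the first occurrence 0 of the run together with its three members and the single occurrence 7.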
In our approach, we restrict y to be a factor of length 2^{ k } for an integer k ≥ 1; a factor is called a basic factor if its length is equal to a power of two. We can find the occurrences of a basic factor y in a subword z of length ℓ|y| efficiently due to the following lemma:
Lemma 19 ([4], as well as [10, 16] and the references therein)
For each basic factor y of w , we compute an array containing the starting positions of the occurrences of y in ascending order. Computing the arrays can be done in \({\mathcal {O}(n {\lg } n)}\) time.

Given such an array A and an integer j, we want to
• retrieve the largest index i with A[i] ≤ j (predecessor query),
• retrieve the smallest index i with A[i] ≥ j (successor query), and
• conduct both above operations in \({\mathcal {O}({\lg } {\lg } \left| A\right| )}\) time.
Lemma 20 ([21, Observation 2.1])
Given a sorted array of length n storing integers (each represented by lg n bits), we can build a data structure in \({\mathcal {O}(n)}\) time that answers predecessor and successor queries in \({\mathcal {O}({\lg } {\lg } n)}\) time.
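To make the two query types concrete, the following sketch answers them by plain binary search with Python's `bisect` module. This swaps in a simpler technique: it takes \({\mathcal {O}({\lg } n)}\) time per query and is not the data structure of [21]. The function names are hypothetical, indices are 0-based, and `None` signals that no suitable index exists.

```python
from bisect import bisect_left, bisect_right


def predecessor(A, j):
    """Largest index i with A[i] <= j in the sorted array A, or None."""
    i = bisect_right(A, j) - 1
    return i if i >= 0 else None


def successor(A, j):
    """Smallest index i with A[i] >= j in the sorted array A, or None."""
    i = bisect_left(A, j)
    return i if i < len(A) else None
```

With A storing the sorted starting positions of a basic factor, `successor(A, j)` and `predecessor(A, j2)` delimit the occurrences that start inside the window [j, j2].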
Corollary 21
Given a word w of length n and an integer ℓ ≥ 2, we can preprocess w in \({\mathcal {O}(n{\lg } n)}\) time such that given a basic factor y := w[i,i + 2^{ k } − 1] and a subword z := w[j,j + ℓ2^{ k } − 1] of w with k ≥ 0, we can find the occurrences of y in z in \({\mathcal {O}({\lg } {\lg } n + \ell )}\) time, and compute their representation as described in Corollary 18.
Proof
As a preprocessing step, we construct the arrays of Lemma 19 in \({\mathcal {O}(n {\lg } n)}\) time. On each such array, we construct the data structure of Lemma 20.
Assume that we are given a basic factor y and a subword z, and that we want to compute the representation of Corollary 18. In order to find the occurrences of y in z, we search the successor of j and the predecessor of j + ℓ2^{ k } − 1 in the array A storing the starting positions of the occurrences of y. Both retrieved values define the range in A where all occurrences of y in z are contained. Within this range, we can compute the representation of Corollary 18 in \({\mathcal {O}(\ell )}\) time: We linearly process the occurrences in this range from left to right. We return the beginning position of every single occurrence and of every first occurrence. In order to get \({\mathcal {O}(\ell )}\) running time, we have to omit all occurrences within a run except the first occurrence. To this end, we perform the following procedure using a constant number of longest common prefix queries: Since we scan linearly from left to right, we always access the first two (consecutive) occurrences in the run. First, we compute the distance between the starting positions of both occurrences (i.e., |y| minus the length of their overlap); this distance is the period of y. Subsequently, we determine the length of the run by Lemma 17. Having this length, we skip the remaining occurrences of y in this run (since every occurrence of y is stored in A, and we know the run’s length and period, we know how many occurrences we have to omit). If the next occurrence of y (starting after this run) starts after the computed predecessor of j + ℓ2^{ k } − 1, then we terminate.
Since there are at most \({\mathcal {O}(\ell )}\) runs and single occurrences of y in z, the conclusion follows. □
Lemma 22 ([10])
Given a word x and an integer β > γ such that |x| = β lg n, we can process x in \({\mathcal {O}(\beta {\lg } n)}\) time such that given a basic factor y = x[i2^{ k } + 1,(i + 1)2^{ k }] with i,k ≥ 0 and i2^{ k } + 1 > (β − γ)lg n, we can compute a bit vector of length β lg n marking the beginning positions of the occurrences of y in x. The computation of the bit vector takes \({\mathcal {O}(\beta )}\) time.
Lemma 23
Let y and x be defined as in Lemma 22, and let z be a subword of x with length ℓ|y|. Given the bit vector of Lemma 22 marking the starting positions of all occurrences of y in x, we can represent all occurrences of y in z by the representation described in Corollary 18 in \({\mathcal {O}(\ell )}\) time.
Proof
We assume that our RAM model supports retrieving the location of the most-significant set bit in the binary representation of an integer in constant time.^{2} Otherwise, as a preliminary step, we store the mapping i ↦ ⌊lg i⌋ + 1 for every integer i with 1 < i < n in a lookup table, in \({\mathcal {O}(n)}\) time.
By being able to find the location of the most-significant set bit of an integer in constant time, we can output all occurrences of y in z in \({\mathcal {O}(\ell )}\) time. To this end, we scan the bit vector of Lemma 22 in chunks of lg n bits. By skipping all chunks that represent positions of x before z, we only process chunks representing positions of z. Given such a chunk, we retrieve the position of the most-significant set bit. This bit represents an occurrence of y that can be retrieved by standard bit operations. We erase this bit and start to query for the location of the new most-significant set bit. If there is no bit set in the current chunk, we fetch the next chunk.
In order to get the representation of Corollary 18, we need to handle occurrences within a run analogously to the proof of Corollary 21. □
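The chunk-wise scan of the bit vector can be sketched as follows. The sketch assumes that the bit vector is given as a list of integers, each holding `width` bits, and that the most-significant bit of a chunk stands for its leftmost position; `int.bit_length` plays the role of the constant-time most-significant-bit instruction. The function name is hypothetical.

```python
def set_bit_positions(chunks, width):
    """Report the positions marked in a bit vector stored as integer chunks."""
    positions = []
    for c, chunk in enumerate(chunks):
        while chunk:                       # some bit is still set in this chunk
            msb = chunk.bit_length() - 1   # index of the most-significant set bit
            positions.append(c * width + (width - 1 - msb))
            chunk &= ~(1 << msb)           # erase the reported bit
    return positions
```

Each loop iteration reports one occurrence, so the running time is proportional to the number of chunks plus the number of set bits.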
Lemma 24
Given a word w and α ≥ 1, we can find all maximal α-gapped repeats u _{ λ },v,u _{ ρ } with 1 < |u _{ ρ }| ≤ γ lg n occurring in w, in \({\mathcal {O}(\alpha n)}\) time.
Proof
An overview of our algorithm follows (see also Algorithm 1): As a preprocessing step, we equip every superblock with the data structure described in Lemma 22, and create an LCE^{⇔} data structure on it. For the actual search, we process each superblock linearly. In each superblock, we search for all maximal αgapped repeats u _{ λ },v,u _{ ρ } that are contained in x _{ m } with u _{ ρ } contained in the suffix of length γ lgn of x _{ m }. In order to spot the right arm u _{ ρ } of a possible gapped repeat, we have to iterate over all possible lengths. Since a linear scan over all lengths would take too much time, we first compute a gapped repeat whose right arm is a basic factor, and then try to extend such a gapped repeat to a maximal αgapped repeat. To this end, we iterate over 0 ≤ k ≤lg(γ lgn) to find gapped repeats with an arm length between 2^{ k+1} and 2^{ k+2} by searching for gapped repeats whose right arms are basic factors of length 2^{ k } contained in the last γ lgn characters of x _{ m } (since we do not allow the overlapping of those right arms, their number is at most \(\gamma \frac {{\lg } n}{2^{k}}\)).
Assume that we identified the copy y _{ λ } := w[ℓ + 1,ℓ + |y _{ ρ }|] for an integer ℓ with 0 ≤ ℓ < n; we try to build u _{ λ } and u _{ ρ } by extending y _{ λ } and y _{ ρ } in both directions, respectively. To this end, we compute the longest factor p of x _{ m } that ends both at j2^{ k } and at ℓ, and the longest factor s that starts both at (j + 1)2^{ k } + 1 and at ℓ + |y _{ ρ }| + 1. If ℓ + |y _{ ρ }| + |s| > j2^{ k } − |p|, then y _{ λ } and y _{ ρ } do not determine a maximal repeat (the gap would have a negative length). Otherwise (ℓ + |y _{ ρ }| + |s| ≤ j2^{ k } − |p|), let s _{ λ } and s _{ ρ } denote the left and right occurrences of s, and let p _{ λ } and p _{ ρ } denote the left and right occurrences of p, respectively. Then u _{ λ } is obtained by concatenating p _{ λ }, x _{ m }[ℓ + 1,ℓ + |y|], and s _{ λ }, while u _{ ρ } is obtained by concatenating p _{ ρ }, x _{ m }[j2^{ k } + 1,(j + 1)2^{ k }], and s _{ ρ }. To avoid duplicates, the determined repeat is only reported if its right arm contains the position j2^{ k } + 1 of x _{ m } within its first 2^{ k } positions.
The algorithm above does not describe how to find the copy y _{ λ } (efficiently). We rectify this omission now: Since |u _{ ρ }| < 2^{ k+2} and |y _{ ρ }| = 2^{ k }, the copy y _{ λ } is contained in the subword of x _{ m } of length α2^{ k+2} ending at position j2^{ k }. In our preprocessing, we already equipped x _{ m } with the data structure from Lemma 22. We use this data structure as described in Lemma 23: It allows us to retrieve every possible subword y _{ λ } inside the subword of length α2^{ k+2} ending at position j2^{ k }, in \({\mathcal {O}(\alpha )}\) time. These occurrences can be single occurrences and occurrences within runs. There are \({\mathcal {O}(\alpha )}\) single occurrences, and we can process each of them individually to find the maximal α-gapped repeat that is determined by y _{ ρ } and this occurrence.
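The extension step for a single occurrence can be sketched as follows. This is an illustration under assumptions: positions are 0-based, the two copies of the seed start at `a` and `b` with a + m ≤ b, the extensions are computed by naive character comparisons instead of the constant-time queries of Lemma 16, and the duplicate-avoidance check on the position of the seed is omitted. The function name `extend_seed` is hypothetical.

```python
def extend_seed(w, a, b, m, alpha):
    """Extend equal factors w[a:a+m] and w[b:b+m] (with a + m <= b) to the arms
    of a maximal gapped repeat. Returns (left_start, right_start, arm_length),
    or None if the arms would overlap or the repeat is not alpha-gapped."""
    n = len(w)
    p = 0   # length of the common extension to the left
    while a - p - 1 >= 0 and w[a - p - 1] == w[b - p - 1]:
        p += 1
    s = 0   # length of the common extension to the right
    while b + m + s < n and w[a + m + s] == w[b + m + s]:
        s += 1
    left_start, left_end = a - p, a + m + s    # left arm is w[left_start:left_end]
    right_start = b - p
    if left_end > right_start:
        return None                            # the gap would have negative length
    arm = left_end - left_start
    if right_start - left_start > alpha * arm:
        return None                            # violates |uv| <= alpha * |u|
    return left_start, right_start, arm
```

Because both extensions are maximal, the returned arms cannot be prolonged simultaneously in either direction.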
(a–c) Assume u _{ ρ } starts within r _{ ρ }, but after r _{ ρ }’s first position (e(r _{ ρ }) > b(u _{ ρ }) > b(r _{ ρ })). Then u _{ λ } starts at the first position of r _{ λ } (otherwise, we could extend both arms to the left, a contradiction to the maximality of the repeat).
  (a) If u _{ ρ } ends at a position to the right of r _{ ρ }, then u _{ λ } ends at a position to the right of r _{ λ } (otherwise, this would again contradict the maximality). Moreover, the suffix of u _{ λ } occurring after the end of r _{ λ } and the suffix of u _{ ρ } occurring after the end of r _{ ρ } are equal to the longest equal substring starting at positions e(r _{ λ }) + 1 and e(r _{ ρ }) + 1, and can be computed by a longest common prefix query on x _{ m }.
  (b) If u _{ ρ } ends exactly at the same position as r _{ ρ } (e(u _{ ρ }) = e(r _{ ρ })), then u _{ ρ } is periodic with the same period p as r _{ ρ }. We compute the longest p-periodic prefix u ^{′} of r _{ λ } that is a suffix of r _{ ρ }. By knowing the period p (determined by two subsequent occurrences of y _{ ρ }) and the length of r _{ λ } and r _{ ρ }, the factor u ^{′} can be determined in constant time.
  Since u _{ λ } is longer than p, the α-gapped repeats under consideration have the left arm u _{ λ } := r _{ λ }[1,|u ^{′}| − pi] and the right arm u _{ ρ } := r _{ ρ }[|r _{ ρ }| − (|u ^{′}| − pi) + 1,|r _{ ρ }|] for i ≥ 0 such that the gap v := w[e(u _{ λ }) + 1,b(u _{ ρ }) − 1] respects the condition |u _{ λ } v| ≤ α|u _{ λ }|.
  (c) The final case is when u _{ ρ } ends at a position of r _{ ρ }, prior to r _{ ρ }’s last position (e(u _{ ρ }) < e(r _{ ρ })). In that case, we get that u _{ λ } = r _{ λ } (otherwise, we could extend both arms to the right). The left arm u _{ λ } is equal to a factor z ^{ h } z ^{′} for an integer h ≥ 2, where z = r _{ λ }[1,p], p is the period of r _{ λ }, and z ^{′} is a prefix of z.
  We can get the position of the first and the last occurrence of z in r _{ ρ }. If the first occurrence starts at ℓ ^{′}, then the starting positions of the succeeding occurrences of z form the arithmetic progression ℓ ^{′}, ℓ ^{′} + p,…,ℓ ^{′} + tp for an integer t ≥ 1. For each 0 ≤ i ≤ t, we let u _{ ρ } start at position ℓ ^{′} + ip (and check whether u _{ ρ } ≡ u _{ λ } by knowing the length of u _{ ρ } and r _{ ρ }).
  Finally, additional care has to be taken for the border cases. If u _{ ρ } is a suffix of r _{ λ }, we have to check that we cannot extend simultaneously u _{ ρ } and u _{ λ } to the right. (u _{ ρ } cannot be a prefix of r _{ ρ } since we assumed for the cases (a–c) that b(u _{ ρ }) > b(r _{ ρ }).)
(d–f) Assume u _{ ρ } starts at the first position of r _{ ρ } (b(u _{ ρ }) = b(r _{ ρ })).
  (d) If u _{ ρ } ends at a position inside r _{ ρ }, prior to its last position (b(r _{ ρ }) = e(u _{ ρ }) < e(r _{ ρ })), then u _{ λ } ends at the last position of r _{ λ } (otherwise, both arms could be extended to the right). This means that the gap between the two arms is uniquely determined, and that the arms are periodic for a period p. We compute the longest p-periodic suffix of r _{ λ } that is a prefix of r _{ ρ }, and check whether the two occurrences of this factor determine a maximal α-gapped repeat.
  (e) If u _{ ρ } ends at the last position of r _{ ρ }, then we know the exact location of u _{ ρ }. We can proceed analogously to case (c) by symmetry.
  (f) Finally, if u _{ ρ } ends at a position to the right of r _{ ρ }, then u _{ λ } also ends after r _{ λ } (e(u _{ ρ }) > e(r _{ ρ }) ∧ e(u _{ λ }) > e(r _{ λ })), and the suffix of u _{ λ } occurring after r _{ λ } is equal to the suffix of u _{ ρ } occurring after r _{ ρ }. We determine this suffix by a longest common prefix query on x _{ m }. With this suffix, we obtain the location of both arms.
(g–h) The last case is when u _{ ρ } starts at a position to the left of r _{ ρ } (b(u _{ ρ }) < b(r _{ ρ })). Then u _{ λ } starts at a position before the first position of r _{ λ } (b(u _{ λ }) < b(r _{ λ }) < e(u _{ λ })); the prefix of u _{ ρ } occurring before the beginning of r _{ ρ } is equal to the prefix of u _{ λ } occurring before r _{ λ }. The length of these prefixes can be retrieved with a longest common suffix query.
  (g) If e(u _{ ρ }) ≤ e(r _{ ρ }), then e(u _{ λ }) ≤ e(r _{ λ }), and we are done.
  (h) Otherwise (e(u _{ ρ }) > e(r _{ ρ })), u _{ λ } and u _{ ρ } contain r _{ λ } and r _{ ρ }, respectively. We determine the end of u _{ λ } by computing the longest common substring starting at e(r _{ λ }) + 1 and e(r _{ ρ }) + 1.

For the arms constructed in each of the cases above, we check whether
• the arms form a (valid) maximal α-gapped repeat,
• their length is between 2^{ k+1} and 2^{ k+2}, and whether
• the right arm contains position j2^{ k } + 1 of x _{ m } within its first 2^{ k } positions.

for the cases (b) and (c) we check that the right arm contains y _{ ρ }[1] in its first 2^{ k } positions.
This concludes our analysis for finding all αgapped repeats of x _{ m }, for each m separately. We can ensure that our algorithm finds and outputs each maximal repeat exactly once when moving from x _{ m } to x _{ m+1}. To this end, we check that the right arm of each repeat we find is not completely contained in x _{ m } (so it is already found). This condition can be easily imposed in our search: when constructing the arms that are determined by a single occurrence of y _{ ρ }, we check the containment condition separately; when constructing the arms determined by a run of y _{ ρ }occurrences, we have to impose the condition that the right arm extends out of x _{ m } when searching the starting positions of the possible arms.

• w ^{′} is a word of length n/lg n on the alphabet {1,…,n}, and
• w ^{′}[i] = j (1 ≤ i ≤ n/lg n) if and only if the block of w with number j is equal (in the sense of ≡) to w[1 + (i − 1)lg n,i lg n].
Lemma 25
We can build the block representation w ^{′} of a word w of length n in \({\mathcal {O}}(n)\) time.
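A block representation can be illustrated with the sketch below. It makes several assumptions that are not taken from the text: distinct blocks are numbered from 1 in the order of their first appearance, a hash table (Python `dict`) is used instead of a linear-time naming technique, and the block length is passed explicitly as `block_len` rather than fixed to lg n. The function name is hypothetical.

```python
def block_representation(w, block_len):
    """Replace every block of block_len characters of w by a number such that
    equal blocks receive equal numbers (assumes block_len divides len(w))."""
    ids = {}      # block content -> number of the block
    rep = []
    for start in range(0, len(w), block_len):
        block = w[start:start + block_len]
        # a block seen for the first time gets the next unused number
        rep.append(ids.setdefault(block, len(ids) + 1))
    return rep
```

For instance, with blocks of length two, the word "abababcdab" is mapped to a word of five numbers in which the block "ab" always receives the same number.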
Equipped with Lemma 25, we are ready to present the algorithm finding long-armed maximal α-gapped repeats with large values for α:
Lemma 26
Given a word w of length n, and an α ≥ lg n, we can find all maximal α-gapped repeats u _{ λ },v,u _{ ρ } with |u _{ ρ }| > γ lg n occurring in w, in \({\mathcal {O}(\alpha n)}\) time.
Proof
The general approach in proving this lemma is similar to the techniques of the proof of Lemma 24. Essentially, when identifying a new maximal αgapped repeat, we try to fix the place and length of the right arm u _{ ρ } of the respective repeat, which restricts the place where the left arm u _{ λ } occurs. This allows us to fix a long enough subword of w as being part of the right arm, detect its occurrences that are possibly contained in the left arm, and, finally, to efficiently identify the actual repeat. The main difference is that we cannot use the result of Lemma 22, as we have to deal with repeats with arms longer than γ lgn. Instead, we use the structures constructed in Corollary 21. However, to get the stated complexity, we apply this lemma to the blockrepresentation of w, rather than to w itself.
In this sense, the first step is to construct the blockrepresentation w ^{′} of w. Subsequently, we construct the LCE^{⇔} data structures of w and w ^{′}, as well as the data structure of Corollary 21 for the word w ^{′}. Every construction step is conducted in \({\mathcal {O}}(n)\) time.
Like in the proof of Lemma 24, we iterate over all possible arm lengths. For an integer k, we search for all maximal α-gapped repeats u _{ λ },v,u _{ ρ } in w with 2^{ k+1}lg n ≤ |u _{ λ }| ≤ 2^{ k+2}lg n.
For the following, we fix k. Similar to the block representation, we partition the word w into subwords, but this time into subwords of length 2^{ k }lg n, called k-blocks. Again (as for blocks), we assume that each k-block has the same number of characters.
Let y denote the factor of y _{ ρ }. By binary searching the suffix array of w ^{′} (using longest common prefix queries on w to compare the factors of lg n characters of y and the blocks of w ^{′}, at each step of the search) we try to detect a factor of w ^{′} that encodes a word equal to y. Assume that we can find such a sequence y ^{′} of blocks in w ^{′} (otherwise, y cannot correspond to a sequence of blocks from u _{ λ }, so we choose a new y _{ ρ } by taking the next starting position). By Corollary 21, we can spot the occurrences of y ^{′} in the α2^{ k+2} blocks of w ^{′} that occur before the blocks of z, in \({\mathcal {O}}({\lg }{\lg }\left| w^{\prime }\right| +\alpha )\) time; this range corresponds to an interval of w with a length of α2^{ k+2}lg n.
Each of the occurrences of y ^{′} fixes a possible left arm u _{ λ }; this arm, together with the corresponding arm u _{ ρ } can be constructed with the same techniques as in Lemma 24. In the case of a single occurrence u _{ λ } (there are at most \({\mathcal {O}}(\alpha )\) many of that kind), we extend u _{ λ } and u _{ ρ } in both directions to obtain two arms, for which we have to check if they define a valid αgapped repeat. In order to avoid duplicates, we check that the length of each arm is between 2^{ k+1} and 2^{ k+2}, and that z is the first kblock of the right arm.
Complications occur when some occurrences of y ^{′} are within a run. Given a run of occurrences of y ^{′}, we cannot determine the period of y in general, but a multiple of this period. More precisely, we know that the period of y is a multiple of the block length lgn. However, this is not a problem, since the subword y in u _{ ρ } corresponds to a block sequence from u _{ λ }, hence definitely to one of the subwords encoded in the run of occurrences of y ^{′}. Analogously to the analysis in Lemma 24, we can determine the maximal factor containing y such that it has the same period as the repetition of y ^{′}occurrences (with the period measured in w).

• Assume that there are two factors \(y^{\prime }_{1}\) and \(y^{\prime }_{2}\) of w ^{′} that correspond to two separate factors y _{1} and y _{2}, each of length 2^{ k−1}lg n, occurring in the first lg n characters of z. Since \(y^{\prime }_{1}\) and \(y^{\prime }_{2}\) cannot define the same repeat, the distance between \(y^{\prime }_{1}\) and \(y^{\prime }_{2}\) is at least one block long, i.e., the distance between y _{1} and y _{2} is at least lg n, a contradiction.
• Similarly, if we have found a subword y occurring in the first lg n characters of a k-block z such that y determines an α-gapped maximal repeat, then the same maximal repeat will not be determined by a subword of another k-block, since z is the first k-block of u _{ ρ }.
(a) We iterate over all \(0 \le k \le {\lg } \frac {n}{{\lg } n} - 2\).
(b) For a fixed k, we examine every k-block z, and there are \(\frac {n}{2^{k} {\lg } n}\) many.
(c) For a fixed z, we analyze each subword y _{ ρ } of length 2^{ k−1}lg n starting within the first lg n positions of the chosen k-block z.
(d) For each such subword y _{ ρ } we find the occurrences of the block encoding the occurrence of y _{ ρ } in u _{ ρ } in \(\mathcal {O}\left ({{\lg } \frac {n}{{\lg } n} + {\lg }{\lg } n+ \alpha }\right )\) time.
(e) For each of the \({\mathcal {O}}(\alpha )\) single occurrences u _{ λ }, we check whether it is possible to extend y _{ λ } and y _{ ρ } to a maximal α-gapped repeat in \(\mathcal {O}({1})\) time. We also have \(\mathcal {O}({\alpha })\) occurrences of the block encoding u _{ ρ } in runs; all of them are processed in \(\mathcal {O}({\alpha + \text {occ}_{z,y}})\) time overall, where occ_{ z,y } is the number of maximal α-gapped repeats we find for a given z and y.
Lemma 27
Given a word w of length n, and an α < lgn , we can find all maximal α gapped repeats u _{ λ },v,u _{ ρ } with u _{ ρ } > γ lgn occurring in w, in \({\mathcal {O}(\alpha n)}\) time.
Proof
Initially, we run the algorithm of Lemma 26 only for k > lg lg n to find all maximal α-gapped repeats with an arm length of at least 2^{lg lg n}lg n. We see that (12) with k > lg lg n yields \({\mathcal {O}(\alpha n)}\) time.
In the rest of this proof, we search for maximal α-gapped repeats whose arms’ length is upper bounded by 2^{lg lg n+1}lg n = 2(lg n)^{2}. Setting ℓ := α ⋅ 2(lg n)^{2} + 2(lg n)^{2} = 2(α + 1)(lg n)^{2}, the length of each of those gapped repeats is at most ℓ. If we cover w with the set of subwords {w[1 + mℓ,(m + 2)ℓ] : 0 ≤ m ≤ n/ℓ − 2}, then such an α-gapped repeat is contained in (at least) one subword of this cover.
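The covering argument can be checked with a short sketch. It uses 0-based half-open intervals, and, as an assumption for the case that ℓ does not divide n, it rounds the number of windows up and clips the last window at the end of the word; the function name `cover` is hypothetical.

```python
def cover(n, ell):
    """Windows [m * ell, (m + 2) * ell) covering positions 0 .. n - 1, such that
    every interval of length at most ell lies completely inside one window."""
    blocks = -(-n // ell)              # ceil(n / ell)
    count = max(1, blocks - 1)         # at least one window, even for short words
    return [(m * ell, min(n, (m + 2) * ell)) for m in range(count)]
```

Consecutive windows overlap by ℓ positions, which is why a factor of length at most ℓ starting in the m-th block of length ℓ cannot leave the m-th window.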
In this sense, we can apply the algorithm of Lemma 26 to each subword in the cover (iterating over all m) in order to detect all maximal α-gapped repeats with an arm length of at least 2^{lg lg(2ℓ)+1}lg(2ℓ) contained completely in each subword of the cover. Equation (12) with k ≥ lg lg(2ℓ) gives \({\mathcal {O}}(\alpha \ell + \text {occ}_{m})\) running time for the algorithm running on a subword of the cover (each is of length 2ℓ), where occ_{ m } is the number of occurrences of all maximal α-gapped repeats with the above described arm length in the m-th subword of the cover. Summing over all subwords of the cover, we get \({\mathcal {O}(\alpha n)}\) time in total. By knowing the overlap of two subsequent subwords of the cover, it is easy to adapt the algorithm of Lemma 26 in such a way that no gapped repeat is reported twice.
It is left to find all maximal αgapped repeats with an arm length smaller than 2^{lglg(2ℓ)+1}lg(2ℓ). For n large enough, it holds that 2^{lglg(2ℓ)+1}lg(2ℓ) ≤ γ lgn, since α ≤lgn. But those maximal αgapped repeats are already found by the algorithm of Lemma 24 running in \({\mathcal {O}(\alpha n)}\) time. □
Putting the results of Lemmas 15, 24, 26 and 27 together, we get the following theorem.
Theorem 28
Given a word w and an α ≥ 1, we can compute \({\mathcal {G }_{\alpha }(w)}\) in \({\mathcal {O}(\alpha n)}\) time.
Analogously, we can compute \({{\mathcal {G}}_{\alpha }^{\intercal }(w)}\), generalizing the algorithm of [18]:
Corollary 29
Given a word w and α ≥ 1, we can compute \({{\mathcal {G}}_{\alpha }^{\intercal }(w)}\) in \({\mathcal {O}(\alpha n)}\) time.
Proof
We construct the LCE^{⇔} data structure of \(w {{w}^{{\intercal }}}\) to test in constant time whether a factor \({{w[i,j]}^{{\intercal }}}\) occurs at a position in w. On searching the α-gapped palindromes u _{ λ },v,u _{ ρ } (with \({u}_{\rho } \equiv {{{u}_{\lambda }}^{{\intercal }}}\)), we split w into blocks and k-blocks (like in Lemma 26) for each k ≤ lg |w|, to check whether there exists a gapped palindrome u _{ λ },v,u _{ ρ } with 2^{ k } ≤ |u _{ λ }| ≤ 2^{ k+1}. This search is conducted analogously to the case of gapped repeats, with the difference that when fixing the occurrence of a factor y in u _{ ρ }, we have to look for the occurrences of \({{{y}}^{{\intercal }}}\) in the subword of length \({\mathcal {O}}(\alpha \left| {u}_{\rho }\right| )\) preceding it; the LCE^{⇔} data structure of \(w {{w}^{{\intercal }}}\) is useful for this task, since it allows us to search the mirror images of factors of w inside w in constant time. □
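The test on \(w {{w}^{{\intercal }}}\) mentioned at the beginning of this proof can be sketched as follows. The sketch assumes 0-based positions with an inclusive factor w[i..j], and it replaces the constant-time longest common prefix query by a naive comparison; the function names are hypothetical.

```python
def common_prefix(x, s, t):
    """Naive longest common prefix of the suffixes x[s:] and x[t:]."""
    k = 0
    while s + k < len(x) and t + k < len(x) and x[s + k] == x[t + k]:
        k += 1
    return k


def reverse_occurs_at(w, i, j, p):
    """Check whether the mirror image of w[i..j] occurs at position p of w."""
    n = len(w)
    length = j - i + 1
    if p + length > n:
        return False
    doubled = w + w[::-1]
    # In doubled, the mirror image of w[i..j] starts at position 2n - 1 - j.
    return common_prefix(doubled, p, 2 * n - 1 - j) >= length
```

A single longest common prefix query between a position of w and a position of the reversed copy therefore decides the test.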
4 Conclusion
We presented two major achievements that shed more light on the combinatorial and computational aspects of αgapped repeats. First, we succeeded in giving concrete bounds for the maximum number of maximal αgapped repeats and maximal αgapped palindromes of a word. Second, we elaborated two algorithms computing the set of all maximal αgapped repeats and the set of all maximal αgapped palindromes, respectively, of a word of length n on an integer alphabet. The achieved combinatorial bounds and the time bounds of the algorithms are asymptotically optimal.
Nevertheless, we deliberately omitted the exact memory consumption of the created data structures (currently \({\mathcal {O}}(n)\) words). With a more careful analysis of the space, we could give more precise bounds (e.g., measured in bits) for the selected data structures, perhaps yielding an algorithm working in succinct space. It is also interesting to further refine both algorithms to such an extent that their running time is output sensitive, i.e., having \({\mathcal {O}}(n + \left| {\mathcal {G }_{\alpha }(w)}\right| )\) and \({\mathcal {O}}(n + \left| {{\mathcal {G }}_{\alpha }^{\intercal }(w)}\right| )\) worst case running time, respectively, for a word w. Additionally, we think that our result can serve as a basis for practical solutions, since most of the used data structures are well studied. In this sense, we also want to get better constants for the combinatorial bounds. The current constants seem unreasonably large. We think that a more precise analysis allows us to shrink the constants.
The presented bounds are still valid when working with the more general definition of α-gapped φ-repeats or α-gapped φ-palindromes: Let φ : Σ^{∗}→ Σ^{∗} be a word isomorphism, i.e., φ(uv) = φ(u)φ(v), and φ is bijective. For instance, id(v) = v is a word isomorphism. A subword of the form uvφ(u) (\(uv{{\varphi (u)}^{{\intercal }}}\)) is called an α-gapped φ-repeat (φ-palindrome) iff uvu (\(uv{{u}^{{\intercal }}}\)) is an α-gapped repeat (palindrome). It is easy to see that our results are also applicable for α-gapped φ-repeats or α-gapped φ-palindromes. This generalizes the analysis in [18, Sec. 5]; there, φ is equal to a function building the base complements of a DNA string. The problem of enumerating all 1-gapped φ-repeats or all 1-gapped φ-palindromes was already investigated in [11, 12].
Footnotes
1. Our abbreviation for a data structure answering longest common extension queries in both directions.
2. Commodity computers of the x86 family have an extension instruction set that provides access to the instructions leading zeros count (returning the number of leading zeros of the binary representation) and bit scan reverse (returning the index of the most-significant set bit).
Notes
Acknowledgements
We thank the anonymous reviewers for their careful reading of our manuscript and their insightful comments and suggestions.
The work of Florin Manea was supported by the DFG grant 596676.
References
1. Badkobeh, G., Crochemore, M.: Computing maximal-exponent factors in an overlap-free word. J. Comput. Syst. Sci. 82(3), 477–487 (2016)
2. Bannai, H., Tomohiro, I., Inenaga, S., Nakashima, Y., Takeda, M., Tsuruta, K.: The runs theorem. arXiv:1406.0263 (2014)
3. Brodal, G.S., Lyngsø, R.B., Pedersen, C.N.S., Stoye, J.: Finding maximal pairs with bounded gap. Proc. CPM, volume 1645 of LNCS, pp. 134–149 (1999)
4. Crochemore, M., Rytter, W.: Usefulness of the Karp-Miller-Rosenberg algorithm in parallel computations on strings and arrays. Theor. Comput. Sci. 88(1), 59–82 (1991)
5. Crochemore, M., Tischler, G.: Computing longest previous non-overlapping factors. Inf. Process. Lett. 111(6), 291–295 (2011)
6. Crochemore, M., Iliopoulos, C.S., Kubica, M., Rytter, W., Walen, T.: Efficient algorithms for two extensions of LPF table: the power of suffix arrays. Proc. SOFSEM, volume 5901 of LNCS, pp. 296–307 (2010)
7. Crochemore, M., Kolpakov, R., Kucherov, G.: Optimal bounds for computing α-gapped repeats. Proc. LATA, pp. 245–255 (2016)
8. Dumitran, M., Manea, F.: Longest gapped repeats and palindromes. Proc. MFCS, volume 9234 of LNCS, pp. 205–217 (2015)
9. Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)
10. Gawrychowski, P., Manea, F.: Longest α-gapped repeat and palindrome. Proc. FCT, volume 9210 of LNCS, pp. 27–40 (2015)
11. Gawrychowski, P., Manea, F., Mercas, R., Nowotka, D., Tiseanu, C.: Finding pseudo-repetitions. Proc. STACS, volume 20 of LIPIcs, pp. 257–268 (2013)
12. Gawrychowski, P., Manea, F., Nowotka, D.: Testing generalised freeness of words. Proc. STACS, volume 25 of LIPIcs, pp. 337–349 (2014)
13. Gawrychowski, P., Tomohiro, I., Inenaga, S., Köppl, D., Manea, F.: Efficiently finding all maximal α-gapped repeats. Proc. STACS, volume 47 of LIPIcs, pp. 39:1–39:14 (2016)
14. Gusfield, D.: Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, Cambridge (1997)
15. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53, 918–936 (2006)
16. Kociumaka, T., Radoszewski, J., Rytter, W., Walen, T.: Efficient data structures for the factor periodicity problem. Proc. SPIRE, volume 7608 of LNCS, pp. 284–294 (2012)
17. Kolpakov, R., Kucherov, G.: Finding repeats with fixed gap. Proc. SPIRE, pp. 162–168 (2000)
18. Kolpakov, R., Kucherov, G.: Searching for gapped palindromes. Theor. Comput. Sci. 410(51), 5365–5373 (2009)
19. Kolpakov, R., Podolskiy, M., Posypkin, M., Khrapov, N.: Searching of gapped repeats and subrepetitions in a word. Proc. CPM, volume 8486 of LNCS, pp. 212–221 (2014)
20. Manacher, G.: A new linear-time “on-line” algorithm for finding the smallest initial palindrome of a string. J. ACM 22(3), 346–351 (1975)
21. Ruzic, M.: Making deterministic signatures quickly. ACM Transactions on Algorithms 5(3) (2009)
22. Tanimura, Y., Fujishige, Y., Tomohiro, I., Inenaga, S., Bannai, H., Takeda, M.: A faster algorithm for computing maximal α-gapped repeats in a string. Proc. SPIRE, volume 9309 of LNCS, pp. 124–136 (2015)