Validating the Knuth-Morris-Pratt Failure Function, Fast and Online
Abstract
Let \(\pi'_{w}\) denote the failure function of the Knuth-Morris-Pratt algorithm for a word w. In this paper we study the following problem: given an integer array \(A'[1 \mathinner {\ldotp \ldotp }n]\), is there a word w over an arbitrary alphabet Σ such that \(A'[i]=\pi'_{w}[i]\) for all i? Moreover, what is the minimum cardinality of Σ required? We give an elementary and self-contained \(\mathcal{O}(n\log n)\) time algorithm for this problem, thus improving the previously known solution (Duval et al. in Conference in honor of Donald E. Knuth, 2007), which had no polynomial time bound. Using both deeper combinatorial insight into the structure of π′ and advanced algorithmic tools, we further improve the running time to \(\mathcal{O}(n)\).
Keywords
Linear Time, Consistency Check, Online Algorithm, Suffix Tree, Failure Function
1 Introduction
1.1 Pattern Recognition and Failure Functions
Pattern matching algorithms have attracted much attention since the dawn of computer science. A particularly interesting question was whether a linear-time algorithm for this problem exists. The first results were obtained by Matiyasevich for a fixed pattern in the Turing Machine model [18]. However, the first fully linear-time pattern matching algorithm is the Morris-Pratt algorithm [21], which is designed for the RAM machine model, and is well known for its beautiful concept. It simulates the minimal DFA recognizing Σ ^{∗} p (p denotes the pattern) by using a failure function π _{ p }, known as the border array. The automaton’s transitions are recovered, in amortized constant time, from the values of π _{ p } for all prefixes of the pattern, to which the DFA’s states correspond. The values of π _{ p } are precomputed in a similar fashion, also in linear time.
The MP algorithm has many variants. For instance, the Knuth-Morris-Pratt algorithm [17] improves it by using an optimised failure function, namely the strict border array π′ (or strong failure function). This was improved by Simon [23], and further improvements are known [1, 13]. We focus on the KMP failure function for two reasons. Unlike later algorithms, it is well known and used in practice. Furthermore, the strong border array itself is of interest as, for instance, it captures all the information about periodicity of the word. Hence it is often used in word combinatorics and numerous text algorithms, see [4, 5]. On the other hand, even Simon’s algorithm (i.e., the very first improvement) deals with periods of pattern prefixes augmented by a single text symbol rather than pure periods of pattern prefixes.
1.2 Strict Border Array Validation
Problem Statement
We investigate the following problem: given an integer array \(A'[1 \mathinner {\ldotp \ldotp }n]\), is there a word w over an arbitrary alphabet Σ such that \(A'[i]=\pi _{w}'[i]\) for all i, where \(\pi_{w}'\) denotes the failure function of the Knuth-Morris-Pratt algorithm for w? If so, what is the minimum cardinality of the alphabet Σ over which such a word exists?
Pursuing these questions is motivated by the fact that in word combinatorics one is often interested only in values of \(\pi_{w}'\) rather than w itself. For instance, the logarithmic upper bound on delay of KMP follows from properties of the strict border array [17]. Thus it makes sense to ask if there is a word w admitting \(\pi _{w}'=A'\) for a given array A′.
We are interested in an online algorithm, i.e., one that receives the input array values one by one, and is required to output the answer after reading each single value. For the Knuth-Morris-Pratt array validation problem this means that after reading A′[i] the algorithm should answer whether there exists a word w such that \(A'[1 \mathinner {\ldotp \ldotp }i] = \pi_{w}'[1 \mathinner {\ldotp \ldotp }i]\) and what is the minimum size of the alphabet over which such a word w exists.
Previous Results
To the best of our knowledge, this problem was investigated only for a slightly different variant of π′, namely a function g that can be expressed as g[n]=π′[n−1]+1, for which an offline validation algorithm due to Duval et al. [8] is known. Validation of border arrays is used by algorithms generating all valid border arrays [9, 11, 20].
Unfortunately, Duval et al. [8] provided no upper bound on the running time of their algorithm, but they did observe that on certain input arrays it runs in Ω(n ^{2}) time.
Our Results
We give a simple \(\mathcal{O} (n \log n)\) online algorithm Validateπ′ for the strong border array validation, which uses the linear offline bijective transformation between π and π′. Validateπ′ is also applicable to g validation with no changes, thus giving the first provably polynomial algorithm for the problem considered by Duval et al. [8]. Note that the aforementioned bijection between π and π′ cannot be applied directly to g, as it essentially uses the unavailable value π[n]=π′[n], see Sect. 2.
Then we improve Validateπ′ to an optimal linear online algorithm LinearValidateπ′. The improved algorithm relies on both more sophisticated data structures, such as dynamic suffix trees supporting LCA queries, and deeper insight into the combinatorial properties of π′ function.
Related Results
The study of validating arrays related to string algorithms and word combinatorics was started by Franěk et al. [11], who gave an offline linear algorithm for border array validation. This result was improved over time, in particular a simple linear online algorithm for π validation is known [9].
The border array validation problem was also studied in the more general setting of parametrised border array validation [14, 15], where a parametrised border array is a border array for a text in which a permutation of the letters of the alphabet is allowed. A linear-time algorithm for a restricted variant of this problem is known [14], as well as an \(\mathcal{O}(n^{1.5})\) algorithm for the general case [15].
Recently a linear online algorithm for a closely related prefix array validation was given [2], as well as for cover array validation [6].
2 Preliminaries
For w∈Σ ^{∗}, we denote its length by |w|. For v,w∈Σ ^{∗}, by vw we denote the concatenation of v and w. We say that u is a prefix of w if there is v∈Σ ^{∗} such that w=uv. Similarly, we call v a suffix of w if there is u∈Σ ^{∗} such that w=uv. A word v that is both a prefix and a suffix of w is called a border of w. By w[i] we denote the i-th letter of w and by \(w[i \mathinner {\ldotp \ldotp }j]\) we denote the subword w[i]w[i+1]…w[j] of w. We call a prefix (respectively: suffix, border) v of the word w proper if v≠w, i.e., it is shorter than w itself.
Functions π and π′ for the word aabaabaaabaabaac
i  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16
w[i]  a  a  b  a  a  b  a  a  a  b  a  a  b  a  a  c
π[i]  0  1  0  1  2  3  4  5  2  3  4  5  6  7  8  0
π′[i]  −1  1  −1  −1  1  −1  −1  5  1  −1  −1  1  −1  −1  8  0
By \(\pi_{w}^{(k)}\) we denote the k-fold composition of π _{ w } with itself, i.e., \(\pi_{w}^{(0)}[i]:=i\) and \(\pi_{w}^{(k+1)}[i]:=\pi_{w}[\pi_{w}^{(k)}[i]]\). This convention applies to other functions as well. We omit the subscript w in π _{ w }, whenever it is unambiguous. Note that every border of \(w[1 \mathinner {\ldotp \ldotp }i]\) has length \(\pi _{w}^{(k)}[i]\) for some integer k≥0.
The strong failure function π′ is defined as follows: \(\pi'_{w}[n] := \pi_{w}[n]\), and for i<n, π′[i] is the largest k such that \(w[1 \mathinner {\ldotp \ldotp }k]\) is a proper border of \(w[1 \mathinner {\ldotp \ldotp }i]\) and w[k+1]≠w[i+1]. If no such k exists, π′[i]=−1.
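The two definitions can be checked against the table above with a short sketch. It is written in Python as an illustration only. It follows the definitions literally instead of aiming for the best running time, and the function names `border_array` and `strict_border_array` are hypothetical. Both functions return 1-indexed lists, with index 0 used as padding, to match the notation of the text.

```python
def border_array(w):
    """Return pi as a 1-indexed list (pi[0] is unused padding)."""
    n = len(w)
    pi = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        # Python strings are 0-indexed: letter i of the text is w[i - 1].
        while k > 0 and w[k] != w[i - 1]:
            k = pi[k]
        if w[k] == w[i - 1]:
            k += 1
        pi[i] = k
    return pi


def strict_border_array(w):
    """Return pi' as a 1-indexed list, straight from the definition."""
    n = len(w)
    pi = border_array(w)
    strict = [0] * (n + 1)
    if n > 0:
        strict[n] = pi[n]          # pi'[n] := pi[n]
    for i in range(1, n):
        k = pi[i]
        # Walk the proper borders of w[1..i] from longest to shortest.
        while True:
            if w[k] != w[i]:       # w[k+1] != w[i+1] in 1-indexed terms
                strict[i] = k
                break
            if k == 0:             # no border qualifies
                strict[i] = -1
                break
            k = pi[k]
    return strict
```

Running both functions on the word aabaabaaabaabaac reproduces the two rows of the table.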
For two arrays of numbers A and B, we write \(A[i_{a} \mathinner {\ldotp \ldotp }i_{a}+k] \geq B[i_{b} \mathinner {\ldotp \ldotp }i_{b}+k]\) when A[i _{ a }+j]≥B[i _{ b }+j] for j=0,…,k.
2.1 Border Array Validation
Roughly speaking, given a valid border array \(A[1 \mathinner {\ldotp \ldotp }n] \), Validateπ computes all valid π-candidates for A[n+1]: given a valid border array \(A[1\mathinner {\ldotp \ldotp }n]\), the next element A[n+1] is a valid π-candidate if \(A[1 \mathinner {\ldotp \ldotp }n+1]\) is a valid border array as well. The exact formula for the set of valid candidates is not useful for us, though it should be noted that it depends only on \(A[1 \mathinner {\ldotp \ldotp }n]\) and that 0 and A[n]+1 are always valid π-candidates.
The key idea needed to understand the algorithm is that w[i] depends only on the letters of w at positions A ^{ k }[i−1]+1 for k=1,2,… . Thus the algorithm stores Σ[i], the alphabet size required for such a sequence of indices starting at i, for all i. The minimum size of the alphabet required for the whole array A is the maximum over all those values.
(Val1) the valid candidates for π[i] depend only on \(\pi [1 \mathinner {\ldotp \ldotp }i-1]\),
(Val2) π[i−1]+1 is always a valid candidate for π[i],
(Val3) if the alphabet needed for \(A[1 \mathinner {\ldotp \ldotp }n]\) is strictly larger than the one needed for \(A[1 \mathinner {\ldotp \ldotp }n-1]\) then A[n]=0.
3 Overview of the Algorithm
Definition 1 (Consistent functions)
(CF1) \(A[1\mathinner {\ldotp \ldotp }n+1] = \pi_{w}[1 \mathinner {\ldotp \ldotp }n+1]\),
(CF2) \(A'[1\mathinner {\ldotp \ldotp }n] = \pi'_{w}[1\mathinner {\ldotp \ldotp }n]\).
(CF3) every \(B[1 \mathinner {\ldotp \ldotp }n+1]\) consistent with \(A'[1\mathinner {\ldotp \ldotp }n]\) satisfies \(B[1 \mathinner {\ldotp \ldotp }n+1] \leq A[1 \mathinner {\ldotp \ldotp } n+1]\).
Note that it is crucial that A is defined also on n+1.
Our algorithm Validateπ′ (and its improved variant LinearValidateπ′) maintains such a maximal A.
Slopes and Their Properties
The last slope is defined correctly if and only if (3a) holds (otherwise the slope should end earlier), while the values of A and A′ on the last slope are consistent if and only if (3b) holds. These conditions are checked by appropriate queries: (3a) by the pin value check (denoted PinValueCheck), which returns any \(j \in[i \mathinner {\ldotp \ldotp }n]\) such that A′[j]>A[j] or, if there is no such j, the smallest \(j \in[i \mathinner {\ldotp \ldotp }n]\) such that A′[j]=A[j]; and (3b) by the consistency check (denoted ConsistencyCheck), which checks whether \(A'[i \mathinner {\ldotp \ldotp }n] = A'[A[i] \mathinner {\ldotp \ldotp }A[i] + (n-i)]\).
If one of the conditions (3a), (3b) does not hold, Validateπ′ adjusts the last slope of A, until both conditions hold or the input is reported as invalid. These actions are given in detail in Algorithm 6.
If the pin value check returns an index j such that A′[j]>A[j], then we reject the input and report an error: since A is the maximal consistent function, every consistent function A _{1} would also satisfy A _{1}[j]<A′[j]; hence no such A _{1} exists and A′ is invalid.
If ConsistencyCheck fails, then we set the value of A[i] to the next valid candidate value for π[i], see Fig. 3 and propagate the change along the whole slope. If this happens for A[i]=0, then there is no further candidate value, and A′ is rejected. The idea is that some adjustment is needed and since pin value check does not return an index, we cannot break the slope into two and so the only possibility is to decrement A on the whole last slope.
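The two queries can be stated naively as follows. This is a sketch that scans the last slope in linear time, whereas the later sections answer the same queries much faster. The names `pin_value_check` and `consistency_check` are hypothetical, arrays are 1-indexed, and the sketch assumes the convention A′[0]=−1, which is needed when A[i]=0.

```python
def pin_value_check(A, A_strict, i, n):
    """Return some j in [i..n] with A'[j] > A[j]; failing that, the smallest
    j in [i..n] with A'[j] == A[j]; failing that, None."""
    first_equal = None
    for j in range(i, n + 1):
        if A_strict[j] > A[j]:
            return j
        if A_strict[j] == A[j] and first_equal is None:
            first_equal = j
    return first_equal


def consistency_check(A, A_strict, i, n):
    """Check A'[i..n] == A'[A[i] .. A[i] + (n - i)] by direct comparison."""
    start = A[i]
    length = n - i + 1
    # A_strict[0] must hold -1 (assumed convention) for the case A[i] == 0.
    return A_strict[i:n + 1] == A_strict[start:start + length]
```

The caller tells the two outcomes of `pin_value_check` apart by comparing A′[j] with A[j] at the returned index.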
Unfortunately, this simple combinatorial idea alone fails to produce a linear-time algorithm. The problem is caused by the second condition: large segments of A′ should be compared in amortised constant time. While LCA queries on suffix trees seem ideal for this task, available solutions are imperfect: the online suffix tree construction algorithms [19, 24] are linear only for alphabets of constant size, while the only linear-time algorithm for larger alphabets [10] is inherently offline. To overcome this obstacle we specialise the data structures used, building the suffix tree for compressed encoding of A′ and multiple suffix trees for short texts over polylogarithmic alphabet. The details are presented in Sect. 8.
4 Details and Correctness
In this section we present the technical details of the algorithm and prove its correctness together with the combinatorial properties it relies on. We do not address the running time or the way the data structures are organised. We start by showing that all consistent tables coincide on indices smaller than the pin.
Lemma 1
Let \(A[1\mathinner {\ldotp \ldotp }n+1] \geq B[1\mathinner {\ldotp \ldotp }n+1]\) be both consistent with \(A'[1 \mathinner {\ldotp \ldotp }n]\). Let i be the pin (for A). Then \(A[1 \mathinner {\ldotp \ldotp }i-1] = B[1 \mathinner {\ldotp \ldotp }i-1]\).
Proof
Data Maintained

n, the number of values read so far,
\(A'[1\mathinner {\ldotp \ldotp }n]\), the input read so far,
i, the current pin,
\(A[1\mathinner {\ldotp \ldotp }n+1]\), the maximal function consistent with \(A'[1\mathinner {\ldotp \ldotp }n]\):
  \(A[1 \mathinner {\ldotp \ldotp }i-1]\), the fixed prefix,
  A[i], the candidate value that may change.

Sets of Valid π Candidates and Validating A
Validateπ′ creates a border array A, which is always valid by the construction. Nevertheless, it runs Validate\(\pi(A[1 \mathinner {\ldotp \ldotp }i-1])\). This way the set of valid candidates for π[i] is computed, as well as a word w over a minimal-size alphabet Σ such that \(\pi_{w} [1 \mathinner {\ldotp \ldotp }i-1] = A[1 \mathinner {\ldotp \ldotp }i-1]\).
In the remainder of this section it is shown that invariants CF1–CF3 are preserved by Validateπ′.
Lemma 2
If A′[n]=A′[A[n]], then Validateπ′ makes no changes and CF1–CF3 are preserved.
Proof

CF1 holds trivially: the implicit A[n+1]=A[n]+1 is always a valid value for π[n+1], see Val2.
CF2 holds: as A′[n]<A[n] by (1) it is enough to check that A′[n]=A′[A[n]], which holds by (3b).
CF3 holds: consider any \(B[1 \mathinner {\ldotp \ldotp }n+1]\) consistent with \(A'[1 \mathinner {\ldotp \ldotp }n]\). By the induction assumption CF3 holds for \(A[1 \mathinner {\ldotp \ldotp }n]\), hence B[n]≤A[n]. Therefore $$B[n+1] \leq B[n]+1 \leq A[n]+1=A[n+1] , $$ which shows the last claim and thus completes the proof. □
Thus it is left to show that CF1–CF3 are preserved by AdjustLastSlope. We show that during the adjusting inside AdjustLastSlope CF1 and CF3 hold. To be more specific, CF1 alone means that A is always a valid border array, while CF3 means that it is greater than any border table consistent with A′ (this is assumed to hold vacuously if no consistent table exists). Finally, we show that CF2 holds when AdjustLastSlope finishes adjusting the last slope, i.e., that A is then in fact consistent with A′.
For the completeness of the proof, we need also to show that if at any point A′ was reported to be invalid, it is in fact invalid.
Lemma 3
After each iteration of the loop in line 1 of AdjustLastSlope, CF1 and CF3 are preserved. Furthermore, if AdjustLastSlope rejects A′ in line 3 or 9, then A′ is invalid.
Proof
We show both claims by induction. In the following, let \(A_{1}[1 \mathinner {\ldotp \ldotp }n+1]\) be any table consistent with \(A'[1 \mathinner {\ldotp \ldotp }n]\).
For the induction base note that \(A[1\mathinner {\ldotp \ldotp }n]\) and \(A'[1\mathinner {\ldotp \ldotp }n-1]\) satisfy CF1–CF3. To see that CF1 is satisfied by \(A[1\mathinner {\ldotp \ldotp }n+1]\) note that the assigned value A[n+1]=A[n]+1 is always a valid π-value, so CF1 holds for \(A[1 \mathinner {\ldotp \ldotp }n+1]\). Similarly, for CF3 note that \(A[1 \mathinner {\ldotp \ldotp }n] \geq A_{1}[1 \mathinner {\ldotp \ldotp }n]\) and A[n+1]=A[n]+1≥A _{1}[n]+1≥A _{1}[n+1], which shows that CF3 holds for \(A[1\mathinner {\ldotp \ldotp }n+1]\). Additionally, the second claim of the Lemma holds vacuously for \(A[1\mathinner {\ldotp \ldotp }n+1]\), as so far it was not rejected.
Suppose that PinValueCheck returns no index j. Then by the induction assumption CF1 and CF3 hold, which ends the proof in this case.
Suppose that PinValueCheck returns j such that A[j]<A′[j]. Then, since CF3 is satisfied, A _{1}[j]≤A[j]<A′[j], i.e., A _{1} is not a valid π table. So no A _{1} is consistent with A′, which means that A′ is invalid, as reported by Validateπ′. This ends the proof in this case.
It is left to consider the case in which PinValueCheck returns j such that A[j]=A′[j]. Then CF1 is satisfied: A[j] is explicitly set to a valid π candidate while for p>j the A[p] is set to A[p]=A[p−1]+1, which is always a valid π candidate, by Val2. Furthermore, j is an end of slope for A _{1}: By CF3, A _{1}[j]≤A[j]=A′[j] but as A _{1} is a valid π table, A _{1}[j]≥A′[j]. So A′[j]=A _{1}[j] and therefore, by (2), it is an end of a slope for A _{1}. As a consequence, by Lemma 1, \(A[i \mathinner {\ldotp \ldotp }j]=A_{1}[i \mathinner {\ldotp \ldotp }j]\). Note that for \(p \in[i \mathinner {\ldotp \ldotp }j-1]\) it holds that A[p]>A′[p]: otherwise PinValueCheck would have returned such p instead of j. Thus, by (1), A[p] and A′[p] should satisfy A′[p]=A′[A[p]], and this condition is verified by AdjustLastSlope in line 8. If this equation is not satisfied by some p then clearly \(A'[i \mathinner {\ldotp \ldotp }j-1]\) is not consistent with \(A[i \mathinner {\ldotp \ldotp }j]\). Since \(A_{1}[i \mathinner {\ldotp \ldotp }j] = A[i \mathinner {\ldotp \ldotp }j]\) this shows that no such A _{1} exists and consequently A′ is invalid. This shows the second subclaim.
Lemma 4
Suppose that PinValueCheck returns no j and that A satisfies CF1 and CF3. If ConsistencyCheck returns false and A[i]=0 then A′ is invalid. Otherwise after adjusting in line 20 of AdjustLastSlope, CF1 and CF3 hold.
Proof
We now prove that in fact A[i]>A _{1}[i]. Suppose for the sake of contradiction that A[i]=A _{1}[i]. It is not possible that \(A[i \mathinner {\ldotp \ldotp }n+1] = A_{1}[i \mathinner {\ldotp \ldotp }n+1]\): since PinValueCheck returned no j, for each p≥i we would have A _{1}[p]=A[p]>A′[p]. In such a case, by (2), it holds that A′[p]=A′[A _{1}[p]], but from the answer of ConsistencyCheck we know that this is not the case.
Consider the smallest position, say p, such that A[p+1]>A _{1}[p+1]; such a position exists as \(A[i \mathinner {\ldotp \ldotp }n+1] \geq A_{1}[i \mathinner {\ldotp \ldotp }n+1]\) and \(A[i \mathinner {\ldotp \ldotp }n+1] \neq A_{1}[i \mathinner {\ldotp \ldotp }n+1]\). Now consider A _{1}[p]: since A _{1}[p+1]<A _{1}[p]+1 then by (2) this means that A _{1}[p]=A′[p]. This is a contradiction, as PinValueCheck should have returned this p.
Therefore, when ConsistencyCheck returns false, A _{1}[i]<A[i] for an arbitrary A _{1} that is consistent with A′. In particular, if A[i]=0, there is no such A _{1}, and hence A′ is invalid.
It is left to show that CF1 holds, i.e., that \(A[i \mathinner {\ldotp \ldotp }n+1]\) were all assigned valid candidates for π at their respective positions. This was addressed explicitly for A[i], while for p>i the assigned values are A[p−1]+1, which are always valid by Val2. □
The last lemma shows that when AdjustLastSlope finishes, CF2 is satisfied as well.
Lemma 5
When AdjustLastSlope finishes, CF2 is satisfied.
Proof
Recall the recursive formula (2) for π′. Its first case corresponds to j being the last element on the slope and the second to other j’s.
If A[j] is an explicit value and j is not an end of a slope, this formula is verified, when A[j] is stored. If A[j] is explicit and j is an end of the slope then the formula trivially holds.
If A[j] is an implicit value, i.e., such that j is on the last slope of A, PinValueCheck guarantees that A[j]>A′[j] and so the second case of this formula should hold. This is verified by ConsistencyCheck. Hence CF2 holds when all adjustments are finished. □
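The two-case formula that this proof appeals to is not reproduced in this section. The Python sketch below assumes it has the following standard form: π′[j]=π[j] when j is the last position or π[j+1]<π[j]+1 (j ends a slope), and π′[j]=π′[π[j]] otherwise, with the convention π′[0]=−1. The function name `strict_from_border` is hypothetical, and arrays are 1-indexed with index 0 as padding.

```python
def strict_from_border(pi):
    """Compute pi' from pi alone; pi is 1-indexed with pi[0] as padding."""
    n = len(pi) - 1
    strict = [-1] * (n + 1)            # strict[0] = -1 is the assumed convention
    for j in range(1, n + 1):
        if j == n or pi[j + 1] < pi[j] + 1:
            strict[j] = pi[j]          # first case: j is the last element of a slope
        else:
            strict[j] = strict[pi[j]]  # second case: pi[j] < j, so already computed
    return strict
```

Applied to the π row of the table in Sect. 2, the sketch returns the π′ row, which is one way to convince oneself that the assumed form agrees with the definition of π′ on that example.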
Together, Lemmata 2–5 show the correctness of Validateπ′.
Theorem 1
Validateπ′ verifies whether A′ is a valid strict border array. If so, it supplies the maximal function A consistent with A′.
Proof
We proceed by induction on n. If n=0, then clearly A[1]=0, CF1–CF3 hold trivially, and A′ is a valid (empty) π′ array. If n>0 and no adjustments were done, CF1–CF3 hold by Lemma 2. So we consider the case when AdjustLastSlope was invoked.
By Lemma 3 and Lemma 4, if \(A'[1 \mathinner {\ldotp \ldotp }n]\) is rejected, it is invalid. So assume that \(A'[1 \mathinner {\ldotp \ldotp }n]\) was not rejected. We show that it is valid. As it was not rejected, by Lemma 3 and Lemma 4 the constructed table \(A[1 \mathinner {\ldotp \ldotp }n+1]\) together with \(A'[1 \mathinner {\ldotp \ldotp }n]\) satisfy CF1 and CF3. Moreover, by Lemma 5 they satisfy also CF2. Thus \(A[1 \mathinner {\ldotp \ldotp }n+1]\) is a valid border array for some word \(w[1 \mathinner {\ldotp \ldotp }n+1]\) and \(A'[1 \mathinner {\ldotp \ldotp }n]\) is a valid strong border array for the same word \(w[1 \mathinner {\ldotp \ldotp }n]\). □
In the following section we explain how to perform the pin value checks and consistency checks efficiently and bound the whole running time of the algorithm.
5 Performing Pin Value Checks
Domination Properties
As ≺ is an intersection of two transitive relations (order on indices and order on T, defined as T[j]=A[j]−j), it is transitive.
Data Stored
Answering PinValueCheck
When PinValueCheck is asked, we check whether A[j _{1}]≤A′[j _{1}] and return the answer. This way the PinValueCheck is answered in constant time. We show that evaluating this expression for other values of j is not needed, as if A′[j]≥A[j] for some j, then A′[j _{1}]≥A[j _{1}], and moreover if A′[j]>A[j], then also A′[j _{1}]>A[j _{1}].
Update
There is another possible update: when PinValueCheck returns j _{1}, then i←j _{1}+1 and so j _{1}+1 becomes the new pin. In such a case we remove j _{1} from the list.
As each position enters and leaves the list at most once, the time of update is linear.
Lemma 6
All PinValueCheck calls can be made in amortised constant time.
6 Performing Consistency Checks: Slow but Easy
In order to perform consistency check we need to efficiently perform two operations: appending a letter to the current text \(A'[1 \mathinner {\ldotp \ldotp }n]\) and checking if two fragments of the prefix read so far are the same. First we show how to implement both of them using randomisation so that the expected running time is \(\mathcal{O}(\log n)\) per one consistency check. In the next section we improve the running time to (deterministic) \(\mathcal{O}(1)\).
We use the standard labeling technique [16], assigning unique small names to all fragments of lengths that are powers of two. More formally, let name[i][j] be an integer from {1,…,n} such that name[i][j]=name[i′][j] if and only if \(A'[i \mathinner {\ldotp \ldotp }i+2^{j}-1]=A'[i' \mathinner {\ldotp \ldotp }i'+2^{j}-1]\). Then checking if any two fragments of A′ are the same is easy: we only need to cover both of them with fragments of length 2^{ j }, where 2^{ j } is the largest power of two not exceeding their length. Then we check if the corresponding fragments of length 2^{ j } are the same in constant time using the previously assigned names.
Appending a new letter A′[n+1] is more difficult, as we need to compute name[n−2^{ j }+2][j] for all j=1,…,log n. We set name[n+1][0] to A′[n+1]. For names with j>0 we need to check if a given fragment of text \(A'[n-2^{j}+2 \mathinner {\ldotp \ldotp }n+1]\) occurs at some earlier position, and if so, choose the same name. To locate the previous occurrences, for each j>0 we keep a dictionary M(j) mapping the pair (name[i][j−1],name[i+2^{ j−1}][j−1]) to name[i][j]. To check if a given fragment \(A'[n-2^{j}+2 \mathinner {\ldotp \ldotp }n+1]\) occurs previously in the text, we look up the pair (name[n−2^{ j }+2][j−1],name[n−2^{ j−1}+2][j−1]) in M(j). If there is such an element in M(j), we set name[n−2^{ j }+2][j] equal to the corresponding name. Otherwise we set name[n−2^{ j }+2][j] equal to the size of M(j) plus 1, which is the smallest integer that we have not yet assigned as a name of a fragment of length 2^{ j }. Then we update the dictionary accordingly: we insert a mapping from (name[n−2^{ j }+2][j−1],name[n−2^{ j−1}+2][j−1]) to the newly added element.
To implement the dictionaries M(j), we use dynamic hashing with a worst-case constant time lookup and amortized expected constant time for updates (see [7] or a simpler variant with the same performance bounds [22]). Then the expected running time of the whole algorithm becomes \(\mathcal{O}(n\log n)\), as there are log n dictionaries, each running in expected linear time (the expectation is taken over the random choices of the algorithm).
7 Size of the Alphabet
Validateπ not only answers whether the input table is a valid border array, but also returns the minimum size of the needed alphabet. We show that this is also true of Validateπ′. Roughly speaking, Validateπ′ runs Validateπ and simply returns its answers. To this end we show that the minimum alphabet size required by the fixed prefix of A matches the minimum alphabet size required by A′.
Lemma 7
Let \(A'[1 \mathinner {\ldotp \ldotp }n]\) be a valid π′ function, \(A[1 \mathinner {\ldotp \ldotp }n+1]\) the maximal function consistent with \(A'[1 \mathinner {\ldotp \ldotp }n]\), and i the pin. The minimum alphabet size required by \(A'[1\mathinner {\ldotp \ldotp }n]\) equals the minimum alphabet size required by \(A[1\mathinner {\ldotp \ldotp }i1]\) if A[i]>0, and by \(A[1\mathinner {\ldotp \ldotp }i]\) if A[i]=0.
Proof
Suppose first that A[i]>0. Then Validateπ run on \(A[1 \mathinner {\ldotp \ldotp }n]\) returns the same size of required alphabet as run on \(A[1 \mathinner {\ldotp \ldotp }i-1]\), since new letters are needed only when A[j]=0 at some position, see Val3, and A[j]>0 for j on the last slope. Consider any \(B[1 \mathinner {\ldotp \ldotp }n+1]\) consistent with \(A'[1 \mathinner {\ldotp \ldotp }n]\). Then \(B[1 \mathinner {\ldotp \ldotp }i-1] = A[1 \mathinner {\ldotp \ldotp }i-1]\) by Lemma 1. Thus A requires an alphabet no larger than that required by \(B[1 \mathinner {\ldotp \ldotp }i-1]\), which is clearly no larger than the one required by the whole \(B[1 \mathinner {\ldotp \ldotp }n]\).
Note that Validateπ′ runs Validateπ either on \(A[1\mathinner {\ldotp \ldotp }i-1]\), or on \(A[1\mathinner {\ldotp \ldotp }i]\) when A[i]=0. In either case, all these values are fixed, and thus no position of A is inspected twice by Validateπ.
We further note that Lemma 7 implies that the minimum size of the alphabet required for a valid strict border array is at most as large as the one required for border array. The latter is known to be \(\mathcal{O} (\log n)\) [20, Th. 3.3a]. This observation implies the following.
Corollary 1
The minimum size of the alphabet required for a valid strict border array is \(\mathcal{O} (\log n)\).
8 Improving the Running Time to Linear
This section describes our linear time online algorithm LinearValidateπ′ by specifying necessary changes to Validateπ′. It suffices to show how to perform consistency checks more efficiently, as all other operations work in amortised constant time. A natural approach is as follows: construct a suffix tree [10, 19, 24] for the input table \(A'[1 \mathinner {\ldotp \ldotp }n]\), together with a data structure for answering LCA queries [3]. The best known algorithm for constructing the suffix tree runs in linear time, regardless of the size of the alphabet [10]. Unfortunately, this algorithm, and all other linear time solutions we are aware of, are inherently offline, and as such unsuitable for our purposes. The online suffix tree constructions of [19, 24] have a slightly bigger running time of \(\mathcal{O} (n \log|\varSigma|)\), where Σ is the alphabet. As A′ is a text over an alphabet {−1,0,…,n−1}, i.e., of size n+1, these constructions would only guarantee an \(\mathcal{O}(n\log n)\) time.
To get a linear time algorithm we exploit both the structure of the π′ array and the relationship between subsequent consistency checks. In more detail, we first demonstrate how to improve Ukkonen’s algorithm [24] so that it runs in time \(\mathcal{O} (n)\) for alphabets of polylogarithmic size, which may be of independent interest. This alone is still not enough, since A′ is over an alphabet of linear size. To overcome this obstacle we use the combinatorial properties of A′ to compress it. The compressed table uses an alphabet of polylogarithmic size, which makes the improved version of Ukkonen’s algorithm applicable. New problems arise, as the compressed table is a little harder to read and further conditions need to be verified to answer the consistency checks.
8.1 Suffix Trees for Polylogarithmic Alphabet
In this section we present a construction of an online dictionary with constant time access and insertion, for t=log n elements. When used in Ukkonen’s algorithm [24], it guarantees the following construction of suffix trees.
Lemma 8
For any constant c, the suffix tree for a text of length n over an alphabet of size log^{ c } n can be constructed online in \(\mathcal{O}(n)\) time. Given a vertex in the resulting tree, its child labeled by a specified letter can be retrieved in constant time.
The only reason Ukkonen’s algorithm [24] does not work in linear time is that given a vertex it needs to efficiently retrieve its child labeled with a specified letter. If we are able to perform such a retrieval in constant time, Ukkonen’s algorithm runs in linear time.
For that we can use the atomic heaps of Fredman and Willard [12], which allow constant time search and insert operations on a collection of \(\mathcal{O}(\sqrt{\log n})\)-element sets. This results in a fairly complicated structure, which can be greatly simplified since in our case not only are the sets small, but the size of the universe is bounded as well.
Simplifying Assumptions
We assume that the value of ⌊log n⌋ is known. Since n is not known in advance, when we read elements of A′ one by one, as soon as the value of n doubles, we repeat the whole computation with a new value of ⌊log n⌋. This changes the running time only by a constant factor.
It is enough to give the construction for the alphabet of size log n, as for alphabets of size log^{ c } n we can encode each letter in c characters chosen from an alphabet of logarithmic size.
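The encoding of one letter by c characters can be illustrated by writing the letter in a positional system. The sketch below is a minimal Python example. The function names and the parameter `base`, which stands for the size of the small alphabet, are hypothetical, and letters are assumed to be integers in the range 0 to base^c − 1.

```python
def encode_letter(letter, base, c):
    """Write `letter` (0 <= letter < base**c) as exactly c digits in base `base`."""
    if not 0 <= letter < base ** c:
        raise ValueError("letter does not fit in c digits")
    digits = [0] * c
    for pos in range(c - 1, -1, -1):   # fill from the least significant digit
        digits[pos] = letter % base
        letter //= base
    return digits


def encode_text(text, base, c):
    """Concatenate the fixed-width encodings of all letters of the text."""
    return [d for letter in text for d in encode_letter(letter, base, c)]
```

Because every letter occupies exactly c characters, two encoded texts agree on an aligned block of c characters if and only if the original letters agree.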
First Step: Dictionary for Small Number of Elements
We implement an online dictionary for a universe of size log n. Both access and insert time are constant and the memory usage is at most linear in the number of elements stored. The first step of the construction is a simpler case of t keys, for \(t \leq\sqrt{\log n}\). Then this construction is folded twice to obtain the general case of t=Θ(log n). One step of such a construction is depicted in Fig. 5.
The indices of items currently present in the dictionary are encoded in one machine word, called the characteristic vector V, in which the bit V[i]=1 if and only if dictionary contains key i.
We store pointers to the keys in the dictionary in a dynamically resized pointer table, in order of their arrival times: whenever we insert a new item, its pointer is put right after the previously added one. Additionally, we keep a permutation table P that encodes the order in which currently stored elements have been inserted. In other words, P[i] stores the position in the pointer table of the pointer to i. Since \(t \leq\sqrt{\log n}\), all successive values of such a permutation can be stored in one machine word.
Accessing the Information for Small Number of Elements
If we want to find the pointer to the element number k, we first check if V[k]=1. Then we find the index of k, i.e., j=#{k′≤k: V[k′]=1}. To do this, we mask out all the bits on positions larger than k, obtaining vector V′. Then j=#{k′: V′[k′]=1}. Computing j can be done by looking V′ up in a precomputed table. Then we look at position j in the permutation table: P[j] gives the address in the pointer table under which the pointer to k is stored. This gives us the desired key.
The precomputed tables can be obtained using standard techniques as well as deamortised in a standard way.
Updating the Information for Small Number of Elements
When a new key k arrives, it is stored in the memory at the next available position and a pointer to it is put in the dictionary: first we set V[k]=1 and insert the pointer at the last position of the pointer table. We also need to update the permutation table. To do this, we calculate j=#{k′<k: V[k′]=1} and m=#{k′: V[k′]=1}; this is done in the same way as when accessing the stored pointer. Then we change the permutation table: we move all the numbers on positions greater than j one position higher and write m+1 on position j. Since the whole permutation table fits in one machine word, this can be done in constant time: let P′ be the table P with all positions larger than j−1 masked out and P″ the table with all positions smaller than j masked out. Then we shift P″ by one position higher and set P←P′P″. Then we set P[j]=m+1.
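The access and update steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `SmallDict`, the parameter `field_bits`, and the 0-based indexing are assumptions made here, Python integers stand in for a single machine word, and the popcount is computed directly rather than via a precomputed table.

```python
class SmallDict:
    """Sketch of the small dictionary: bit vector V plus packed table P."""

    def __init__(self, field_bits=4):
        self.b = field_bits   # bits per entry of the packed table P (assumed)
        self.V = 0            # characteristic vector: bit k set iff key k is stored
        self.P = 0            # packed table: field r = slot of the r-th smallest key
        self.slots = []       # pointer table, filled in arrival order

    def _rank(self, k):
        # Number of stored keys strictly smaller than k (mask, then popcount).
        return bin(self.V & ((1 << k) - 1)).count("1")

    def get(self, k):
        if not (self.V >> k) & 1:
            return None
        r = self._rank(k)
        slot = (self.P >> (r * self.b)) & ((1 << self.b) - 1)
        return self.slots[slot]

    def insert(self, k, value):
        if (self.V >> k) & 1:
            raise KeyError("key already present")
        r = self._rank(k)
        m = len(self.slots)               # slot that the new pointer will occupy
        assert m < (1 << self.b), "packed field too narrow"
        shift = r * self.b
        low = self.P & ((1 << shift) - 1)       # P': fields below r
        high = (self.P >> shift) << shift       # P'': fields r and above
        # Shift the upper part one field higher and write m into field r.
        self.P = low | (high << self.b) | (m << shift)
        self.V |= 1 << k
        self.slots.append(value)
```

For example, after inserting keys 5, 2, 9, 0 in this order, `get(2)` computes rank 1, reads field 1 of `P`, and follows it to the second slot of the pointer table.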
Larger Number of Elements
When the number of items becomes bigger, we fold the above construction twice (somewhat resembling the B-tree of order \(t = \sqrt{ \log n}\)): choose a subset of keys k _{1}<k _{2}<⋯<k _{ ℓ } such that between k _{ j } and k _{ j+1} there are at least t and at most 2t other keys. Observe that k _{1}<k _{2}<⋯<k _{ ℓ } can be kept in the above structure, with constant update and access time; we refer to it as the top structure. Moreover, for each i the keys between k _{ i } and k _{ i+1} can also be kept in such a structure. We refer to those structures as the bottom structures.
Access for Large Number of Elements
To access information associated with a given key k, we first look up the largest chosen key smaller than k in the top structure and then look up k in the corresponding bottom structure. The second operation is already known to have constant amortised time. The first operation can be done in \(\mathcal{O}(1)\) time by first masking out the bits on positions larger than k in the top characteristic vector and then extracting the position of the largest bit. Again, this can be done using standard techniques.
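The top-level lookup can be illustrated with a short Python sketch. The function name `predecessor` is hypothetical, and `int.bit_length` stands in for the constant-time highest-set-bit extraction assumed in the text.

```python
def predecessor(V, k):
    """Largest key strictly smaller than k whose bit is set in V, or -1 if none."""
    masked = V & ((1 << k) - 1)      # drop the bits at positions k and above
    return masked.bit_length() - 1   # index of the highest remaining set bit
```

With V = 0b101001 (keys 0, 3, 5 chosen), `predecessor(V, 4)` yields 3, which identifies the bottom structure to search.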
Update for Large Number of Elements
When we insert a new item k, we first find i such that k _{ i−1}≤k<k _{ i }, where k _{ i−1} and k _{ i } are elements of the top structure. This is done in the same way as when information on k is accessed. Then k is inserted into the proper bottom structure.
If after an insertion the bottom structure has 2t+1 elements, we choose its middle element, insert it into the top structure, and split the keys into two parts consisting of t elements, creating two new bottom structures out of them. This requires \(\mathcal{O}(t)\) time but the amortised insertion time is only \(\mathcal{O}(1)\): the size of the bottom structure is t after the split and 2t before the next split, so we can charge the cost to the new t keys inserted into the tree before the splits.
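The charging argument can be checked on a small simulation. The sketch below is only an illustration under simplifying assumptions: sorted Python lists replace the word-packed top and bottom structures, keys are assumed distinct, and the name `simulate_splits` is hypothetical. It counts 2t+1 units of work per split and compares the total with the number of insertions.

```python
import bisect
import random


def simulate_splits(keys, t):
    """Insert distinct keys into a two-level structure; return it with the split work."""
    top = []          # chosen separator keys, kept sorted
    bottoms = [[]]    # bottoms[r] holds the keys between top[r-1] and top[r]
    split_work = 0
    for k in keys:
        r = bisect.bisect_left(top, k)     # index of the bottom structure for k
        bisect.insort(bottoms[r], k)
        if len(bottoms[r]) == 2 * t + 1:   # overflow: promote the middle key
            b = bottoms[r]
            top.insert(r, b[t])
            bottoms[r:r + 1] = [b[:t], b[t + 1:]]
            split_work += 2 * t + 1        # cost proportional to the bucket size
    return top, bottoms, split_work


random.seed(0)
keys = random.sample(range(10**6), 5000)
top, bottoms, work = simulate_splits(keys, t=4)
print(work, "units of split work for", len(keys), "insertions")
```

Every split is preceded by at least t+1 insertions into the structure being split, so the total split work stays below twice the number of insertions.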
8.2 Compressing A′
Lemma 8 does not apply to A′ directly, as it may hold too many different values. To overcome this, we compress A′ into Compress(A′), so that the resulting text is over a polylogarithmic alphabet and checking equality of two fragments of A′ can be performed by looking at the corresponding fragments of Compress(A′). To compress A′, we scan it from left to right. If A′[i]=A′[i−j] for some 1≤j≤log^{2} n we output #_{0} j. If A′[i]≤log^{2} n we output #_{1} A′[i]. Otherwise we output the binary encoding of A′[i] enclosed by #_{2} and #_{3}. For each i we store the position of its encoding in Compress(A′) in Start[i].
Note that we need to know whether a value A′[n] appeared within the last log^{2} n positions. To do this, we keep a table Prev, such that Prev[i] gives the position of the last i in A′ (or −1, if no i appeared so far). It is easily updated in constant time: when we read A′[n] we set Prev[A′[n]] to n and Prev[n] to −1.
In this encoding only the last case of A′[i]>log^{2} n and A′[i] not occurring in \(A'[i-\log^{2}n \mathinner {\ldotp \ldotp }i-1]\) may result in more than one symbol of an alphabet of size O(log^{2} n). We show that the number of different large values of π′ is small, which allows bounding the total size of these encodings and hence the whole Compress(A′) table by \(\mathcal{O} (n)\).
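The encoding rules can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: indices are 0-based, the parameter `window` plays the role of log^{2} n, a dictionary replaces the table Prev, and each pair such as `("#0", j)` is treated as a single symbol of the compressed alphabet. When a value recurs several times within the window, the sketch refers to its most recent occurrence.

```python
def compress(A, window):
    """Return (Compress(A), Start) for a list A of non-negative integers."""
    out, start, prev = [], [], {}
    for i, a in enumerate(A):
        start.append(len(out))          # Start[i]: where the encoding of A[i] begins
        p = prev.get(a, -1)
        if p != -1 and i - p <= window:
            out.append(("#0", i - p))   # same value occurred i-p positions earlier
        elif a <= window:
            out.append(("#1", a))       # small value, written literally
        else:
            out.append("#2")            # large value: binary digits between markers
            out.extend(bin(a)[2:])
            out.append("#3")
        prev[a] = i
    return out, start


symbols, start = compress([0, 0, 5, 100, 7, 100], window=4)
print(start)
```

Only the third branch emits more than one symbol, which is why the number of distinct large values matters for the total length.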
Lemma 9
Let k≥0 and consider a segment of 2^{ k } consecutive entries in the π′ array. At most 48 different values from the interval [2^{ k },2^{ k+1}) occur in such a segment.
Proof
First note that each i such that π′[i]>0 corresponds to a nonextensible occurrence of the border \(w[1 \mathinner {\ldotp \ldotp }\pi'[i]]\), i.e., π′[i] is the maximum j such that \(w[1 \mathinner {\ldotp \ldotp }j]\) is a suffix of \(w[1 \mathinner {\ldotp \ldotp }i]\) and w[j+1]≠w[i+1].
 1. There exist \(p_{i_{1}}<p_{i_{2}}<p_{i_{3}}\) in this segment such that
$$p_{i_1} - \pi'[p_{i_1}] + 1 > p_{i_2} - \pi'[p_{i_2}] + 1 > p_{i_3} - \pi'[p_{i_3}] + 1 . $$
Define \(x = w[p_{i_{1}}+1]\) and \(y = w[\pi'[p_{i_{1}}]+1]\), see Fig. 6. Then by the definition of \(\pi'[p_{i_{1}}]\), x≠y. We derive a contradiction by showing that x=y. To this end we use the periodicity of the word w. Define
$$\begin{aligned} a &= \bigl(p_{i_2} - \pi'[p_{i_2}]+1\bigr) - \bigl(p_{i_3} - \pi'[p_{i_3}]+1\bigr) , \\ b &= \bigl(p_{i_1} - \pi'[p_{i_1}]+1\bigr) - \bigl(p_{i_3} - \pi'[p_{i_3}]+1\bigr) , \\ s &= \pi'[p_{i_1}] + b , \end{aligned}$$
see Fig. 6. Then both a, b are periods of \(w[1\mathinner {\ldotp \ldotp }s]\), see Fig. 6. We show that \(a,b \leq\frac{s}{2}\) and so the periodicity lemma can be applied to them and the word \(w[1\mathinner {\ldotp \ldotp }s]\).
$$\begin{aligned} a < b & = \bigl(p_{i_1} - \pi'[p_{i_1}]\bigr) - \bigl(p_{i_3} - \pi'[p_{i_3}]\bigr) < \pi'[p_{i_3}] - \pi'[p_{i_1}] \\ & \leq r-\ell < \frac{\ell}{2} . \end{aligned}$$
Since \(s = \pi'[p_{i_{1}}] + b \) and \(\pi'[p_{i_{1}}] \in [\ell,r )\) we obtain s>ℓ. Thus
$$a < b < \frac{s}{2} . $$
By the periodicity lemma b−a is also a period of \(w[1\mathinner {\ldotp \ldotp }s]\). As position \(p_{i_{1}}+1\) is covered by the nonextensible border ending at \(p_{i_{2}}\) (note that \(b < \frac{\ell}{2}\) and \(\pi'[p_{i_{1}}] \geq\ell\)):
$$x = w[p_{i_1}+1] = w\bigl[\pi'[p_{i_1}]+1+(b-a) \bigr] , $$
see Fig. 7. Note that
$$\pi'[p_{i_1}]+1+(b-a) \leq\pi'[p_{i_1}]+b = s $$
and so \(w[\pi'[p_{i_{1}}]+1+(b-a)]\) is a letter from the word \(w[1 \mathinner {\ldotp \ldotp }s]\), which has a period b−a. Hence
$$x = w\bigl[\pi'[p_{i_1}]+1+(b-a)\bigr] = w\bigl[ \pi'[p_{i_1}]+1\bigr] = y , $$
contradiction.
 2. There exist \(p_{i_{1}}<p_{i_{2}}<p_{i_{3}}\) in this segment such that
$$p_{i_1} - \pi'[p_{i_1}] + 1 < p_{i_2} - \pi'[p_{i_2}] + 1 < p_{i_3} - \pi'[p_{i_3}] + 1 , $$
see Fig. 8. By assumption \(\pi'[p_{i_{1}}], \pi'[p_{i_{2}}] \geq\ell\). We identify the periods of the corresponding subwords \(w[1 \mathinner {\ldotp \ldotp }\pi'[p_{i_{1}}]]\) and \(w[1 \mathinner {\ldotp \ldotp }\pi'[p_{i_{2}}]]\), respectively:
$$\begin{aligned} a &= \bigl(p_{i_2} - \pi'[p_{i_2}]+1\bigr) - \bigl(p_{i_1} - \pi'[p_{i_1}]+1\bigr) , \\ b &= \bigl(p_{i_3} - \pi'[p_{i_3}]+1\bigr) - \bigl(p_{i_2} - \pi'[p_{i_2}]+1\bigr), \end{aligned}$$
as depicted in Fig. 8. We estimate their sum:
$$\begin{aligned} a + b &= \bigl(p_{i_2} - \pi'[p_{i_2}]\bigr) - \bigl(p_{i_1} - \pi'[p_{i_1}]\bigr) + \bigl(p_{i_3} - \pi'[p_{i_3}]\bigr) - \bigl(p_{i_2} - \pi'[p_{i_2}]\bigr) \\ &= \bigl(p_{i_3} - \pi'[p_{i_3}]\bigr) - \bigl(p_{i_1} - \pi'[p_{i_1}]\bigr) \\ &= \bigl(\pi'[p_{i_1}] - \pi'[p_{i_3}] \bigr) + (p_{i_3} - p_{i_1}) \\ &\leq(r - \ell) + 2^{k'} \leq\biggl(\frac{1}{2}\ell-2^{k'}\biggr) + 2^{k'}= \frac{\ell}{2} . \end{aligned}$$
Since \(\ell\leq\pi'[p_{i_{1}}], \pi'[p_{i_{2}}]\), we obtain that
$$ a + b \leq\frac{\pi'[p_{i_1}]}{2} , \frac{\pi'[p_{i_2}]}{2} . \qquad (7)$$
There are two subcases, depending on whether \(\pi'[p_{i_{1}}] < \pi '[p_{i_{2}}] \) or \(\pi'[p_{i_{1}}] > \pi'[p_{i_{2}}]\):
 (a) \(\pi'[p_{i_{1}}]<\pi'[p_{i_{2}}]\): Define \(x = w[p_{i_{1}}+1]\) and \(y= w[\pi'[p_{i_{1}}]+1]\), see Fig. 9. Then by the definition of \(\pi'[p_{i_{1}}]\), x≠y. We obtain a contradiction by showing that x=y.
Since the nonextensible border ending at \(p_{i_{3}}\) spans over position \(p_{i_{1}}+1\) and \(a+b < \pi'[p_{i_{1}}]\) (see (7)) it holds that
$$ x = w\bigl[\bigl(\pi'[p_{i_1}] +1\bigr) - (a+b)\bigr] . \qquad (8)$$
Comparing the nonextensible borders ending at \(p_{i_{2}}\) and \(p_{i_{3}}\) we deduce that b is a period of \(w[1\mathinner {\ldotp \ldotp }\pi'[p_{i_{2}}]]\) and as \(\pi '[p_{i_{1}}]+1 \leq\pi'[p_{i_{2}}]\),
$$y = w\bigl[\pi'[p_{i_1}]+1\bigr]=w\bigl[ \pi'[p_{i_1}]+1-b\bigr] . $$
Similarly, by comparing the nonextensible prefixes ending at \(p_{i_{1}}\) and \(p_{i_{2}}\) we deduce that a is a period of \(w[1\mathinner {\ldotp \ldotp }\pi'[p_{i_{1}}]]\). Thus
$$ y = w\bigl[\pi'[p_{i_1}]+1-b\bigr]=w\bigl[ \pi'[p_{i_1}]+1-b-a\bigr] \qquad (9)$$
and therefore by (8) and (9) x=y. Contradiction.
 (b) \(\pi'[p_{i_{1}}]>\pi'[p_{i_{2}}]\): Let \(x' = w[p_{i_{2}}+1]\) and \(y'=w[\pi'[p_{i_{2}}]+1]\). Then x′≠y′ by the definition of \(\pi'[p_{i_{2}}]\), see Fig. 10. We show that x′=y′ and hence obtain a contradiction. Since the nonextensible border ending at \(p_{i_{3}}\) spans over position \(p_{i_{2}}+1\), we obtain that
$$ x' = w\bigl[\pi'[p_{i_2}]-b+1 \bigr] , \qquad (10)$$
see Fig. 10. By comparing the nonextensible prefixes ending at \(p_{i_{1}}\) and \(p_{i_{2}}\) we deduce that a is a period of \(w[1\mathinner {\ldotp \ldotp }\pi'[p_{i_{1}}]]\). As \(\pi'[p_{i_{2}}] + 1 \leq\pi'[p_{i_{1}}]\),
$$ y'=w\bigl[\pi'[p_{i_2}]+1\bigr]=w\bigl[ \pi'[p_{i_2}]+1-a\bigr] . $$
By comparing the nonextensible prefixes ending at \(p_{i_{2}}\) and \(p_{i_{3}}\) we deduce that b is a period of \(w[1\mathinner {\ldotp \ldotp }\pi'[p_{i_{2}}]]\). Since \(a + b\leq\frac{\pi'[p_{i_{2}}]}{2}\) by (7), it holds that
$$y' = w\bigl[\pi'[p_{i_2}]+1-a\bigr]=w\bigl[ \pi'[p_{i_2}]+1-a-b\bigr] . $$
As a is a period of \(w[1\mathinner {\ldotp \ldotp }\pi'[p_{i_{1}}]]\) and \(\pi'[p_{i_{1}}] > \pi'[p_{i_{2}}]\) it is also a period of \(w[1\mathinner {\ldotp \ldotp }\pi'[p_{i_{2}}]+1]\), hence
$$ y'=w\bigl[\pi'[p_{i_2}]+1-a-b \bigr]=w\bigl[\pi'[p_{i_2}]+1-b\bigr]. \qquad (11)$$
So by (10) and (11) x′=y′, contradiction. □
Lemma 9 can be used to bound the size of the compressed representation Compress(A′) of A′.
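The statement of Lemma 9 can be probed empirically. The sketch below is an illustration only, with several assumptions: π′ is computed from the ordinary border array by the standard rule, with the conventions π′[0]=−1 and π′[n]=π[n] (these conventions, like the helper names, are choices made here), and the check slides a window of 2^{k} consecutive entries over π′[1..n], counting the distinct values in [2^{k},2^{k+1}).

```python
def border_array(w):
    """pi[i] = length of the longest proper border of w[:i], for i = 0..n."""
    n = len(w)
    pi = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        while k > 0 and w[k] != w[i - 1]:
            k = pi[k]
        if w[k] == w[i - 1]:
            k += 1
        pi[i] = k
    return pi


def strict_border_array(w):
    """sp[i] = longest border j of w[:i] with w[j] != w[i]; -1 if none (assumed convention)."""
    n = len(w)
    pi = border_array(w)
    sp = [-1] * (n + 1)
    for i in range(1, n + 1):
        j = pi[i]
        if i == n or w[j] != w[i]:   # the border cannot be extended by one letter
            sp[i] = j
        else:
            sp[i] = sp[j]            # fall back to the strict border of the border
    return sp


def max_distinct_large_values(sp, k):
    """Maximum number of distinct values in [2^k, 2^(k+1)) over windows of 2^k entries."""
    lo, hi, width = 1 << k, 1 << (k + 1), 1 << k
    vals = sp[1:]
    best = 0
    for s in range(max(1, len(vals) - width + 1)):
        seen = {v for v in vals[s:s + width] if lo <= v < hi}
        best = max(best, len(seen))
    return best
```

On highly periodic inputs such as Fibonacci words the observed counts stay within the bound of Lemma 9.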
Corollary 2
Compress(A′) consists of \(\mathcal{O} (n)\) symbols over an alphabet of \(\mathcal{O} (\log^{2} n)\) size.
Proof
As the alphabet of Compress(A′) is of polylogarithmic size, the suffix tree for Compress(A′) can be constructed in linear time by Lemma 8.
8.3 Performing Consistency Checks on the Compress(A′)
Subchecks
Consider a consistency check: is \(A'[j\mathinner {\ldotp \ldotp }j+k-1]=A'[i\mathinner {\ldotp \ldotp }i+k-1]\), where j=A[i]? We first establish the equivalence of this equality with equality of appropriate fragments of Compress(A′). Note that A′[ℓ]=A′[ℓ′] does not imply the equality of the two corresponding fragments of Compress(A′), as they may refer to previous values of A′. Still, such references can reach only log^{2} n elements backwards. This observation is formalised as follows:
Lemma 10
Proof
If k≤log^{2} n, the claim holds trivially, as (12) and (14) are exactly the same and (13) holds vacuously.
So suppose that k>log^{2} n.
(⇒) Suppose first that \(A'[j\mathinner {\ldotp \ldotp }j+k-1]=A'[i\mathinner {\ldotp \ldotp }i+k-1]\). Then of course \(A'[j\mathinner {\ldotp \ldotp }j+\log^{2} n-1]=A'[i\mathinner {\ldotp \ldotp }i+\log^{2} n-1]\), as k>log^{2} n by case assumption. Thus (14) holds.
Note that \(\mathit{Compress}(A')[\mathit{Start}[j+\log^{2} n] \mathinner {\ldotp \ldotp }\mathit {Start}[j+k]-1] \) is created using only \(A'[j \mathinner {\ldotp \ldotp }j+k-1]\): when creating an entry corresponding to A′[ℓ] we can refer to A′[ℓ] and to at most log^{2} n elements before it. Similarly, \(\mathit{Compress}(A')[\mathit{Start}[i+\log^{2} n] \mathinner {\ldotp \ldotp }\mathit {Start}[i+k]-1]\) is created using \(A'[i \mathinner {\ldotp \ldotp }i+k-1]\) exclusively. Since \(A'[j \mathinner {\ldotp \ldotp }j+k-1] = A'[i \mathinner {\ldotp \ldotp }i+k-1]\), both fragments of Compress(A′) are created using the same input, and so they are equal. Thus (13) holds, which ends the proof in this direction.

If they are both equal to #_{0} m (i.e., both are equal to some value of A′ that is m≤log^{2} n positions earlier) then A′[i+ℓ]=A′[i+ℓ−m] and A′[j+ℓ]=A′[j+ℓ−m]; by the inductive assumption A′[i+ℓ−m]=A′[j+ℓ−m] (as m≤log^{2} n), which ends the case.

If they are both equal to #_{1} m (i.e., both are equal to m≤log^{2} n) then A′[i+ℓ]=A′[j+ℓ]=m.

If they are equal to #_{2} m _{1}…m _{ z }#_{3} (i.e., both are larger than log^{2} n and are both encoded in binary as m _{1}…m _{ z }) then m _{1}…m _{ z } encode some m in binary and A′[i+ℓ]=A′[j+ℓ]=m, which ends the last case.
As in Sect. 8.1, we assume that ⌊log n⌋ is known. In the same way, we repeat the whole computation from scratch as soon as its value changes. This increases the running time by a constant factor.
We call the checks of the form (13) the compressed consistency checks, checks of the form (14)—short consistency checks and the near short consistency checks when moreover i−j<log^{2} n.
The compressed consistency checks can be answered in amortised constant time using LCA query [3] on the suffix tree built for Compress(A′). It remains to show how to perform short consistency checks in amortised constant time.
8.4 Performing Short Consistency Checks
Performing Near Short Consistency Checks
To answer near short consistency checks efficiently, we split A′ into blocks of log^{2} n consecutive letters: A′=B _{1} B _{2}…B _{ ℓ }, see Fig. 11. Then we build suffix trees for each pair of consecutive blocks, i.e., B _{1} B _{2},B _{2} B _{3},…,B _{ ℓ−1} B _{ ℓ }. Each block contains at most log^{2} n values smaller than log^{2} n, and at most 48 log n larger values by Lemma 9, so all suffix trees can be built in linear time by Lemma 8. For each tree we also build a data structure supporting constant-time LCA queries [3]. Then, any near short consistency check reduces to an LCA query in one of these suffix trees. Such a query also gives the actual length of the longest common prefix of the two compared strings; this is used in performing short consistency checks.
Performing Short Consistency Checks

j≤j _{ best }≤j+log^{2} n

the length (say L) of the common prefix of \(A'[i \mathinner {\ldotp \ldotp }i + k -1]\) and \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best} + k -1]\) is known.
Simplifying Assumption
Invariants
The intuition behind the invariants is as follows: (15a) simply states that we are interested in a common prefix of length at most k. Invariant (15b) justifies the choice of j _{ best }, i.e., we know the common prefix of A′ starting at j _{ best } and at i. Invariant (15c) ensures that comparing A′ starting at j and j _{ best } can be done using a near short consistency check. Invariant (15d) says that if j≠j _{ best } then there is a reason for that: \(A'[i \mathinner {\ldotp \ldotp }i + k-1]\) and \(A'[j \mathinner {\ldotp \ldotp }j + k-1]\) have a shorter common prefix than \(A'[i \mathinner {\ldotp \ldotp }i + k-1]\) and \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best} + k-1]\). Finally, (15e) shows maximality of L: either it is k (so it cannot be larger) or there is a mismatch at the ‘next position’.
Potential
Note that when the change of the potential is negative then it actually helps in paying for near short consistency checks and letter-by-letter comparisons. Since 0≤L≤k and j≤j _{ best }≤j+log^{2} n, at any point the potential is nonnegative and at most log^{2} n, so the total cost at any point is the sum of amortised costs in each step and the potential, which is sublinear.
We pay for the amortised cost using credit that we get for the changes of n and j: for every increase Δn, we get 8Δn units of credit; for every change of j we get 8Δj units of credit. Clearly the sum of all Δn is n, so in this way we are issued at most 8n units of credit. We show that the sum of all Δj is also \(\mathcal{O} (n)\).
Lemma 11
The sum of all Δj over the whole run of Validateπ′ is at most 2n.
Proof
For the purpose of the proof, whenever we change the value of i or j let i′, j′ refer to the new values and i, j to the old ones.
It is enough to show that the sum of all increments of j is at most n; then clearly the sum of all decrements of j is at most n as well.
Since i≤n and i only increases, its sum of increments is at most n. So the total sum of increments of j is at most n, as claimed. □
LetterbyLetter Comparisons
Lemma 12
If (15a)–(15c) are satisfied before LetterbyLetter, then (15a)–(15c) and (15e) are satisfied afterwards. The amortised cost of LetterbyLetter is 1.
Proof
For the purpose of the proof, let L _{0} be the initial value of L and L _{1} the final value of L; by ‘L’ we denote the value inside LetterbyLetter.
Note that i, j and k are not altered. For (15a), by assumption L _{0}≤k before CommonShortConsistencyCheck, we increment L by 1 and stop as soon as it reaches k, so L _{1}≤k. For (15b) note that \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best} + L_{0}-1] = A'[i\mathinner {\ldotp \ldotp }i+L_{0}-1]\) holds by the assumption and we verified \(A'[j_{best} + L_{0} \mathinner {\ldotp \ldotp }j_{best}+L_{1}-1] = A'[i+L_{0}\mathinner {\ldotp \ldotp }i+L_{1}-1]\) letter by letter. Invariant (15c) holds as neither j nor j _{ best } was changed. As for (15e), it is the termination condition of the while loop, so it holds upon its termination.
Concerning the amortised cost: i, j, j _{ best } do not change, so Δp=−ΔL, i.e., it is nonpositive. On the other hand we make ΔL successful letter-to-letter comparisons and perhaps one unsuccessful one (we ignore the cost of checking whether L=k, as it is at most as high as the cost of letter-to-letter comparisons). So the cost of comparisons is at most ΔL+1. Hence the amortised cost is at most −ΔL+ΔL+1=1, as claimed. □
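The loop analysed in Lemma 12 can be sketched as follows. The code is a reconstruction from the proof, not the authors' pseudocode: indices are 0-based, the function name and the returned comparison counter are assumptions added so that the bound of ΔL+1 comparisons can be observed directly.

```python
def letter_by_letter(A, i, j_best, k, L):
    """Extend the known common prefix length L of A[j_best:] and A[i:], up to k.

    Returns the new L and the number of letter-to-letter comparisons made.
    """
    comparisons = 0
    while L < k:
        comparisons += 1
        if A[j_best + L] != A[i + L]:
            break              # mismatch at the 'next position', cf. (15e)
        L += 1
    return L, comparisons


A = [1, 2, 3, 9, 1, 2, 3, 4]
new_L, cost = letter_by_letter(A, i=4, j_best=0, k=4, L=1)
print(new_L, cost)   # L grows by 2, at the price of 3 comparisons
```

Here ΔL=2 and three comparisons are made, so the cost net of the potential change is 1, matching the lemma.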
Answering Short Consistency Checks Using j _{ best }
Lemma 13
Assume that (15a)–(15c) are satisfied. Then CommonShortConsistencyCheck correctly answers the short consistency check, its amortised cost is 6 and all (15a)–(15e) hold after CommonShortConsistencyCheck.
Proof
Regarding the cost, the amortised cost of LetterbyLetter is 1 by Lemma 12, setting j _{ best } to j can only lower the potential, and NearShortConsistencyCheck queries are answered in constant time using suffix trees.
We now show that after CommonShortConsistencyCheck all (15a)–(15e) hold. By assumption initially (15a)–(15c) hold. By Lemma 12 after the first LetterbyLetter they still hold and additionally (15e) holds. Suppose that ℓ<L, in particular j≠j _{ best }. Then (15d) simply states that ℓ<L, which is the case. So suppose that ℓ≥L. Resetting j _{ best } to j may make (15e) invalid, but (15a)–(15c) are preserved: (15a) holds as we do not change L, (15b) holds as we know that \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best} + k -1]\) has a common prefix of length L with both \(A'[j\mathinner {\ldotp \ldotp }j+k-1]\) and \(A'[i\mathinner {\ldotp \ldotp }i+k-1]\) and so also \(A'[j\mathinner {\ldotp \ldotp }j+k-1]\) and \(A'[i\mathinner {\ldotp \ldotp }i+k-1]\) have a common prefix of length L. Invariant (15c) holds trivially. By Lemma 12, (15a)–(15c) and (15e) hold after LetterbyLetter. Note that LetterbyLetter does not modify j and so (15d) trivially holds, as j=j _{ best }.
Concerning the correctness: if ℓ<L then j≠j _{ best } and from (15b) and (15d) we get that \(A'[j\mathinner {\ldotp \ldotp }j + L-1]\) and \(A'[i\mathinner {\ldotp \ldotp }i + L-1]\) are different. Since by (15a) we know that L≤k, also \(A'[j\mathinner {\ldotp \ldotp }j + k - 1]\) and \(A'[i\mathinner {\ldotp \ldotp }i + k - 1]\) are different. This justifies the no answer. If ℓ≥L then in the end j=j _{ best } and so by (15a)–(15b) and (15e) we know that \(A'[j\mathinner {\ldotp \ldotp }j+k-1] = A'[i\mathinner {\ldotp \ldotp }i+k-1]\) if and only if L=k, which is exactly the answer returned by the algorithm. □
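The control flow that the proof of Lemma 13 walks through can be sketched as below. This is a reconstruction from the proof text only, under explicit assumptions: indices are 0-based, the helper `lcp` is a naive scan that replaces both LetterbyLetter and the suffix-tree based near short consistency check (so the sketch shows the logic, not the constant-time bound), and the function names are hypothetical.

```python
def lcp(A, p, q, limit):
    """Length of the common prefix of A[p:] and A[q:], capped at limit (naive scan)."""
    l = 0
    while l < limit and A[p + l] == A[q + l]:
        l += 1
    return l


def common_short_check(A, i, j, j_best, k, L):
    """Decide whether A[j:j+k] == A[i:i+k], given that A[j_best:] and A[i:] share L letters.

    Returns (answer, new j_best, new L).
    """
    L += lcp(A, j_best + L, i + L, k - L)   # first letter-by-letter extension
    ell = lcp(A, j, j_best, k)              # stand-in for the near short check
    if ell < L:
        # A[j:] leaves A[j_best:] before position L, where A[i:] still agrees.
        return False, j_best, L
    j_best = j                              # j is at least as good: adopt it
    L += lcp(A, j_best + L, i + L, k - L)   # second letter-by-letter extension
    return L == k, j_best, L
```

The early negative answer relies on (15b): since the fragments at j _{ best } and i agree on the first L letters, a mismatch between j and j _{ best } before position L is also a mismatch between j and i.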
It remains to show how to update j _{ best } and L.
Types of Short Consistency Checks
 (Type 1) This is the first iteration of AdjustLastSlope and PinValueCheck did not return any index in this iteration.
 (Type 2) This is not the first iteration of AdjustLastSlope and PinValueCheck did not return any index in this iteration.
 (Type 3) PinValueCheck returned an index in this iteration.
Lemma 14
In Type 1 short consistency check it holds that Δi=Δj=0, Δk≥0 and Δn≥max(1,Δk); exactly 8Δn units of credit are issued.
In Type 2 short consistency check it holds that Δi=0, Δj<0 and Δk=Δn=0; exactly 8Δj units of credit are issued.
In Type 3 short consistency check it holds that Δi>0, Δj=Δi and −Δi≤Δk≤0, Δn≥0; exactly 8Δj+8Δn units of credit are issued.
Note in particular that when Δi, Δj are known, we can figure out which type of query this is: Type 3 short consistency check is unique with Δi>0, Type 2 with Δj<0 while Type 1 with Δi=Δj=0.
Proof
Recall that we issue 8Δn+8Δj units of credit, which yields the claim on the number of credit issued in each of the cases.
Type 1 short consistency check: Since this is the first iteration of AdjustLastSlope, we read A′[n] and it is not equal to A′[A[n]]. In particular, since the last invocation of AdjustLastSlope we read at least one additional value of A′. Hence Δn≥1. As PinValueCheck did not return any index, we do not modify i and j since the last invocation of the short consistency check, so Δi=Δj=0. Concerning k, recall that the short consistency check is asked only on \(A'[i \mathinner {\ldotp \ldotp }\min(n,i + \log^{2}n-1)]\), i.e., k=min(n−i+1,log^{2} n). Hence, when k _{0} and n _{0} are the values of k and n when the previous short consistency check was asked, we have k _{0}=min(n _{0}−i+1,log^{2} n) (note that we can assume that log n and log n _{0} are the same, as we repeat the calculation as soon as ⌈log n⌉ increases). Then k≥k _{0} and Δk≤Δn, but there is no guarantee that Δk>0, i.e., k _{0}=k can happen when n _{0}−i+1>log^{2} n.
Type 2 short consistency check: in this case the short consistency check is asked in an iteration of AdjustLastSlope that is not the first one, and PinValueCheck did not return any index in this iteration. This means that A[i] is assigned the next candidate in line 14. Thus i, k are unchanged as compared to the previous short consistency check, while j is decreased, hence Δi=0, Δj<0 and Δk=0. Furthermore, we do not read any new value of A′, so Δn=0.
Type 3 short consistency check: In this case the short consistency check is run for the same slope, but the pin is moved, thus the new value i′ is larger than the old i. By our simplifying assumption we do not decrease the last slope, just place the new i′ on it, i.e., we set A[i′]=A[i]+(i′−i), i.e., we take the new j such that Δj=Δi. As n only increases, Δn≥0. Concerning k, recall again that k=min(n−i+1,log^{2} n), hence 0≥Δk≥−Δi. □
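The remark after Lemma 14, that the type of a check is determined by Δi and Δj, together with the formula for k used in the proof, can be written down as a small Python sketch. The function names are hypothetical and `window` stands for log^{2} n.

```python
def check_type(delta_i, delta_j):
    """Classify a short consistency check from the changes of i and j (Lemma 14)."""
    if delta_i > 0:
        return 3          # the pin moved; Lemma 14 then also gives delta_j == delta_i
    if delta_j < 0:
        return 2          # same pin, next (smaller) candidate for A[i]
    if delta_i == 0 and delta_j == 0:
        return 1          # only new values of A' were read
    raise ValueError("combination not covered by Lemma 14")


def check_length(n, i, window):
    """Length k of the fragment compared by a short consistency check."""
    return min(n - i + 1, window)
```

For instance, with i fixed and n growing past i+log^{2} n−1, `check_length` stops increasing, which is the situation Δk=0<Δn mentioned for Type 1.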
In the following, we describe how to update j _{ best } and L in those three different cases so that (15a)–(15c) are preserved.
Type 1 Updates
Lemma 15
Suppose that we are to make Type 1 short consistency check and all (15a)–(15e) hold. Then (15a)–(15c) are preserved and the amortised cost is at most Δn.
Proof
Concerning the invariants: as L is unchanged and Δk≥0 by Lemma 14 we get that (15a) is preserved. Similarly, since we do not change j, j _{ best }, L, the (15b)–(15c) are preserved. □
This allows calculating the whole cost of answering Type 1 short consistency check.
Corollary 3
In Type 1 of short consistency check the amortised cost of Type1Updatej _{ best } and CommonShortConsistencyCheck is covered by the released credit. The Type1Updatej _{ best } followed by CommonShortConsistencyCheck preserves (15a)–(15e) and returns the correct answer to short consistency check.
Proof
By Lemma 15 the update of j _{ best } and L has amortised cost at most Δn. By Lemma 13 the amortised cost of CommonShortConsistencyCheck is at most 6. On the other hand, by Lemma 14 we know that 8Δn≥6+Δn units of credit are issued, which suffices to pay for the amortised cost.
Concerning the correctness, by Lemma 15 the (15a)–(15c) are satisfied after Type1Updatej _{ best } which by Lemma 13 means that after CommonShortConsistencyCheck all (15a)–(15e) hold and the answer to short consistency check is correct. □
Type 2 Updates
Lemma 16
Assume that all (15a)–(15e) hold and we are to make a Type 2 short consistency check. Then after Type2Updatej _{ best } the invariants (15a)–(15c) are preserved. The amortised cost is at most Δj+1.
Proof
Suppose that j+log^{2} n≥j _{ best }. The invariants (15a)–(15b) hold by assumption, as none of i, j _{ best }, L and k was modified. For (15c) note that j _{ best }≤j+log^{2} n holds by case assumption and j≤j _{ best } held by assumption even before the decrement of j, so it holds now as well.
Corollary 4
In a Type 2 short consistency check the amortised cost of Type2Updatej _{ best } and CommonShortConsistencyCheck is covered by the issued credit. Type2Updatej _{ best } followed by CommonShortConsistencyCheck preserves (15a)–(15e) and correctly answers the short consistency check.
Proof
By Lemma 16, the update of j _{ best } and L has amortised cost at most Δj+1. By Lemma 13, the amortised cost of CommonShortConsistencyCheck is 6. On the other hand, by Lemma 14, 8Δj≥7+Δj units of credit are issued, which suffices to pay for the amortised cost.
Concerning the correctness, by Lemma 16 after Type2Updatej _{ best } the invariants (15a)–(15c) hold and so by Lemma 13 after CommonShortConsistencyCheck all (15a)–(15e) hold and the answer to the short consistency check is correct. □
Type 3 Updates
Lemma 17
Suppose that (15a)–(15e) hold and we are to make Type 3 short consistency check. Then Type3Updatej _{ best } preserves (15a)–(15c). The amortised cost is at most 1+Δj.
Proof
Consider the case in which j _{ best } and L are not reset. By Lemma 14 we get that k is decreased by at most Δj, while we decrease L by Δj, hence (15a) is preserved. Concerning (15b) let L′, i′ and \(j_{best}'\) be the previous values of L, i and j _{ best }. Then \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best}+L-1]\) is an ending block of \(A'[j_{best}' \mathinner {\ldotp \ldotp }j_{best}'+L'-1]\) and \(A'[i\mathinner {\ldotp \ldotp }i+L-1]\) is an ending block of \(A'[i' \mathinner {\ldotp \ldotp }i'+L'-1]\). Hence \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best}+L-1] = A'[i\mathinner {\ldotp \ldotp }i+L-1]\) follows from \(A'[j_{best}' \mathinner {\ldotp \ldotp }j_{best}'+L'-1] = A'[i'\mathinner {\ldotp \ldotp }i'+L'-1]\). So (15b) is preserved. For (15c) note that we changed j and j _{ best } by the same value Δj, so (15c) is preserved.
Corollary 5
In a Type 3 short consistency check the amortised cost of Type3Updatej _{ best } and CommonShortConsistencyCheck is covered by the issued credit. Type3Updatej _{ best } followed by CommonShortConsistencyCheck preserves (15a)–(15e) and returns the correct answer to the short consistency check.
Proof
By Lemma 17 the update of j _{ best } and L has amortised cost at most 1+Δj. By Lemma 13 the amortised cost of CommonShortConsistencyCheck is at most 6. On the other hand, by Lemma 14 we obtain that at least 8Δj≥7+Δj units of credit are issued, which suffices to pay for the amortised cost.
Concerning the correctness, by Lemma 17 after Type3Updatej _{ best } the invariants (15a)–(15c) hold and so by Lemma 13 after CommonShortConsistencyCheck all (15a)–(15e) hold and furthermore the answer to the short consistency check is correct. □
In the end, the short consistency check is performed as follows: depending on which type it is, we run one of Type1Updatej _{ best }, Type2Updatej _{ best }, Type3Updatej _{ best }. Afterwards we apply CommonShortConsistencyCheck. By Corollaries 3–5 the answer returned to the short consistency check is correct and the issued credit covers the whole cost. Since the issued credit is linear, we are done.
Running Time
Validateπ′ runs in \(\mathcal{O} (n)\) time: construction of the suffix trees and doing consistency checks, as well as doing pin value checks all take \(\mathcal{O} (n)\) time.
9 Remarks and Open Problems
While Validateπ produces the word w over the minimum alphabet such that π _{ w }=A online, this is not the case with Validateπ′ and LinearValidateπ′. At each timestep both these algorithms can output a word over minimum alphabet such that \(\pi'_{w}=A'\), but the letters assigned to positions on the last slope may yet change as further entries of A′ are read.
Since Validateπ′ and LinearValidateπ′ keep the function \(\pi[1\mathinner {\ldotp \ldotp }n+1]\) after reading \(A'[1 \mathinner {\ldotp \ldotp }n]\), virtually no changes are required to adapt them to g validation, where g[i]=π′[i−1]+1 is the function considered by Duval et al. [8], because \(A'[1 \mathinner {\ldotp \ldotp }n-1]\) can be obtained from \(g[1 \mathinner {\ldotp \ldotp }n]\). Running Validateπ′ or LinearValidateπ′ on such A′ gives \(A[1 \mathinner {\ldotp \ldotp }n]\) that is consistent with \(A'[1 \mathinner {\ldotp \ldotp }n-1]\) and \(g[1\mathinner {\ldotp \ldotp }n]\). A similar proof shows that \(A[1\mathinner {\ldotp \ldotp }n]\) and \(g[1 \mathinner {\ldotp \ldotp }n]\) require the same minimum size of the alphabet.
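The translation between g and A′ follows directly from g[i]=π′[i−1]+1 and can be sketched as below. The helper names are hypothetical, lists are padded at index 0 to mimic the 1-based notation, and g[1]=0 reflects the convention π′[0]=−1, which is an assumption of this sketch.

```python
def a_prime_from_g(g):
    """Recover A'[1..n-1] from g[1..n]; index 0 of both lists is unused padding."""
    n = len(g) - 1
    return [None] + [g[i] - 1 for i in range(2, n + 1)]


def g_from_a_prime(a_prime):
    """Build g[1..n] from A'[1..n-1], taking g[1] = 0 (assumed convention)."""
    m = len(a_prime) - 1                  # m = n - 1
    return [None, 0] + [a_prime[i] + 1 for i in range(1, m + 1)]


g = g_from_a_prime([None, 0, -1, 0])      # A'[1..3] of the word abab
print(g[1:])                              # g[1..4]
```

The two helpers are inverse to each other on g[2..n], so the validation algorithms can be run on the recovered A′ unchanged.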
Two interesting questions remain: is it possible to remove the suffix trees and LCA queries from our algorithm without hindering its time complexity? We believe that deeper combinatorial insight might result in a positive answer.
Notes
Acknowledgements
This work was partially supported by Polish Ministry of Science and Higher Education under grants N N206 1723 33, 2007–2010; Łukasz Jeż was also partially supported by the Israeli Centers of Research Excellence (ICORE) program, Center No. 4/11.
References
 1.Breslauer, D., Colussi, L., Toniolo, L.: On the comparison complexity of the string prefixmatching problem. J. Algorithms 29(1), 18–67 (1998) CrossRefzbMATHMathSciNetGoogle Scholar
 2.Clément, J., Crochemore, M., Rindone, G.: Reverse engineering prefix tables. In: Proceedings of 26th STACS, pp. 289–300 (2009). http://drops.dagstuhl.de/opus/volltexte/2009/1825 Google Scholar
 3.Cole, R., Hariharan, R.: Dynamic lca queries on trees. In: Proceedings of SODA ’99, pp. 235–244. Society for Industrial and Applied Mathematics, Philadelphia (1999) Google Scholar
 4.Crochemore, M., Rytter, W.: Jewels of Stringology. World Scientific Publishing Company, Singapore (2002) CrossRefGoogle Scholar
 5.Crochemore, M., Hancart, C., Lecroq, T.: Algorithms on Strings. Cambridge University Press, Cambridge (2007) CrossRefzbMATHGoogle Scholar
 6.Crochemore, M., Iliopoulos, C., Pissis, S., Tischler, G.: Cover array string reconstruction. In: CPM 2010. Lecture Notes in Computer Science, vol. 6129, pp. 251–259. Springer, Berlin (2010) Google Scholar
 7.Dietzfelbinger, M., Karlin, A.R., Mehlhorn, K., auf der Heide, F.M., Rohnert, H., Tarjan, R.E.: Dynamic perfect hashing: upper and lower bounds. SIAM J. Comput. 23(4), 738–761 (1994) CrossRefzbMATHMathSciNetGoogle Scholar
 8.Duval, J.P., Lecroq, T., Lefebvre, A.: Efficient validation and construction of Knuth–Morris–Pratt arrays. In: Conference in Honor of Donald E. Knuth (2007) Google Scholar
 9.Duval, J.P., Lecroq, T., Lefebvre, A.: Efficient validation and construction of border arrays and validation of string matching automata. RAIRO Theor. Inform. Appl. 43(2), 281–297 (2009). doi: 10.1051/ita:2008030 CrossRefzbMATHMathSciNetGoogle Scholar
 10.Farach, M.: Optimal suffix tree construction with large alphabets. In: Proceedings of FOCS ’97, pp. 137–143. IEEE Computer Society, Washington (1997) Google Scholar
 11.Franěk, F., Gao, S., Lu, W., Ryan, P.J., Smyth, W.F., Sun, Y., Yang, L.: Verifying a border array in linear time. J. Comb. Math. Comb. Comput. 42, 223–236 (2002) zbMATHGoogle Scholar
 12.Fredman, M.L., Willard, D.E.: Transdichotomous algorithms for minimum spanning trees and shortest paths. J. Comput. Syst. Sci. 48(3), 533–551 (1994). doi: 10.1016/S00220000(05)800649 CrossRefzbMATHMathSciNetGoogle Scholar
 13.Hancart, C.: On Simon’s string searching algorithm. Inf. Process. Lett. 47(2), 95–99 (1993) CrossRefzbMATHMathSciNetGoogle Scholar
 14.I, T., Inenaga, S., Bannai, H., Takeda, M.: Counting parameterized border arrays for a binary alphabet. In: Proc. of the 3rd LATA, pp. 422–433 (2009). doi: 10.1007/9783642009822_36 Google Scholar
 15. I, T., Inenaga, S., Bannai, H., Takeda, M.: Verifying and enumerating parameterized border arrays. Theor. Comput. Sci. 412(50), 6959–6981 (2011). doi: 10.1016/j.tcs.2011.09.008
 16. Karp, R.M., Miller, R.E., Rosenberg, A.L.: Rapid identification of repeated patterns in strings, trees and arrays. In: STOC ’72: Proceedings of the Fourth Annual ACM Symposium on Theory of Computing, pp. 125–136. ACM, New York (1972). doi: 10.1145/800152.804905
 17. Knuth, D.E., Morris, J.H. Jr., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)
 18. Matiyasevich, Y.: Real-time recognition of the inclusion relation. J. Sov. Math. 1, 64–70 (1973). Published (in Russian) in Zap. Nauc̆. Semin. POMI, 20, 104–114 (1971)
 19. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976). doi: 10.1145/321941.321946
 20. Moore, D., Smyth, W.F., Miller, D.: Counting distinct strings. Algorithmica 23(1), 1–13 (1999). http://link.springer.de/link/service/journals/00453/bibs/23n1p1.html
 21. Morris, J.H. Jr., Pratt, V.R.: A linear pattern-matching algorithm. Tech. Rep. 40, University of California, Berkeley (1970)
 22. Pagh, R., Rodler, F.F.: Cuckoo hashing. J. Algorithms 51(2), 122–144 (2004). doi: 10.1016/j.jalgor.2003.12.002
 23. Simon, I.: String matching algorithms and automata. In: Results and Trends in Theoretical Computer Science. LNCS, vol. 812, pp. 386–395. Springer, Berlin (1994)
 24. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995). doi: 10.1007/BF01206331
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.