Abstract
Let \(\pi'_{w}\) denote the failure function of the Knuth-Morris-Pratt algorithm for a word w. In this paper we study the following problem: given an integer array \(A'[1 \mathinner {\ldotp \ldotp }n]\), is there a word w over an arbitrary alphabet Σ such that \(A'[i]=\pi'_{w}[i]\) for all i? Moreover, what is the minimum cardinality of Σ required? We give an elementary and self-contained \(\mathcal{O}(n\log n)\) time algorithm for this problem, thus improving the previously known solution (Duval et al. in Conference in honor of Donald E. Knuth, 2007), which had no polynomial time bound. Using both deeper combinatorial insight into the structure of π′ and advanced algorithmic tools, we further improve the running time to \(\mathcal{O}(n)\).
1 Introduction
1.1 Pattern Recognition and Failure Functions
Pattern matching algorithms have attracted much attention since the dawn of computer science. It was of particular interest whether a linear-time algorithm for this problem exists. The first results were obtained by Matiyasevich for a fixed pattern in the Turing machine model [18]. However, the first fully linear-time pattern matching algorithm is the Morris-Pratt algorithm [21], which is designed for the RAM machine model and is well known for its beautiful concept. It simulates the minimal DFA recognizing Σ ^{∗} p (p denotes the pattern) by using a failure function π _{ p }, known as the border array. The automaton’s transitions are recovered, in amortised constant time, from the values of π _{ p } for all prefixes of the pattern, to which the DFA’s states correspond. The values of π _{ p } are precomputed in a similar fashion, also in linear time.
The MP algorithm has many variants. For instance, the Knuth-Morris-Pratt algorithm [17] improves on it by using an optimised failure function, namely the strict border array π′ (or strong failure function). This was improved by Simon [23], and further improvements are known [1, 13]. We focus on the KMP failure function for two reasons. Unlike the later algorithms, it is well known and used in practice. Furthermore, the strong border array itself is of interest: for instance, it captures all the information about the periodicity of the word. Hence it is often used in word combinatorics and in numerous text algorithms, see [4, 5]. On the other hand, even Simon’s algorithm (i.e., the very first improvement) deals with periods of pattern prefixes augmented by a single text symbol rather than with pure periods of pattern prefixes.
1.2 Strict Border Array Validation
Problem Statement
We investigate the following problem: given an integer array \(A'[1 \mathinner {\ldotp \ldotp }n]\), is there a word w over an arbitrary alphabet Σ such that \(A'[i]=\pi _{w}'[i]\) for all i, where \(\pi_{w}'\) denotes the failure function of the Knuth-Morris-Pratt algorithm for w? If so, what is the minimum cardinality of the alphabet Σ over which such a word exists?
Pursuing these questions is motivated by the fact that in word combinatorics one is often interested only in values of \(\pi_{w}'\) rather than w itself. For instance, the logarithmic upper bound on delay of KMP follows from properties of the strict border array [17]. Thus it makes sense to ask if there is a word w admitting \(\pi _{w}'=A'\) for a given array A′.
We are interested in an online algorithm, i.e., one that receives the input array values one by one and is required to output the answer after reading each single value. For the Knuth-Morris-Pratt array validation problem this means that after reading A′[i] the algorithm should answer whether there exists a word w such that \(A'[1 \mathinner {\ldotp \ldotp }i] = \pi_{w}'[1 \mathinner {\ldotp \ldotp }i]\), and report the minimum size of the alphabet over which such a word w exists.
Previous Results
To the best of our knowledge, this problem was previously investigated only for a slightly different variant of π′, namely a function g that can be expressed as g[n]=π′[n−1]+1, for which an offline validation algorithm due to Duval et al. [8] is known. Validation of border arrays is used by algorithms generating all valid border arrays [9, 11, 20].
Unfortunately, Duval et al. [8] provided no upper bound on the running time of their algorithm, but they did observe that on certain input arrays it runs in Ω(n ^{2}) time.
Our Results
We give a simple \(\mathcal{O} (n \log n)\) online algorithm Validateπ′ for strong border array validation, which uses the linear offline bijective transformation between π and π′. Validateπ′ is also applicable to g validation with no changes, thus giving the first provably polynomial algorithm for the problem considered by Duval et al. [8]. Note that the aforementioned bijection between π and π′ cannot be applied directly to g, as it essentially uses the unavailable value π[n]=π′[n], see Sect. 2.
Then we improve Validateπ′ to an optimal linear online algorithm LinearValidateπ′. The improved algorithm relies on both more sophisticated data structures, such as dynamic suffix trees supporting LCA queries, and deeper insight into the combinatorial properties of π′ function.
Related Results
The study of validating arrays related to string algorithms and word combinatorics was started by Franěk et al. [11], who gave an offline linear algorithm for border array validation. This result was improved over time, in particular a simple linear online algorithm for π validation is known [9].
The border array validation problem was also studied in the more general setting of parametrised border array validation [14, 15], where a parametrised border array is a border array for a text in which a permutation of the letters of the alphabet is allowed. A linear-time algorithm for a restricted variant of this problem is known [14], as well as an \(\mathcal{O}(n^{1.5})\) time algorithm for the general case [15].
Recently, a linear online algorithm for the closely related prefix array validation problem was given [2], as well as one for cover array validation [6].
2 Preliminaries
For w∈Σ ^{∗}, we denote its length by |w|. For v,w∈Σ ^{∗}, by vw we denote the concatenation of v and w. We say that u is a prefix of w if there is v∈Σ ^{∗} such that w=uv. Similarly, we call v a suffix of w if there is u∈Σ ^{∗} such that w=uv. A word v that is both a prefix and a suffix of w is called a border of w. By w[i] we denote the ith letter of w and by \(w[i \mathinner {\ldotp \ldotp }j]\) we denote the subword w[i]w[i+1]…w[j] of w. We call a prefix (respectively: suffix, border) v of the word w proper if v≠w, i.e., if it is shorter than w itself.
For a word w, its failure function π _{ w } is defined as follows: π _{ w }[i] is the length of the longest proper border of \(w[1 \mathinner {\ldotp \ldotp }i]\) for i=1,2,…,n, see Table 1. It is known that the π _{ w } table can be computed in linear time, see Algorithm 1.
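The linear-time precomputation can be sketched in Python as follows (a standard rendition of the Morris-Pratt loop; the function name and the 0-indexed return convention are ours, not the paper's):

```python
def failure_function(w):
    """Border array pi of w: pi[i] is the length of the longest proper
    border of w[1..i].  Returned as a 0-indexed list, so the value for
    the prefix of length i sits at index i-1.  Runs in linear time."""
    n = len(w)
    pi = [0] * (n + 1)  # pi[1..n] in 1-indexed form; pi[0] unused
    k = 0               # length of the border of the current prefix
    for i in range(2, n + 1):
        # try to extend the border of w[1..i-1] by the next letter
        while k > 0 and w[k] != w[i - 1]:
            k = pi[k]
        if w[k] == w[i - 1]:
            k += 1
        pi[i] = k
    return pi[1:]
```

The amortisation argument is the usual one: k increases by at most one per iteration, so the total number of decreases in the inner loop is linear.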
By \(\pi_{w}^{(k)}\) we denote the k-fold composition of π _{ w } with itself, i.e., \(\pi_{w}^{(0)}[i]:=i\) and \(\pi_{w}^{(k+1)}[i]:=\pi_{w}[\pi_{w}^{(k)}[i]]\). This convention applies to other functions as well. We omit the subscript w in π _{ w } whenever it is unambiguous. Note that every border of \(w[1 \mathinner {\ldotp \ldotp }i]\) has length \(\pi _{w}^{(k)}[i]\) for some integer k≥0.
The strong failure function π′ is defined as follows: \(\pi'_{w}[n] := \pi_{w}[n]\), and for i<n, π′[i] is the largest k such that \(w[1 \mathinner {\ldotp \ldotp }k]\) is a proper border of \(w[1 \mathinner {\ldotp \ldotp }i]\) and w[k+1]≠w[i+1]. If no such k exists, π′[i]=−1.
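A direct, quadratic-time transcription of this definition, useful as a reference implementation (the function name and 0-indexing are ours):

```python
def strong_failure_brute(w):
    """pi' straight from the definition: for i < n, the largest k such
    that w[1..k] is a proper border of w[1..i] with w[k+1] != w[i+1]
    (or -1 if none), and pi'[n] = pi[n].  Returned as a 0-indexed list."""
    n = len(w)

    def is_border(k, i):  # is w[1..k] a border of w[1..i]?
        return w[:k] == w[i - k:i]

    pip = []
    for i in range(1, n):                 # prefix lengths 1 .. n-1
        best = -1
        for k in range(i - 1, -1, -1):    # try longer borders first
            if is_border(k, i) and w[k] != w[i]:
                best = k
                break
        pip.append(best)
    if n > 0:
        # pi'[n] = pi[n]: the longest proper border, no letter condition
        pip.append(max(k for k in range(n) if is_border(k, n)))
    return pip
```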
It is well known that π _{ w } and \(\pi'_{w}\) can be obtained from one another in linear time, using additional lookups in w to check whether w[i]=w[j] for some i, j. What is perhaps less known, these lookups are not necessary, i.e., there is a constructive bijection between π _{ w } and \(\pi'_{w}\). For completeness, we supply both procedures, see Algorithm 2 and Algorithm 3. By a standard argument it can be shown that they run in linear time. The correctness, as well as the procedures themselves, is a consequence of the following observation: for i<n,

$$\pi'_{w}[i] = \pi_{w}[i] \quad\text{if and only if}\quad \pi_{w}[i+1] \neq \pi_{w}[i]+1 . \qquad (1)$$
Note that the procedure π′Fromπ explicitly uses the following recursive formula for π′[j] for j<n, whose correctness follows from (1):

$$\pi'[j] = \begin{cases} \pi[j] & \text{if } \pi[j+1] \neq \pi[j]+1 ,\\ \pi'[\pi[j]] & \text{if } \pi[j+1] = \pi[j]+1 , \end{cases} \qquad (2)$$

with the convention π′[0]:=−1.
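Both directions of the bijection can be sketched as follows (a sketch, not the paper's Algorithms 2 and 3: names and 0-indexing are ours). The backward direction uses the inequality π[j+1]≤π[j]+1, which makes the case of (2) that applied at position j recoverable from π[j+1] and π′[j] alone, so no access to w is needed:

```python
def pi_prime_from_pi(pi):
    """Compute pi' from pi alone via the recursion (2);
    0-indexed lists, so pi[j-1] holds pi[j].  pi'[0] is taken as -1."""
    n = len(pi)
    pip = [0] * n
    for j in range(1, n):            # 1-indexed j = 1, ..., n-1
        if pi[j] == pi[j - 1] + 1:   # pi[j+1] == pi[j]+1: second case
            b = pi[j - 1]
            pip[j - 1] = pip[b - 1] if b > 0 else -1
        else:                        # first case of (2)
            pip[j - 1] = pi[j - 1]
    if n > 0:
        pip[n - 1] = pi[n - 1]       # pi'[n] = pi[n]
    return pip


def pi_from_pi_prime(pip):
    """The inverse direction: uses pi[n] = pi'[n] and works backwards.
    If pi[j+1] > pi'[j] + 1 the second case of (2) must have applied
    (otherwise pi[j+1] <= pi[j]+1 = pi'[j]+1 would be violated)."""
    n = len(pip)
    pi = [0] * n
    if n == 0:
        return pi
    pi[n - 1] = pip[n - 1]
    for j in range(n - 2, -1, -1):
        if pi[j + 1] > pip[j] + 1:   # second case: pi[j] = pi[j+1] - 1
            pi[j] = pi[j + 1] - 1
        else:                        # first case: pi[j] = pi'[j]
            pi[j] = pip[j]
    return pi
```

Note how pi_from_pi_prime starts from the last entry, which is exactly why it is unsuitable as-is for an online algorithm.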
For two arrays of numbers A and B, we write \(A[i_{a} \mathinner {\ldotp \ldotp }i_{a}+k] \geq B[i_{b} \mathinner {\ldotp \ldotp }i_{b}+k]\) when A[i _{ a }+j]≥B[i _{ b }+j] for j=0,…,k.
2.1 Border Array Validation
Our algorithm uses an algorithm validating the input table as the border array. For completeness, we supply the code of one of the simplest such algorithms Validateπ, see Algorithm 4, due to Duval et al. [9]. This algorithm is online and also calculates the minimal size of the required alphabet.
Roughly speaking, given a valid border array \(A[1 \mathinner {\ldotp \ldotp }n] \), Validateπ computes all valid π-candidates for A[n+1]: given a valid border array \(A[1\mathinner {\ldotp \ldotp }n]\), the next element A[n+1] is a valid π-candidate if \(A[1 \mathinner {\ldotp \ldotp }n+1]\) is a valid border array as well. The exact formula for the set of valid candidates is not useful for us, though it should be noted that it depends only on \(A[1 \mathinner {\ldotp \ldotp }n]\) and that 0 and A[n]+1 are always valid π-candidates.
The key idea needed to understand the algorithm is that w[i] depends only on the letters of w at positions \(A^{(k)}[i-1]+1\) for k=1,2,… . Thus the algorithm stores Σ[i], the alphabet size required for such a sequence of indices starting at i, for all i. The minimum size of the alphabet required for the whole array A is the maximum over all those values.
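The ideas behind Validateπ can be illustrated by the following simplified validator, which greedily reconstructs a witness word letter by letter. It runs in quadratic time, unlike the linear online original, and the greedy smallest-letter choice for minimising the alphabet is our own simplification of the Σ[i] bookkeeping:

```python
def validate_border_array(A):
    """Is A a valid border array (1-indexed values given as a list)?
    If so, also return the minimum alphabet size.  Quadratic sketch."""
    n = len(A)
    pi = [0] * (n + 1)   # accepted values pi[1..n]
    w = [0] * (n + 1)    # greedily reconstructed word; letters are ints
    alphabet = 0
    for i in range(1, n + 1):
        v = A[i - 1]
        if i == 1:
            if v != 0:
                return (False, 0)
            w[1], alphabet = 0, 1
            continue
        # all proper border lengths of w[1..i-1]: iterate pi from pi[i-1]
        chain = []
        b = pi[i - 1]
        while b > 0:
            chain.append(b)
            b = pi[b]
        chain.append(0)
        if v > 0:
            # v-1 must be a border of w[1..i-1], and no longer border
            # may extend with the same letter w[v]
            if v - 1 not in chain:
                return (False, 0)
            if any(b > v - 1 and w[b + 1] == w[v] for b in chain):
                return (False, 0)
            w[i] = w[v]
        else:
            # w[i] must differ from every letter extending a border
            forbidden = {w[b + 1] for b in chain}
            c = 0
            while c in forbidden:
                c += 1
            w[i] = c
            alphabet = max(alphabet, c + 1)
        pi[i] = v
    return (True, alphabet)
```

New letters are introduced only at positions with A[i]=0, in line with property Val3 below.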
For future reference we list some properties that follow from Validateπ:

(Val1): the valid candidates for π[i] depend only on \(\pi [1 \mathinner {\ldotp \ldotp }i-1]\),

(Val2): π[i−1]+1 is always a valid candidate for π[i],

(Val3): if the alphabet needed for \(A[1 \mathinner {\ldotp \ldotp }n]\) is strictly larger than the one needed for \(A[1 \mathinner {\ldotp \ldotp }n-1]\), then A[n]=0.
3 Overview of the Algorithm
Since there is a bijection between valid border arrays and valid strict border arrays, it is natural to proceed as follows: assume the input forms a valid strict border array, compute the corresponding border array using πFromπ′(A′), and validate the result using Validateπ(A). Unfortunately, πFromπ′ starts the calculations from the last entry of A′, so it is not suitable for an online algorithm. Moreover, it assumes that A′[n]=A[n], which may not be true for some intermediate values of i. Removing this condition invalidates the bijection and, as a consequence, for intermediate values of i there can be many border arrays consistent with \(A'[1\mathinner {\ldotp \ldotp }i]\), each of them corresponding to a different value of A[i+1]. We show that all these border arrays coincide on a certain prefix. Validateπ′, demonstrated in Algorithm 5, identifies this prefix and runs Validateπ on it. Concerning the remaining suffix, Validateπ′ identifies the border array which is maximal on it, in a sense explained below.
Definition 1
(Consistent functions)
We say that \(A[1 \mathinner {\ldotp \ldotp }n+1]\) is consistent with \(A'[1 \mathinner {\ldotp \ldotp }n]\) if and only if there is a word \(w[1\mathinner {\ldotp \ldotp }n+1]\) such that:

(CF1): \(A[1\mathinner {\ldotp \ldotp }n+1] = \pi_{w}[1 \mathinner {\ldotp \ldotp }n+1]\),

(CF2): \(A'[1\mathinner {\ldotp \ldotp }n] = \pi'_{w}[1\mathinner {\ldotp \ldotp }n]\).

A function \(A[1\mathinner {\ldotp \ldotp }n+1]\) consistent with \(A'[1 \mathinner {\ldotp \ldotp }n]\) is maximal if

(CF3): every \(B[1 \mathinner {\ldotp \ldotp }n+1]\) consistent with \(A'[1\mathinner {\ldotp \ldotp }n]\) satisfies \(B[1 \mathinner {\ldotp \ldotp }n+1] \leq A[1 \mathinner {\ldotp \ldotp } n+1]\).
Note that it is crucial that A is defined also on n+1.
Our algorithm Validateπ′ (and its improved variant LinearValidateπ′) maintains such a maximal A.
Slopes and Their Properties
Imagine the array A′ as the set of points (i,A′[i]) on the plane; we think of A in a similar way. Such a picture helps in understanding the idea behind the algorithm. In this setting we think of A as a collection of maximal slopes: a set of indices i,i+1,…,i+j is a slope if A[i+k]=A[i]+k for k=1,…,j. From here on, whenever we refer to a slope, we implicitly mean a maximal one, i.e., one extending as far as possible in both directions. Note that n+1 is part of the last slope, which may consist only of n+1. It is even better to imagine a slope as a collection of points (i,A[i]) which together span one interval on the plane, see Fig. 1. Observe also that A[i+j+1]≠A[i+j]+1 implies A[i+j]=A′[i+j], by (1), i.e., the last index of a (maximal) slope is the unique one on which A[i+j]=A′[i+j]. Let the pin be the first position on the last slope of A (in some extreme cases it might be that n+1 is the pin). Validateπ′ calculates and stores the pin. It turns out that all functions consistent with A′ differ from A only on the last slope, as shown later in Lemma 1.
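The slope decomposition and the pin are straightforward to read off an array; a small illustration (function names are ours):

```python
def slope_starts(A):
    """1-indexed positions of A that begin a maximal slope: position p
    starts a slope unless A[p] == A[p-1] + 1."""
    return [p for p in range(1, len(A) + 1)
            if p == 1 or A[p - 1] != A[p - 2] + 1]


def pin(A):
    """The first position of the last maximal slope of A."""
    return slope_starts(A)[-1]
```

For example, the border array of abaabab decomposes into slopes starting at positions 1, 2, 4 and 7.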
When a new input value A′[n] is read, the values of A and A′ on the last slope \([i \mathinner {\ldotp \ldotp }n+1]\) should satisfy the following conditions:

$$A[j] > A'[j] \quad\text{for all } j \in[i \mathinner {\ldotp \ldotp }n] , \qquad (3a)$$

$$A'[i \mathinner {\ldotp \ldotp }n] = A'[A[i] \mathinner {\ldotp \ldotp }A[i]+(n-i)] . \qquad (3b)$$
The last slope is defined correctly if and only if (3a) holds (otherwise the slope should end earlier), while the values of A and A′ on the last slope are consistent if and only if (3b) holds. These conditions are checked by appropriate queries: (3a) by the pin value check (denoted PinValueCheck), which returns any \(j \in[i \mathinner {\ldotp \ldotp }n]\) such that A′[j]>A[j] or, if there is no such j, the smallest \(j \in[i \mathinner {\ldotp \ldotp }n]\) such that A′[j]=A[j]; and (3b) by the consistency check (denoted ConsistencyCheck), which checks whether \(A'[i \mathinner {\ldotp \ldotp }n] = A'[A[i] \mathinner {\ldotp \ldotp }A[i] + (ni)]\).
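Both queries have a simple linear-time specification. The following naive sketch (1-indexed positions passed over Python lists, assuming A[i] ≥ 1 in ConsistencyCheck; names are ours) states what the efficient data structures of the later sections must answer:

```python
def pin_value_check(A, Ap, i, n):
    """Return some j in [i..n] with A'[j] > A[j]; failing that, the
    smallest j in [i..n] with A'[j] == A[j]; failing that, None.
    A is the maintained border array, Ap the input A'."""
    for j in range(i, n + 1):
        if Ap[j - 1] > A[j - 1]:
            return j
    for j in range(i, n + 1):
        if Ap[j - 1] == A[j - 1]:
            return j
    return None


def consistency_check(A, Ap, i, n):
    """Does A'[i..n] == A'[A[i]..A[i]+(n-i)] hold?  (Assumes A[i] >= 1.)"""
    k = A[i - 1]
    return Ap[i - 1:n] == Ap[k - 1:k - 1 + (n - i + 1)]
```

The paper answers the first query in amortised constant time (Sect. 5) and the second in constant time after preprocessing (Sects. 6 and 8); the loops above are only a specification.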
If one of the conditions (3a), (3b) does not hold, Validateπ′ adjusts the last slope of A, until both conditions hold or the input is reported as invalid. These actions are given in detail in Algorithm 6.
If the pin value check returns an index j such that A′[j]>A[j], then we reject the input and report an error: since A is the maximal consistent function, for each consistent function A _{1} it also holds that A _{1}[j]≤A[j]<A′[j], and so no such A _{1} exists, i.e., A′ is invalid.
If A′[j]=A[j], we break the last slope in two: \([i \mathinner {\ldotp \ldotp }j]\) and \([j+1 \mathinner {\ldotp \ldotp }n]\), the new last slope, see Fig. 2: for every A _{1} consistent with A′ it holds that A _{1}[j]≥A′[j]=A[j], but as A is maximal consistent with A′, it also holds that A _{1}[j]≤A[j], and hence A _{1}[j]=A[j]. We also check whether

$$A'[p] = A'\bigl[A[p]\bigr] \quad\text{for all } p \in[i \mathinner {\ldotp \ldotp }j-1]$$

holds. If not, we reject: every table A _{1} consistent with A′ satisfies A _{1}[j]=A[j]=A′[j], and therefore A and A _{1} have to be equal on all preceding values as well, see Lemma 1. Next we set i to j+1 and A[i] to the largest valid candidate value for π[i].
If ConsistencyCheck fails, then we set the value of A[i] to the next valid candidate value for π[i], see Fig. 3, and propagate the change along the whole slope. If this happens for A[i]=0, then there is no further candidate value, and A′ is rejected. The idea is that some adjustment is needed, and since the pin value check does not return an index, we cannot break the slope in two, so the only possibility is to decrement A on the whole last slope.
Unfortunately, this simple combinatorial idea alone fails to produce a lineartime algorithm. The problem is caused by the second condition: large segments of A′ should be compared in amortised constant time. While LCA queries on suffix trees seem ideal for this task, available solutions are imperfect: the online suffix tree construction algorithms [19, 24] are linear only for alphabets of constant size, while the only lineartime algorithm for larger alphabets [10] is inherently offline. To overcome this obstacle we specialise the data structures used, building the suffix tree for compressed encoding of A′ and multiple suffix trees for short texts over polylogarithmic alphabet. The details are presented in Sect. 8.
4 Details and Correctness
In this section we present technical details of the algorithm, provide a proof of its correctness and proofs of used combinatorial properties. We do not address the running time and the way the data structures are organised. We start with showing that all the consistent tables coincide on indices smaller than pin.
Lemma 1
Let \(A[1\mathinner {\ldotp \ldotp }n+1] \geq B[1\mathinner {\ldotp \ldotp }n+1]\) be both consistent with \(A'[1 \mathinner {\ldotp \ldotp }n]\). Let i be the pin (for A). Then \(A[1 \mathinner {\ldotp \ldotp }i-1] = B[1 \mathinner {\ldotp \ldotp }i-1]\).
Proof
The claim holds vacuously when there is only one slope, i.e., i=1. If there are more, let i be the pin and consider i−1. Since it is the end of a slope, by (1) A′[i−1]=A[i−1]. On the other hand, consider \(B[1 \mathinner {\ldotp \ldotp }n+1]\) as in the statement of the lemma. By the assumption of the lemma, A[i−1]≥B[i−1]; moreover B[i−1]≥A′[i−1], as the border array of a word is never smaller than its strict border array. Thus

$$A'[i-1] \leq B[i-1] \leq A[i-1] = A'[i-1] ,$$

hence B[i−1]=A[i−1]. Let \(B[1 \mathinner {\ldotp \ldotp }n+1] = \pi_{w'}[1 \mathinner {\ldotp \ldotp }n+1]\) and \(A[1 \mathinner {\ldotp \ldotp }n+1] = \pi_{w}[1 \mathinner {\ldotp \ldotp }n+1]\). Using πFromπ′ we can uniquely recover \(\pi _{w'}[1\mathinner {\ldotp \ldotp }i-1]\) from \(\pi_{w'}'[1\mathinner {\ldotp \ldotp }i-1]\) and π _{ w′}[i−1], as well as \(\pi_{w}[1\mathinner {\ldotp \ldotp }i-1]\) from \(\pi_{w}'[1\mathinner {\ldotp \ldotp }i-1]\) and π _{ w }[i−1]. But since those pairs of values are the same,

$$B[1 \mathinner {\ldotp \ldotp }i-1] = A[1 \mathinner {\ldotp \ldotp }i-1] ,$$

which shows the claim of the lemma. □
Data Maintained
Validateπ′ stores:

- n, the number of values read so far,
- \(A'[1\mathinner {\ldotp \ldotp }n]\), the input read so far,
- i, the current pin,
- \(A[1\mathinner {\ldotp \ldotp }n+1]\), the maximal function consistent with \(A'[1\mathinner {\ldotp \ldotp }n]\), represented by:
  - \(A[1 \mathinner {\ldotp \ldotp }i-1]\), the fixed prefix,
  - A[i], the candidate value that may change.

Note that A[j] for j>i are not stored. These values are implicit, given by A[j]=A[i]+(j−i). In particular this means that decrementing A[i] results in decrementing the whole last slope.
Sets of Valid π Candidates and Validating A
Validateπ′ creates a border array A, which is always valid by construction. Nevertheless, it runs Validate\(\pi(A[1 \mathinner {\ldotp \ldotp }i-1])\). This way the set of valid candidates for π[i] is computed, as well as a word w over a minimal-size alphabet Σ such that \(\pi_{w} [1 \mathinner {\ldotp \ldotp }i-1] = A[1 \mathinner {\ldotp \ldotp }i-1]\).
In the remainder of this section it is shown that invariants CF1–CF3 are preserved by Validateπ′.
Lemma 2
If A′[n]=A′[A[n]], then no changes are done by Validateπ′ and CF1–CF3 are preserved.
Proof
Whenever a new symbol is read, Validateπ′ checks (3b) for j=n, i.e., whether A′[n]=A′[A[n]]. If it holds, then no changes are needed because:

- CF1 holds trivially: the implicit A[n+1]=A[n]+1 is always a valid value for π[n+1], see Val2.
- CF2 holds: as A′[n]<A[n] by (1), it is enough to check that A′[n]=A′[A[n]], which holds by (3b).
- CF3 holds: consider any \(B[1 \mathinner {\ldotp \ldotp }n+1]\) consistent with \(A'[1 \mathinner {\ldotp \ldotp }n]\). By the induction assumption CF3 holds for \(A[1 \mathinner {\ldotp \ldotp }n]\), hence B[n]≤A[n]. Therefore

$$B[n+1] \leq B[n]+1 \leq A[n]+1 = A[n+1] ,$$

which shows the last claim and thus completes the proof. □
Thus it is left to show that CF1–CF3 are preserved by AdjustLastSlope. We show that during the adjusting inside AdjustLastSlope CF1 and CF3 hold. To be more specific, CF1 alone means that A is always a valid border array, while CF3 means that it is greater than any border table consistent with A′ (this is assumed to hold vacuously if no consistent table exists). Finally, we show that CF3 holds when AdjustLastSlope ends adjusting the last slope, i.e., that then A is in fact consistent with A′.
For the completeness of the proof, we need also to show that if at any point A′ was reported to be invalid, it is in fact invalid.
Lemma 3
After each iteration of the loop in line 1 of AdjustLastSlope, CF1 and CF3 are preserved. Furthermore, if AdjustLastSlope rejects A′ in line 3 or 9, then A′ is invalid.
Proof
We show both claims by induction. In the following, let \(A_{1}[1 \mathinner {\ldotp \ldotp }n+1]\) be any table consistent with \(A'[1 \mathinner {\ldotp \ldotp }n]\).
For the induction base note that \(A[1\mathinner {\ldotp \ldotp }n]\) and \(A'[1\mathinner {\ldotp \ldotp }n1]\) satisfy CF1–CF3. To see that CF1 is satisfied by \(A[1\mathinner {\ldotp \ldotp }n+1]\) note that the assigned value A[n+1]=A[n]+1 is always a valid πvalue, so CF1 holds for \(A[1 \mathinner {\ldotp \ldotp }n+1]\). Similarly, for CF3 note that \(A[1 \mathinner {\ldotp \ldotp }n] \geq A_{1}[1 \mathinner {\ldotp \ldotp }n]\) and A[n+1]=A[n]+1≥A _{1}[n]+1≥A _{1}[n+1], which shows that CF3 holds for \(A[1\mathinner {\ldotp \ldotp }n+1]\). Additionally, the second claim of the Lemma holds vacuously for \(A[1\mathinner {\ldotp \ldotp }n+1]\), as so far it was not rejected.
Suppose that PinValueCheck returns no index j. Then by the induction assumption CF1 and CF3 hold, which ends the proof in this case.
Suppose that PinValueCheck returns j such that A[j]<A′[j]. Then, since CF3 is satisfied, A _{1}[j]≤A[j]<A′[j], i.e., A _{1} is not a valid π table. So no A _{1} is consistent with A′, which means that A′ is invalid, as reported by Validateπ′. This ends the proof in this case.
It is left to consider the case in which PinValueCheck returns j such that A[j]=A′[j]. Then CF1 is satisfied: A[j] is explicitly set to a valid π candidate, while for p>j the value A[p] is set to A[p−1]+1, which is always a valid π candidate by Val2. Furthermore, j is an end of a slope for A _{1}: by CF3, A _{1}[j]≤A[j]=A′[j], but as A _{1} is a valid π table consistent with A′, also A _{1}[j]≥A′[j]. So A′[j]=A _{1}[j] and therefore, by (2), j is an end of a slope for A _{1}. As a consequence, by Lemma 1, \(A[i \mathinner {\ldotp \ldotp }j]=A_{1}[i \mathinner {\ldotp \ldotp }j]\). Note that for \(p \in[i \mathinner {\ldotp \ldotp }j-1]\) it holds that A[p]>A′[p]: otherwise PinValueCheck would have returned such a p instead of j. Thus, by (1), A[p] and A′[p] should satisfy A′[p]=A′[A[p]], and this condition is verified by AdjustLastSlope in line 8. If this equation is not satisfied for some p, then clearly \(A'[i \mathinner {\ldotp \ldotp }j-1]\) is not consistent with \(A[i \mathinner {\ldotp \ldotp }j]\). Since \(A_{1}[i \mathinner {\ldotp \ldotp }j] = A[i \mathinner {\ldotp \ldotp }j]\), this shows that no such A _{1} exists, and consequently A′ is invalid. This shows the second subclaim.
Suppose that A′ was not rejected. It is left to show that CF3 is satisfied when PinValueCheck returns j such that A[j]=A′[j]. Since A _{1}[i] is a valid π value and A[i] is the maximal valid π value, A _{1}[i]≤A[i]. The implicit values A[p] for \(p \in[i+1 \mathinner {\ldotp \ldotp }n]\) satisfy A[p]=A[i]+(p−i). Since A _{1} is a valid π table, A _{1}[p]≤A _{1}[i]+(p−i) for p=i+1,…,n, and thus

$$A_{1}[p] \leq A_{1}[i]+(p-i) \leq A[i]+(p-i) = A[p] ,$$

and as A _{1} was chosen arbitrarily, CF3 holds. □
Lemma 4
Suppose that PinValueCheck returns no j and that A satisfies CF1 and CF3. If ConsistencyCheck returns false and A[i]=0 then A′ is invalid. Otherwise after adjusting in line 20 of AdjustLastSlope, CF1 and CF3 hold.
Proof
Let, as in the previous lemma, A _{1} denote any valid border array consistent with A′. Since A satisfies CF3, we know that A[i]≥A _{1}[i]. When A[i] is updated to the next largest valid π candidate, its new value is at least A _{1}[i] (as A _{1}[i] is itself a valid π value), and for each p>i we have

$$A[p] = A[i]+(p-i) \geq A_{1}[i]+(p-i) \geq A_{1}[p] ,$$

which shows that CF3 is preserved after the adjusting.
We now prove that in fact A[i]>A _{1}[i]. Suppose for the sake of contradiction that A[i]=A _{1}[i]. It is not possible that \(A[i \mathinner {\ldotp \ldotp }n+1] = A_{1}[i \mathinner {\ldotp \ldotp }n+1]\): since PinValueCheck returned no j, for each p≥i we have A _{1}[p]=A[p]>A′[p]. In such case by (2) it holds that A′[p]=A′[A _{1}[p]] but from the answer of the PinValueCheck we know that this is not the case.
Consider the smallest position, say p, such that A[p+1]>A _{1}[p+1]; such a position exists as \(A[i \mathinner {\ldotp \ldotp }n+1] \geq A_{1}[i \mathinner {\ldotp \ldotp }n+1]\) and \(A[i \mathinner {\ldotp \ldotp }n+1] \neq A_{1}[i \mathinner {\ldotp \ldotp }n+1]\). Now consider A _{1}[p]: since A _{1}[p+1]<A[p+1]=A _{1}[p]+1, by (2) this means that A _{1}[p]=A′[p]. This is a contradiction, as PinValueCheck should have returned this p.
Therefore, when ConsistencyCheck returns false, A _{1}[i]<A[i] for an arbitrary A _{1} consistent with A′. In particular, if A[i]=0, there is no such A _{1}, and hence A′ is invalid.
It is left to show that CF1 holds, i.e., that \(A[i \mathinner {\ldotp \ldotp }n+1]\) were all assigned valid candidates for π at their respective positions. This was addressed explicitly for A[i], while for p>i the assigned values are A[p−1]+1, which are always valid by Val2. □
The last lemma shows that when AdjustLastSlope finishes, CF2 is satisfied as well.
Lemma 5
When AdjustLastSlope finishes, CF2 is satisfied.
Proof
Recall the recursive formula (2) for π′. Its first case corresponds to j being the last element on a slope, and the second to the other j’s.

If A[j] is an explicit value and j is not an end of a slope, this formula is verified when A[j] is stored. If A[j] is explicit and j is an end of a slope, then the formula trivially holds.

If A[j] is an implicit value, i.e., such that j is on the last slope of A, then PinValueCheck guarantees that A[j]>A′[j], and so the second case of this formula should hold. This is verified by ConsistencyCheck. Hence CF2 holds when all adjustments are finished. □
The above four lemmata, Lemma 2–Lemma 5, together show the correctness of Validateπ′.
Theorem 1
Validateπ′ verifies whether A′ is a valid strict border array. If so, it supplies the maximal function A consistent with A′.
Proof
We proceed by induction on n. If n=0, then clearly A[1]=0, CF1–CF3 hold trivially, and A′ is a valid (empty) π′ array. If n>0 and no adjustments were done, CF1–CF3 hold by Lemma 2. So we consider the case when AdjustLastSlope was invoked.
By Lemma 3 and Lemma 4, if \(A'[1 \mathinner {\ldotp \ldotp }n]\) is rejected, it is invalid. So assume that \(A'[1 \mathinner {\ldotp \ldotp }n]\) was not rejected. We show that it is valid. As it was not rejected, by Lemma 3 and Lemma 4 the constructed table \(A[1 \mathinner {\ldotp \ldotp }n+1]\) together with \(A'[1 \mathinner {\ldotp \ldotp }n]\) satisfies CF1 and CF3. Moreover, by Lemma 5 they satisfy CF2 as well. Thus \(A[1 \mathinner {\ldotp \ldotp }n+1]\) is a valid border array for some word \(w[1 \mathinner {\ldotp \ldotp }n+1]\) and \(A'[1 \mathinner {\ldotp \ldotp }n]\) is a valid strong border array for the same word \(w[1 \mathinner {\ldotp \ldotp }n]\). □
In the following section we explain how to perform the pin value checks and consistency checks efficiently and bound the whole running time of the algorithm.
5 Performing Pin Value Checks
Consider the PinValueCheck and two indices j, j′ such that

$$j < j' \quad\text{and}\quad A'[j] - j < A'[j'] - j' . \qquad (4)$$

We call the relation defined in (4) domination: we say that j′ dominates j and write it as j≺j′. We will show that if j′≻j and j is an answer to PinValueCheck, then so is j′, consult Fig. 4. This observation allows us to keep a collection j _{1}<j _{2}<⋯<j _{ ℓ } of indices such that to perform the pin value check it is enough to see whether A[j _{1}]≤A′[j _{1}]. In particular, the answer can be given in constant time. Updates of this collection are done by removal of j _{1} when i becomes j _{1}+1, or by consecutive removals from the end of the list when a new A′[n] is read.
Domination Properties
As ≺ is an intersection of two transitive relations (the order on indices and the order on T, defined as T[j]=A′[j]−j), it is transitive.
Observe that if j≺j′, then A[j]≤A′[j] implies A[j′]<A′[j′]: as both j and j′ lie on the last slope, A[j′]−j′=A[j]−j, and so

$$A[j'] = A[j] + (j'-j) \leq A'[j] + (j'-j) < A'[j'] . \qquad (5)$$

Therefore if j is an answer to the pin value check, so is j′. As a consequence, we do not need to keep track of j as a potential answer to the PinValueCheck.
Data Stored
Validateπ′ stores a list of positions j _{1}<j _{2}<⋯<j _{ k } such that (for the sake of simplicity, let j _{0}=i, where i is the current pin):

$$j \prec j_{\ell} \quad\text{for every } j \in[j_{\ell-1}+1 \mathinner {\ldotp \ldotp }j_{\ell}-1] \text{ and every } \ell , \qquad (6a)$$

$$j_{\ell} \not\prec j_{\ell+1} \quad\text{for } \ell=1,\ldots,k-1 . \qquad (6b)$$
Answering PinValueCheck
When PinValueCheck is asked, we check whether A[j _{1}]≤A′[j _{1}] and return the answer. This way the PinValueCheck is answered in constant time. We show that evaluating this expression for other values of j is not needed, as if A′[j]≥A[j] for some j, then A′[j _{1}]≥A[j _{1}], and moreover if A′[j]>A[j], then also A′[j _{1}]>A[j _{1}].
Suppose that A′[j]≥A[j] for some \(j \in[j_{\ell-1}+1 \mathinner {\ldotp \ldotp }j_{\ell}-1]\). Since j _{ ℓ } dominates j, it holds that A′[j _{ ℓ }]>A[j _{ ℓ }], by (5). Suppose now that A′[j _{ ℓ }]≥A[j _{ ℓ }] for some j _{ ℓ }>j _{1}. Since j _{1}<j _{ ℓ } and j _{ ℓ } does not dominate j _{1},

$$A'[j_{1}] - j_{1} \geq A'[j_{\ell}] - j_{\ell} .$$

As j _{1} and j _{ ℓ } are on the last slope,

$$A[j_{1}] - j_{1} = A[j_{\ell}] - j_{\ell} ,$$

and hence

$$A'[j_{1}] - A[j_{1}] \geq A'[j_{\ell}] - A[j_{\ell}] \geq 0 ,$$

so j _{1} is a proper answer to the PinValueCheck. Similarly, A′[j _{ ℓ }]>A[j _{ ℓ }] implies A′[j _{1}]>A[j _{1}].
Update
We demonstrate that all updates of the list j _{1},…,j _{ k } can be done in \(\mathcal{O} (n)\) total time. When a new position n is read, we update the list by successively removing the j _{ ℓ }’s dominated by n from the end of the list. By routine calculations, if n≻j _{ ℓ }, then n≻j _{ ℓ+1} as well:

$$A'[n] - n > A'[j_{\ell}] - j_{\ell} \geq A'[j_{\ell+1}] - j_{\ell+1} ,$$

where the second inequality follows from (6b). Therefore the positions dominated by n form a suffix of the list, and we simply have to remove some tail from the list of j’s. Suppose that j _{ ℓ },…,j _{ k } were removed, so that n becomes the new last list element. It is left to show that (6a), (6b) are preserved after the removal. Consider first (6a). Take any \(j \in[j_{\ell-1}+1 \mathinner {\ldotp \ldotp }n-1]\). If j is one of the removed positions, then n≻j by the construction. Otherwise there is some j _{ ℓ′} such that \(j \in[j_{\ell'-1}+1 \mathinner {\ldotp \ldotp }j_{\ell'}-1]\), and by (6a) before the removal, j _{ ℓ′}≻j. Since by assumption n≻j _{ ℓ′}, by transitivity of ≻, also n≻j. As for (6b), it holds since j _{ ℓ−1}⊀n by the construction.
There is another possible update: when PinValueCheck returns j _{1}, then i←j _{1}+1, and so j _{1}+1 becomes the new pin. In this case we remove j _{1} from the list.
As each position enters and leaves the list at most once, the total update time is linear.
Lemma 6
All PinValueCheck calls can be made in amortised constant time.
6 Performing Consistency Checks: Slow but Easy
In order to perform consistency checks we need to efficiently support two operations: appending a letter to the current text \(A'[1 \mathinner {\ldotp \ldotp }n]\) and checking whether two fragments of the prefix read so far are equal. First we show how to implement both of them using randomisation, so that the expected running time is \(\mathcal{O}(\log n)\) per consistency check. In the next section we improve the running time to (deterministic) \(\mathcal{O}(1)\).
We use the standard labeling technique [16], assigning unique small names to all fragments whose lengths are powers of two. More formally, let name[i][j] be an integer from {1,…,n} such that name[i][j]=name[i′][j] if and only if \(A'[i \mathinner {\ldotp \ldotp }i+2^{j}-1]=A'[i' \mathinner {\ldotp \ldotp }i'+2^{j}-1]\). Then checking if any two fragments of A′ are the same is easy: we only need to cover both of them with fragments of length 2^{j}, where 2^{j} is the largest power of two not exceeding their length. Then we check if the corresponding fragments of length 2^{j} are the same in constant time using the previously assigned names.
Appending a new letter A′[n+1] is more difficult, as we need to compute name[n−2^{j}+2][j] for all j=1,…,logn. We set name[n+1][0] to A′[n+1]. For names with j>0 we need to check if a given fragment of text \(A'[n-2^{j}+2 \mathinner {\ldotp \ldotp }n+1]\) occurs at some earlier position, and if so, choose the same name. To locate the previous occurrences, for each j>0 we keep a dictionary M(j) mapping the pair (name[i][j−1],name[i+2^{j−1}][j−1]) to name[i][j]. To check if a given fragment \(A'[n-2^{j}+2 \mathinner {\ldotp \ldotp }n+1]\) occurs earlier in the text, we look up the pair (name[n−2^{j}+2][j−1],name[n−2^{j−1}+2][j−1]) in M(j). If there is such an element in M(j), we set name[n−2^{j}+2][j] equal to the corresponding name. Otherwise we set name[n−2^{j}+2][j] equal to the size of M(j) plus 1, which is the smallest integer that we have not yet assigned as a name of a fragment of length 2^{j}. Then we update the dictionary accordingly: we insert the mapping from (name[n−2^{j}+2][j−1],name[n−2^{j−1}+2][j−1]) to the newly added name.
To implement the dictionaries M(j), we use dynamic hashing with worst-case constant time lookup and amortised expected constant time updates (see [7], or a simpler variant with the same performance bounds [22]). Then the expected running time of the whole algorithm becomes \(\mathcal{O}(n\log n)\), as there are logn dictionaries, each running in expected linear time (the expectation is taken over the random choices of the algorithm).
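The naming scheme described above can be rendered in Python as follows. This is an illustrative sketch, not the paper's implementation: it uses 0-based indexing, Python dicts stand in for the dynamic hash tables of [7, 22], and the class name `Namer` is ours.

```python
class Namer:
    """Online naming of all fragments whose lengths are powers of two."""

    def __init__(self):
        self.a = []      # the text read so far
        self.name = []   # name[j][i] = name of the fragment of length 2^j starting at i
        self.m = []      # m[j]: dict mapping a pair of level-(j-1) names to a level-j name

    def append(self, letter):
        self.a.append(letter)
        n = len(self.a)
        if not self.name:
            self.name.append([])
            self.m.append({})
        # level 0: the letter itself serves as its own name
        self.name[0].append(letter)
        # level j: exactly one new fragment of length 2^j ends at the new letter
        j = 1
        while (1 << j) <= n:
            if len(self.name) == j:
                self.name.append([])
                self.m.append({})
            i = n - (1 << j)  # start of the new fragment of length 2^j
            key = (self.name[j - 1][i], self.name[j - 1][i + (1 << (j - 1))])
            if key not in self.m[j]:
                self.m[j][key] = len(self.m[j]) + 1  # smallest unused name
            self.name[j].append(self.m[j][key])
            j += 1

    def equal(self, i, i2, length):
        """Is a[i .. i+length-1] == a[i2 .. i2+length-1]?"""
        if length == 0:
            return True
        j = length.bit_length() - 1  # largest power of two not exceeding length
        half = 1 << j
        # cover each fragment with two (possibly overlapping) fragments of length 2^j
        return (self.name[j][i] == self.name[j][i2] and
                self.name[j][i + length - half] == self.name[j][i2 + length - half])
```

Each `append` touches one fragment per level, so it performs O(log n) expected-time dictionary operations, matching the bound stated above.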
7 Size of the Alphabet
Validateπ not only answers whether the input table is a valid border array, but also returns the minimum size of the required alphabet. We show that the same holds for Validateπ′. Roughly speaking, Validateπ′ runs Validateπ and simply returns its answer. To this end we show that the minimum alphabet size required by the fixed prefix of A matches the minimum alphabet size required by A′.
Lemma 7
Let \(A'[1 \mathinner {\ldotp \ldotp }n]\) be a valid π′ function, \(A[1 \mathinner {\ldotp \ldotp }n+1]\) the maximal function consistent with \(A'[1 \mathinner {\ldotp \ldotp }n]\), and i the pin. The minimum alphabet size required by \(A'[1\mathinner {\ldotp \ldotp }n]\) equals the minimum alphabet size required by \(A[1\mathinner {\ldotp \ldotp }i-1]\) if A[i]>0, and by \(A[1\mathinner {\ldotp \ldotp }i]\) if A[i]=0.
Proof
Suppose first that A[i]>0. Then Validateπ run on \(A[1 \mathinner {\ldotp \ldotp }n]\) returns the same required alphabet size as when run on \(A[1 \mathinner {\ldotp \ldotp }i-1]\), since new letters are needed only at positions j with A[j]=0, see Val3, and A[j]>0 for every j on the last slope. Consider any \(B[1 \mathinner {\ldotp \ldotp }n+1]\) consistent with \(A'[1 \mathinner {\ldotp \ldotp }n]\). Then \(B[1 \mathinner {\ldotp \ldotp }i-1] = A[1 \mathinner {\ldotp \ldotp }i-1]\) by Lemma 1. Thus A requires an alphabet no larger than the one required by \(B[1 \mathinner {\ldotp \ldotp }i-1]\), which is clearly no larger than the one required by the whole \(B[1 \mathinner {\ldotp \ldotp }n]\).
Suppose now that A[i]=0. Then, for any \(B[1 \mathinner {\ldotp \ldotp }n+1]\) consistent with \(A'[1 \mathinner {\ldotp \ldotp }n]\), we have \(A[ 1 \mathinner {\ldotp \ldotp }i] = B[1 \mathinner {\ldotp \ldotp }i]\) by CF3. Since A[j]>0 for j>i, the same argument as previously works. □
Note that Validateπ′ runs Validateπ either on \(A[1\mathinner {\ldotp \ldotp }i-1]\) when A[i]>0, or on \(A[1\mathinner {\ldotp \ldotp }i]\) when A[i]=0. In either case, all these values are fixed, and thus no position of A is inspected twice by Validateπ.
We further note that Lemma 7 implies that the minimum alphabet size required for a valid strict border array is at most as large as the one required for a border array. The latter is known to be \(\mathcal{O} (\log n)\) [20, Th. 3.3a]. This observation implies the following.
Corollary 1
The minimum size of the alphabet required for a valid strict border array is \(\mathcal{O} (\log n)\).
8 Improving the Running Time to Linear
This section describes our linear-time online algorithm LinearValidateπ′ by specifying the necessary changes to Validateπ′. It suffices to show how to perform consistency checks more efficiently, as every other operation works in amortised constant time. A natural approach is as follows: construct a suffix tree [10, 19, 24] for the input table \(A'[1 \mathinner {\ldotp \ldotp }n]\), together with a data structure for answering LCA queries [3]. The best known algorithm for constructing the suffix tree runs in linear time, regardless of the size of the alphabet [10]. Unfortunately, this algorithm, and all other linear-time solutions we are aware of, are inherently offline, and as such unsuitable for our purposes. The online suffix tree constructions of [19, 24] have a slightly larger running time of \(\mathcal{O} (n \log\varSigma)\), where Σ is the size of the alphabet. As A′ is a text over the alphabet {−1,0,…,n−1}, i.e., of size n+1, these constructions would only guarantee \(\mathcal{O}(n\log n)\) time.
To get a linear-time algorithm we exploit both the structure of the π′ array and the relationship between subsequent consistency checks. In more detail, we first demonstrate how to improve Ukkonen's algorithm [24] so that it runs in \(\mathcal{O} (n)\) time for alphabets of polylogarithmic size, which may be of independent interest. This alone is still not enough, since A′ is over an alphabet of linear size. To overcome this obstacle we use the combinatorial properties of A′ to compress it. The compressed table uses an alphabet of polylogarithmic size, which makes the improved version of Ukkonen's algorithm applicable. New problems arise, as the compressed table is a little harder to read, and further conditions need to be verified to answer the consistency checks.
8.1 Suffix Trees for Polylogarithmic Alphabet
In this section we present a construction of an online dictionary with constant time access and insertion, for t=logn elements. When used in Ukkonen’s algorithm [24], it guarantees the following construction of suffix trees.
Lemma 8
For any constant c, the suffix tree for a text of length n over an alphabet of size log^{c} n can be constructed online in \(\mathcal{O}(n)\) time. Given a vertex in the resulting tree, its child labeled by a specified letter can be retrieved in constant time.
The only reason Ukkonen's algorithm [24] does not work in linear time is that, given a vertex, it needs to efficiently retrieve the vertex's child labeled with a specified letter. If we are able to perform such a retrieval in constant time, Ukkonen's algorithm runs in linear time.
For that we could use the atomic heaps of Fredman and Willard [12], which allow constant time search and insert operations on a collection of \(\mathcal{O}(\sqrt{\log n})\)-element sets. This results in a fairly complicated structure, which can be greatly simplified in our case: not only are the sets small, but the size of the universe is bounded as well.
Simplifying Assumptions
We assume that the value of ⌊logn⌋ is known. Since n is not known in advance, we read the elements of A′ one by one and, as soon as the value of n doubles, we repeat the whole computation with the new value of ⌊logn⌋. This changes the running time only by a constant factor.
It is enough to give the construction for an alphabet of size logn, as for alphabets of size log^{c} n we can encode each letter as c characters over an alphabet of logarithmic size.
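This reduction is plain base conversion; a minimal sketch (the function name and the concrete sizes are only illustrative):

```python
def encode(letter, base, c):
    """Represent `letter` (from an alphabet of size base**c) as c digits
    over an alphabet of size `base`, most significant digit first."""
    digits = []
    for _ in range(c):
        digits.append(letter % base)
        letter //= base
    return digits[::-1]
```

Comparing two letters of the large alphabet then amounts to comparing their c-digit encodings, which preserves equality of fragments up to a constant-factor blow-up in length.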
First Step: Dictionary for Small Number of Elements
We implement an online dictionary for a universe of size logn. Both access and insert take constant time, and the memory usage is at most linear in the number of elements stored. The first step of the construction is the simpler case of t keys, for \(t \leq\sqrt{\log n}\). Then this construction is folded twice to obtain the general case of t=Θ(logn). One step of the construction is depicted in Fig. 5.
The indices of items currently present in the dictionary are encoded in one machine word, called the characteristic vector V, in which the bit V[i]=1 if and only if dictionary contains key i.
We store pointers to the keys in the dictionary in a dynamically resized pointer table, in order of their arrival times: whenever we insert a new item, its pointer is put right after the previously added one. Additionally, we keep a permutation table P that encodes the order in which the currently stored elements have been inserted. In other words, P[i] stores the position in the pointer table of the pointer to i. Since \(t \leq\sqrt{\log n}\), all successive values of such a permutation can be stored in one machine word.
Accessing the Information for Small Number of Elements
If we want to find the pointer to element number k, we first check whether V[k]=1. Then we find the index of k, i.e., j=#{k′≤k: V[k′]=1}. To do this, we mask out all the bits at positions larger than k, obtaining a vector V′; then j=#{k′: V′[k′]=1}. Computing j can be done by comparing V′ with a precomputed table. Then we look at position j in the permutation table: P[j] gives the address in the pointer table under which the pointer to k is stored. This gives us the desired key.
The precomputed tables can be obtained using standard techniques as well as deamortised in a standard way.
Updating the Information for Small Number of Elements
When a new key k arrives, it is stored in the memory at the next available position and a pointer to it is put into the dictionary: first we set V[k]=1 and insert the pointer at the last position of the pointer table. We also need to update the permutation table. To do this, we calculate j=#{k′<k: V[k′]=1} and m=#{k′: V[k′]=1}; this is done in the same way as when accessing a stored pointer. Then we change the permutation table: we move all the numbers at positions greater than j one position higher and write m+1 at position j. Since the whole permutation table fits in one machine word, this can be done in constant time: let P′ be the table P with all positions larger than j−1 masked out and P″ the table with all positions smaller than j masked out. Then we shift P″ one position higher and set P←P′∨P″. Finally we set P[j]=m+1.
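The access and update steps above can be sketched as follows. This is an illustrative rendering (the class name `SmallDict` and the field width are ours): the characteristic vector V and the packed permutation table P are kept as integers with `FIELD` bits per packed entry, Python's arbitrary-precision integers stand in for a machine word, and the popcounts computed here correspond to the precomputed-table lookups of the text.

```python
FIELD = 8                       # bits per packed permutation-table entry (illustrative)
MASK1 = (1 << FIELD) - 1

class SmallDict:
    def __init__(self):
        self.V = 0              # characteristic vector of present keys
        self.P = 0              # packed permutation table
        self.ptrs = []          # "pointers" (here: stored values) in arrival order

    def _rank(self, k):
        """Number of present keys strictly smaller than k."""
        return bin(self.V & ((1 << k) - 1)).count("1")

    def insert(self, k, value):
        j = self._rank(k)       # position of k among the present keys
        self.V |= 1 << k
        self.ptrs.append(value)
        m = len(self.ptrs)      # arrival number of k
        lo = self.P & ((1 << (j * FIELD)) - 1)            # entries at positions < j
        hi = (self.P >> (j * FIELD)) << ((j + 1) * FIELD) # entries >= j, shifted up
        self.P = hi | (m << (j * FIELD)) | lo             # write m at position j

    def get(self, k):
        if not (self.V >> k) & 1:
            return None
        j = self._rank(k)
        m = (self.P >> (j * FIELD)) & MASK1
        return self.ptrs[m - 1]
```

The shift-and-mask update of P mirrors the P′/P″ manipulation described above: one masked copy keeps the entries before position j, the other is shifted one field higher, and the new arrival number is written in between.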
Larger Number of Elements
When the number of items becomes larger, we fold the above construction twice (somewhat resembling a B-tree of order \(t = \sqrt{ \log n}\)): we choose a subset of keys k _{1}<k _{2}<⋯<k _{ ℓ } such that between k _{ j } and k _{ j+1} there are at least t and at most 2t other keys. Observe that k _{1}<k _{2}<⋯<k _{ ℓ } can be kept in the above structure with constant update and access time; we refer to it as the top structure. Moreover, for each i the keys between k _{ i } and k _{ i+1} can also be kept in such a structure. We refer to those structures as the bottom structures.
Access for Large Number of Elements
To access the information associated with a given key k, we first look up the largest chosen key smaller than k in the top structure and then look up k in the corresponding bottom structure. The second operation is already known to take constant amortised time. The first operation can be done in \(\mathcal{O}(1)\) time by first masking out the bits at positions larger than k in the top characteristic vector and then extracting the position of the largest set bit. Again, this can be done using standard techniques.
Update for Large Number of Elements
When we insert a new item k, we first find i such that k _{ i−1}≤k<k _{ i }, where k _{ i−1} and k _{ i } are elements of the top structure. This is done in the same way as when the information on k is accessed. Then k is inserted into the proper bottom structure.
If after an insertion the bottom structure has 2t+1 elements, we choose its middle element, insert it into the top structure, and split the keys into two parts consisting of t elements, creating two new bottom structures out of them. This requires \(\mathcal{O}(t)\) time but the amortised insertion time is only \(\mathcal{O}(1)\): the size of the bottom structure is t after the split and 2t before the next split, so we can charge the cost to the new t keys inserted into the tree before the splits.
8.2 Compressing A′
Lemma 8 does not apply to A′ directly, as it may hold too many different values. To overcome this, we compress A′ into Compress(A′), so that the resulting text is over a polylogarithmic alphabet and checking equality of two fragments of A′ can be performed by looking at the corresponding fragments of Compress(A′). To compress A′, we scan it from left to right. If A′[i]=A′[i−j] for some 1≤j≤log^{2} n we output #_{0} j. If A′[i]≤log^{2} n we output #_{1} A′[i]. Otherwise we output the binary encoding of A′[i] enclosed by #_{2} and #_{3}. For each i we store the position of its encoding in Compress(A′) in Start[i].
Note that we need to know whether the value A′[n] appeared within the last log^{2} n positions. To do this, we keep a table Prev such that Prev[i] gives the position of the last occurrence of i in A′ (or −1, if i has not appeared so far). It is easily updated in constant time: when we read A′[n] we set Prev[A′[n]] to n and Prev[n] to −1.
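The compression scheme can be sketched as follows. This is an illustrative rendering with 0-based indexing, the threshold B=log²n passed explicitly, the markers #_{0}–#_{3} rendered as tagged tuples, and `compress` being our name; a dict plays the role of the Prev table.

```python
def compress(a, B):
    """Compress the array `a`; values repeated within the last B positions
    become back-references, small values are output literally, and large
    values are spelled out in binary between #2 and #3 markers."""
    prev = {}                   # last position at which each value occurred (Prev)
    out, start = [], []
    for i, v in enumerate(a):
        start.append(len(out))  # Start[i]: where the encoding of a[i] begins
        if v in prev and i - prev[v] <= B:
            out.append(('#0', i - prev[v]))           # back-reference
        elif v <= B:
            out.append(('#1', v))                     # small value, literal
        else:
            out.append(('#2',))                       # large value: binary encoding
            out.extend(('bit', int(b)) for b in bin(v)[2:])
            out.append(('#3',))
        prev[v] = i
    return out, start
```

Only the last branch emits more than one symbol per index, which is exactly the case bounded by Lemma 9 below.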
In this encoding, only the last case, with A′[i]>log^{2} n and A′[i] not occurring in \(A'[i-\log^{2}n \mathinner {\ldotp \ldotp }i-1]\), may result in more than one symbol of an alphabet of size O(log^{2} n). We show that the number of different large values of π′ is small, which allows bounding the total size of these encodings, and hence of the whole Compress(A′) table, by \(\mathcal{O} (n)\).
Lemma 9
Let k≥0 and consider a segment of 2^{k} consecutive entries in the π′ array. At most 48 different values from the interval [2^{k},2^{k+1}) occur in such a segment.
Proof
First note that each i such that π′[i]>0 corresponds to a nonextensible occurrence of the border \(w[1 \mathinner {\ldotp \ldotp }\pi'[i]]\), i.e., π′[i] is the maximum j such that \(w[1 \mathinner {\ldotp \ldotp }j]\) is a suffix of \(w[1 \mathinner {\ldotp \ldotp }i]\) and w[j+1]≠w[i+1].
If k<2 then the claim is trivial. So let k′=k−2≥0 and assume that there are more than 48 different values from [2^{k},2^{k+1})=[4⋅2^{k′},8⋅2^{k′}) occurring in some segment of length 2^{k}. Then more than 12 different values from [4⋅2^{k′},8⋅2^{k′}) occur in some segment of length 2^{k′}. Split the range [4⋅2^{k′},8⋅2^{k′}) into three subranges [4⋅2^{k′},5⋅2^{k′}), [5⋅2^{k′},6⋅2^{k′}) and [6⋅2^{k′},8⋅2^{k′}). Then at least 5 different values from one of these subranges occur in the segment; let [ℓ,r) be that subrange. Note that (no matter which one it is)
$$r - \ell\leq\frac{\ell}{2} - 2^{k'} . $$
Let these 5 different values occur at positions p _{1}<⋯<p _{5}. Consider the sequence p _{ i }−π′[p _{ i }]+1 for i=1,…,5: these are the beginnings of the corresponding nonextensible borders. In particular, the values p _{ i }−π′[p _{ i }]+1 are pairwise different (since they are beginnings of nonextensible occurrences of borders). Every sequence of 5 distinct values contains a monotone subsequence of length 3. We consider the cases of a decreasing and an increasing subsequence separately:

1.
There exist \(p_{i_{1}}<p_{i_{2}}<p_{i_{3}}\) in this segment such that
$$p_{i_1} - \pi'[p_{i_1}] + 1> p_{i_2} - \pi'[p_{i_2}] + 1> p_{i_3} - \pi'[p_{i_3}] + 1. $$Define \(x = w[p_{i_{1}}+1]\) and \(y = w[\pi'[p_{i_{1}}]+1]\), see Fig. 6. Then by the definition of \(\pi'[p_{i_{1}}]\), x≠y. We derive a contradiction by showing that x=y. To this end we use the periodicity of the word w. Define
$$\begin{aligned} a &= \bigl(p_{i_2} - \pi'[p_{i_2}]+1\bigr) - \bigl(p_{i_3} - \pi'[p_{i_3}]+1\bigr) , \\ b &= \bigl(p_{i_1} - \pi'[p_{i_1}]+1\bigr) - \bigl(p_{i_3} - \pi'[p_{i_3}]+1\bigr) , \\ s &= \pi'[p_{i_1}] + b , \end{aligned}$$see Fig. 6. Then both a and b are periods of \(w[1\mathinner {\ldotp \ldotp }s]\), see Fig. 6. We show that \(a,b \leq\frac{s}{2}\), so that the periodicity lemma can be applied to them and the word \(w[1\mathinner {\ldotp \ldotp }s]\).
$$\begin{aligned} a < b & = \bigl(p_{i_1} - \pi'[p_{i_1}]\bigr) - \bigl(p_{i_3} - \pi'[p_{i_3}]\bigr) < \pi'[p_{i_3}] - \pi'[p_{i_1}] \\ & \leq r-\ell < \frac{\ell}{2} . \end{aligned}$$Since \(s = \pi'[p_{i_1}] + b \) and \(\pi'[p_{i_1}] \in [\ell,r )\) we obtain s>ℓ. Thus
$$a < b < \frac{s}{2} . $$By the periodicity lemma b−a is also a period of \(w[1\mathinner {\ldotp \ldotp }s]\). As position \(p_{i_{1}}+1\) is covered by the nonextensible border ending at \(p_{i_{2}}\) (note that \(b < \frac{\ell}{2}\) and \(\pi'[p_{i_{1}}] \geq\ell\)):
$$x = w[p_{i_1}+1] = w\bigl[\pi'[p_{i_1}]+1+(b-a) \bigr] , $$see Fig. 7. Note that
$$\pi'[p_{i_1}]+1+(b-a) \leq\pi'[p_{i_1}]+b = s $$and so \(w[\pi'[p_{i_{1}}]+1+(b-a)]\) is a letter of the word \(w[1 \mathinner {\ldotp \ldotp }s]\), which has period b−a. Hence
$$x = w\bigl[\pi'[p_{i_1}]+1+(b-a)\bigr] = w\bigl[ \pi'[p_{i_1}]+1\bigr] = y , $$a contradiction.

2.
There exist \(p_{i_{1}}<p_{i_{2}}<p_{i_{3}}\) in this segment such that
$$p_{i_1} - \pi'[p_{i_1}] + 1 < p_{i_2} - \pi'[p_{i_2}] + 1 < p_{i_3} - \pi'[p_{i_3}] + 1 , $$see Fig. 8.
By assumption \(\pi'[p_{i_{1}}], \pi'[p_{i_{2}}] \geq\ell\). We identify the periods of the corresponding subwords \(w[1 \mathinner {\ldotp \ldotp }\pi'[p_{i_{1}}]]\) and \(w[1 \mathinner {\ldotp \ldotp }\pi'[p_{i_{2}}]]\), respectively:
$$\begin{aligned} a &= \bigl(p_{i_2} - \pi'[p_{i_2}]+1\bigr) - \bigl(p_{i_1} - \pi'[p_{i_1}]+1\bigr) , \\ b &= \bigl(p_{i_3} - \pi'[p_{i_3}]+1\bigr) - \bigl(p_{i_2} - \pi'[p_{i_2}]+1\bigr), \end{aligned}$$as depicted in Fig. 8. We estimate their sum:
$$\begin{aligned} a + b &= \bigl(p_{i_2} - \pi'[p_{i_2}]\bigr) - \bigl(p_{i_1} - \pi'[p_{i_1}]\bigr) + \bigl(p_{i_3} - \pi'[p_{i_3}]\bigr) - \bigl(p_{i_2} - \pi'[p_{i_2}]\bigr) \\ &= \bigl(p_{i_3} - \pi'[p_{i_3}]\bigr) - \bigl(p_{i_1} - \pi'[p_{i_1}]\bigr) \\ &= \bigl(\pi'[p_{i_1}] - \pi'[p_{i_3}] \bigr) + (p_{i_3} - p_{i_1}) \\ &\leq(r - \ell) + 2^{k'} \leq\biggl(\frac{1}{2}\ell-2^{k'}\biggr) + 2^{k'}= \frac{\ell}{2} . \end{aligned}$$Since \(\ell\leq\pi'[p_{i_{1}}], \pi'[p_{i_{2}}]\), we obtain that
$$ a + b \leq\frac{\pi'[p_{i_1}]}{2} , \frac{\pi'[p_{i_2}]}{2} . $$(7)There are two subcases, depending on whether \(\pi'[p_{i_{1}}] < \pi '[p_{i_{2}}] \) or \(\pi'[p_{i_{1}}] > \pi'[p_{i_{2}}]\):

(a)
\(\pi'[p_{i_{1}}]<\pi'[p_{i_{2}}]\): Define \(x = w[p_{i_{1}}+1]\) and \(y= w[\pi'[p_{i_{1}}]+1]\), see Fig. 9. Then by definition of \(\pi'[p_{i_{1}}]\), x≠y. We obtain a contradiction by showing that x=y.
Since the nonextensible border ending at \(p_{i_{3}}\) spans over position \(p_{i_{1}}+1\) and \(a+b < \pi'[p_{i_{1}}]\) (see (7)) it holds that
$$ x = w\bigl[\bigl(\pi'[p_{i_1}] +1\bigr) - (a+b)\bigr] . $$(8)Comparing the nonextensible borders ending at \(p_{i_{2}}\) and \(p_{i_{3}}\) we deduce that b is a period of \(w[1\mathinner {\ldotp \ldotp }\pi'[p_{i_{2}}]]\) and as \(\pi '[p_{i_{1}}]+1 \leq\pi'[p_{i_{2}}]\),
$$y = w\bigl[\pi'[p_{i_1}]+1\bigr]=w\bigl[ \pi'[p_{i_1}]+1-b\bigr] . $$Similarly, by comparing the nonextensible prefixes ending at \(p_{i_{1}}\) and \(p_{i_{2}}\) we deduce that a is a period of \(w[1\mathinner {\ldotp \ldotp }\pi'[p_{i_{1}}]]\). Thus
$$ y = w\bigl[\pi'[p_{i_1}]+1-b\bigr]=w\bigl[ \pi'[p_{i_1}]+1-b-a\bigr] . $$(9)Comparing (8) and (9) we obtain x=y, a contradiction.
(b)
\(\pi'[p_{i_{1}}]>\pi'[p_{i_{2}}]\): Let \(x' = w[p_{i_{2}}+1]\) and \(y'=w[\pi'[p_{i_{2}}]+1]\). Then x′≠y′ by the definition of \(\pi'[p_{i_{2}}]\), see Fig. 10. We show that x′=y′ and hence obtain a contradiction. Since nonextensible border ending at \(p_{i_{3}}\) spans over position \(p_{i_{2}}+1\), we obtain that
$$ x' = w\bigl[\pi'[p_{i_2}]-b+1 \bigr] , $$(10)see Fig. 10. By comparing the nonextensible prefixes ending at \(p_{i_{1}}\) and \(p_{i_{2}}\) we deduce that a is a period of \(w[1\mathinner {\ldotp \ldotp }\pi'[p_{i_{1}}]]\). As \(\pi'[p_{i_{2}}] + 1 \leq\pi'[p_{i_{1}}]\),
$$ y'=w\bigl[\pi'[p_{i_2}]+1\bigr]=w\bigl[ \pi'[p_{i_2}]+1-a\bigr] . $$By comparing the nonextensible prefixes ending at \(p_{i_{2}}\) and \(p_{i_{3}}\) we deduce that b is a period of \(w[1\mathinner {\ldotp \ldotp }\pi'[p_{i_{2}}]]\). Since \(a + b\leq\frac{\pi'[p_{i_{2}}]}{2}\) by (7), it holds that
$$y' = w\bigl[\pi'[p_{i_2}]+1-a\bigr]=w\bigl[ \pi'[p_{i_2}]+1-a-b\bigr] . $$As a is a period of \(w[1\mathinner {\ldotp \ldotp }\pi'[p_{i_{1}}]]\) and \(\pi'[p_{i_{1}}] > \pi'[p_{i_{2}}]\), it is also a period of \(w[1\mathinner {\ldotp \ldotp }\pi'[p_{i_{2}}]+1]\), hence
$$ y'=w\bigl[\pi'[p_{i_2}]+1-a-b \bigr]=w\bigl[\pi'[p_{i_2}]+1-b\bigr]. $$(11)Comparing (10) and (11) we obtain x′=y′, a contradiction. □

Lemma 9 can be used to bound the size of the compressed representation Compress(A′) of A′.
Corollary 2
Compress(A′) consists of \(\mathcal{O} (n)\) symbols over an alphabet of size \(\mathcal{O} (\log^{2} n)\).
Proof
To calculate the total length of the resulting text, observe that the only case resulting in a nonconstant number of output characters for a single index i is when A′[i]>log^{2} n and the value of A′[i] does not occur at any of the log^{2} n previous indices. By Lemma 9, applied with 2^{k}≥log^{2} n, any segment of log^{2} n consecutive indices contains at most 48 different values from [2^{k},2^{k+1}). For a single k there are \(\frac{n}{\log^{2} n}\) such segments of length log^{2} n, and encoding one value of A′ takes \(\mathcal{O}(\log n)\) characters in Compress(A′). As k takes at most logn values (from log(log^{2} n) to logn), the total number of characters used to describe all those values of A′[i] is at most
$$\log n \cdot48 \cdot\frac{n}{\log^{2} n} \cdot\mathcal{O} (\log n) = \mathcal{O} (n) , $$
so \(|\mathit{Compress}(A')|=\mathcal{O}(n)\). □
As the alphabet of Compress(A′) is of polylogarithmic size, the suffix tree for Compress(A′) can be constructed in linear time by Lemma 8.
8.3 Performing Consistency Checks on Compress(A′)
Subchecks
Consider a consistency check: is \(A'[j\mathinner {\ldotp \ldotp }j+k-1]=A'[i\mathinner {\ldotp \ldotp }i+k-1]\), where j=A[i]? We first establish the equivalence of this equality with the equality of appropriate fragments of Compress(A′). Note that A′[ℓ]=A′[ℓ′] does not imply the equality of the two corresponding fragments of Compress(A′), as they may refer to previous values of A′. Still, such references reach only log^{2} n elements backwards. This observation is formalised as follows:
Lemma 10
Let j=A[i]. Then
$$ A'\bigl[j\mathinner {\ldotp \ldotp }j+k-1\bigr]=A'\bigl[i\mathinner {\ldotp \ldotp }i+k-1\bigr] $$(12)if and only if
$$ \mathit{Compress}(A')\bigl[\mathit{Start}\bigl[j+\log^{2} n\bigr] \mathinner {\ldotp \ldotp }\mathit{Start}[j+k]-1\bigr] = \mathit{Compress}(A')\bigl[\mathit{Start}\bigl[i+\log^{2} n\bigr] \mathinner {\ldotp \ldotp }\mathit{Start}[i+k]-1\bigr] $$(13)and
$$ A'\bigl[j\mathinner {\ldotp \ldotp }j+\min\bigl(k,\log^{2} n\bigr)-1\bigr]=A'\bigl[i\mathinner {\ldotp \ldotp }i+\min\bigl(k,\log^{2} n\bigr)-1\bigr] . $$(14)
Proof
If k≤log^{2} n, the claim holds trivially, as (12) and (14) are exactly the same and (13) holds vacuously.
So suppose that k>log^{2} n.
(⇒) Suppose first that \(A'[j\mathinner {\ldotp \ldotp }j+k-1]=A'[i\mathinner {\ldotp \ldotp }i+k-1]\). Then of course \(A'[j\mathinner {\ldotp \ldotp }j+\log^{2} n-1]=A'[i\mathinner {\ldotp \ldotp }i+\log^{2} n-1]\), as k>log^{2} n by the case assumption. Thus (14) holds.
Note that \(\mathit{Compress}(A')[\mathit{Start}[j+\log^{2} n] \mathinner {\ldotp \ldotp }\mathit {Start}[j+k]-1] \) is created using only \(A'[j \mathinner {\ldotp \ldotp }j+k-1]\): when creating an entry corresponding to A′[ℓ] we can refer to A′[ℓ] and to at most log^{2} n elements before it. Similarly, \(\mathit{Compress}(A')[\mathit{Start}[i+\log^{2} n] \mathinner {\ldotp \ldotp }\mathit {Start}[i+k]-1]\) is created using \(A'[i \mathinner {\ldotp \ldotp }i+k-1]\) exclusively. Since \(A'[j \mathinner {\ldotp \ldotp }j+k-1] = A'[i \mathinner {\ldotp \ldotp }i+k-1]\), both fragments of Compress(A′) are created from the same input, and so they are equal. Thus (13) holds, which ends the proof in this direction.
(⇐) Assume that (13) and (14) hold. We show by a simple induction on ℓ that A′[i+ℓ]=A′[j+ℓ]. For ℓ<log^{2} n the claim is trivial, as it is explicitly stated in (14). So let ℓ≥log^{2} n. Consider \(\mathit{Compress}(A')[\mathit {Start}[i+\ell] \mathinner {\ldotp \ldotp }\mathit{Start}[i+\ell+1]-1]\) and \(\mathit{Compress}(A')[\mathit{Start}[j+\ell] \mathinner {\ldotp \ldotp }\mathit {Start}[j+\ell+1]-1]\); they are equal by the assumption.

If they are both equal to #_{0} m (i.e., both are equal to some value of A′ that is m≤log^{2} n positions earlier) then A′[i+ℓ]=A′[i+ℓ−m] and A′[j+ℓ]=A′[j+ℓ−m]; by the inductive assumption A′[i+ℓ−m]=A′[j+ℓ−m] (as m≤log^{2} n), which ends the case.

If they are both equal to #_{1} m (i.e., both are equal to m≤log^{2} n) then A′[i+ℓ]=A′[j+ℓ]=m.

If they are equal to #_{2} m _{1}…m _{ z }#_{3} (i.e., both are larger than log^{2} n and are both encoded in binary as m _{1}…m _{ z }) then m _{1}…m _{ z } encode some m in binary and A′[i+ℓ]=A′[j+ℓ]=m, which ends the last case.
□
As in Sect. 8.1, we assume that ⌊logn⌋ is known, and in the same way we repeat the whole computation from scratch as soon as its value changes. This increases the running time only by a constant factor.
We call the checks of the form (13) compressed consistency checks and the checks of the form (14) short consistency checks; when moreover i−j<log^{2} n, we call the latter near short consistency checks.
The compressed consistency checks can be answered in amortised constant time using LCA query [3] on the suffix tree built for Compress(A′). It remains to show how to perform short consistency checks in amortised constant time.
8.4 Performing Short Consistency Checks
Performing Near Short Consistency Checks
To answer near short consistency checks efficiently, we split A′ into blocks of log^{2} n consecutive letters: A′=B _{1} B _{2}…B _{ ℓ }, see Fig. 11. Then we build suffix trees for each pair of consecutive blocks, i.e., B _{1} B _{2},B _{2} B _{3},…,B _{ ℓ−1} B _{ ℓ }. Each block contains at most log^{2} n distinct values smaller than log^{2} n and, by Lemma 9, at most 48logn distinct larger values, so all the suffix trees can be built in linear time by Lemma 8. For each tree we also build a data structure supporting constant-time LCA queries [3]. Then, any near short consistency check reduces to an LCA query in one of these suffix trees. Such a query also gives the actual length of the longest common prefix of the two compared strings; this is used in performing short consistency checks.
Performing Short Consistency Checks
Consider again a short consistency check, which is of the form ‘does \(A'[i\mathinner {\ldotp \ldotp }i+k-1] = A'[j\mathinner {\ldotp \ldotp }j+k-1]\)?’, where j=A[i] and k≤log^{2} n. To improve the running time, the results of previous short consistency checks are reused: we store j _{ best } (one of the indices for which we previously ran a short consistency check) such that

j≤j _{ best }≤j+log^{2} n

the length (say L) of the common prefix of \(A'[i \mathinner {\ldotp \ldotp }i + k 1]\) and \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best} + k 1]\) is known.
To answer a short consistency check we first compute the common prefix of \(A'[j\mathinner {\ldotp \ldotp }j+ k - 1]\) and \(A'[j_{best}\mathinner {\ldotp \ldotp }j_{best}+ k-1]\) (which can be done using a near short consistency check) and compare its length with L. If it is smaller than min(L,k), then clearly the common prefix of \(A'[j\mathinner {\ldotp \ldotp }j+k-1]\) and \(A'[i \mathinner {\ldotp \ldotp }i + k -1]\) is shorter than k; if it equals L, then we naively compute the common prefix of \(A'[j + L \mathinner {\ldotp \ldotp }j + k-1]\) and \(A'[i+L \mathinner {\ldotp \ldotp }i+k-1]\) by letter-by-letter comparisons. Also, in such a case we switch j _{ best } to j, as it has at least as long a common prefix with \(A'[i \mathinner {\ldotp \ldotp }i + k-1]\).
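This reuse of j_best can be sketched as follows. It is a simplified single-step rendering, not the paper's Algorithm 8: the names `short_check`, `near_short_lcp` and the `state` dict are ours, the near short consistency check is passed in as a callback (in the full algorithm it is an LCA query on the block suffix trees), and we assume invariants (15a)–(15c) hold for the current i on entry.

```python
def short_check(a, i, j, k, state, near_short_lcp):
    """One short consistency check: is a[i..i+k-1] == a[j..j+k-1]?
    Assumes a[i..i+L-1] == a[j_best..j_best+L-1] and L <= k on entry."""
    jb, L = state['j_best'], state['L']
    ell = near_short_lcp(j, jb, k)   # lcp of a[j..] and a[j_best..], capped at k
    if ell < min(L, k):
        # a[j..] diverges from a[j_best..] before L, while a[i..] agrees with
        # a[j_best..] up to L, so a[i..] and a[j..] differ within the first k letters
        return False
    # j is now at least as good as j_best: adopt it, then extend L naively
    state['j_best'] = j
    while L < k and a[i + L] == a[j + L]:
        L += 1
    state['L'] = L
    return L == k
```

Each successful letter comparison in the while loop increases L, which in the amortised analysis below is paid for by the corresponding drop of the potential.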
Simplifying Assumption
To simplify the presentation and analysis, we assume that the adjusting of the last slope is done in a slightly different way than written in the code of AdjustLastSlope (see Algorithm 6): if the pin is assigned a value i′>i (in line 13 of Algorithm 6), first A[i′] is set to A[i]+(i′−i), i.e., its current implicit value, then it is verified whether \(A'[i'\mathinner {\ldotp \ldotp }n] = A'[A[i'] \mathinner {\ldotp \ldotp }A[i']+(n-i')] \) (and the result is ignored: even if it is an equality, we still treat it as a fail), and only after that A[i′] is assigned a valid value for π[i′]. Such a change can only increase the running time of the algorithm.
Invariants
During short consistency checks we make sure that the following invariants for j _{ best } and L are preserved:

(15a) L≤k;

(15b) \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best} + L-1] = A'[i \mathinner {\ldotp \ldotp }i+L-1]\);

(15c) j≤j _{ best }≤j+log^{2} n;

(15d) if j≠j _{ best }, then the common prefix of \(A'[i \mathinner {\ldotp \ldotp }i + k-1]\) and \(A'[j \mathinner {\ldotp \ldotp }j + k-1]\) is shorter than L;

(15e) L=k or A′[i+L]≠A′[j _{ best }+L].

We refer to them as (15a)–(15e).
The intuition behind the invariants is as follows: (15a) simply states that we are interested in a common prefix of length at most k. Invariant (15b) justifies the choice of j _{ best }, i.e., we know the common prefix of A′ starting at j _{ best } and at i. Invariant (15c) ensures that comparing A′ starting at j and at j _{ best } can be done using a near short consistency check. Invariant (15d) says that if j≠j _{ best } then there is a reason for that: \(A'[i \mathinner {\ldotp \ldotp }i + k-1]\) and \(A'[j \mathinner {\ldotp \ldotp }j + k-1]\) have a shorter common prefix than \(A'[i \mathinner {\ldotp \ldotp }i + k-1]\) and \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best} + k-1]\). Finally, (15e) expresses the maximality of L: either it equals k (so it cannot be larger) or there is a mismatch at the ‘next position’.
Potential
The analysis of the running time is amortised. We define the potential of a configuration of LinearValidateπ′ as
$$p = (k - L) + (j_{best} - j) . $$
Let Δx denote the change of the value of x in some fragment of the algorithm (which will always be clear from the context); let s be the cost of comparisons and near short consistency checks (i.e., their number). Then the amortised cost is Δp+s. There are some additional costs, like comparing indices, checking conditions, etc. All such costs are assigned either to letter-by-letter comparisons or to near short consistency checks.
Note that when the change of the potential is negative, it actually helps in paying for the near short consistency checks and letter-by-letter comparisons. Since 0≤L≤k and j≤j _{ best }≤j+log^{2} n, at any point the potential is nonnegative and \(\mathcal{O}(\log^{2} n)\), so the total cost up to any point is the sum of the amortised costs of the steps plus the potential, which is sublinear.
We pay for the amortised cost using credit that we get for the changes of n and j: for every increase Δn we get 8Δn units of credit, and for every change of j we get 8Δj units of credit. Clearly the sum of all Δn is n, so the increases of n yield \(\mathcal{O}(n)\) credit. We show that the sum of all Δj is also \(\mathcal{O} (n)\).
Lemma 11
The sum of all Δj over the whole run of Validateπ′ is at most 2n.
Proof
For the purpose of the proof, whenever we change the value of i or j let i′, j′ refer to the new values and i, j to the old ones.
It is enough to show that the sum of all increments of j is at most n: then clearly the sum of all decrements of j is at most n as well.
The value of j increases only when the pin i is updated in line 13 of AdjustLastSlope; otherwise it can only decrease. Moreover, when j is incremented, it increases by at most Δi:
$$\Delta j = j' - j = A[i'] - A[i] = i' - i = \Delta i . $$Note that in the third equality we essentially used the simplifying assumption: as i′ and i are on the same (last) slope, we have A[i′]=A[i]+(i′−i).
Since i≤n and i only increases, its sum of increments is at most n. So the total sum of increments of j is at most n, as claimed. □
LetterbyLetter Comparisons
The letter-by-letter comparisons (see Algorithm 7) are used to ensure that (15e) holds: when we already know that the L letters starting at A′[i] and at A′[j _{ best }] are the same, but we are not sure whether this is the maximal possible value of L, we verify it naively. The amortised cost is only 1, as each successful comparison decreases the potential by 1.
Lemma 12
If (15a)–(15c) are satisfied before LetterbyLetter, then (15a)–(15c) and (15e) are satisfied afterwards. The amortised cost of LetterbyLetter is 1.
Proof
For the purpose of the proof, let L _{0} be the initial value of L and L _{1} the final value of L; by ‘L’ we denote the value inside LetterbyLetter.
Note that i, j and k are not altered. For (15a): by assumption L _{0}≤k before LetterbyLetter; we increment L by 1 at a time and stop as soon as it reaches k, so L _{1}≤k. For (15b) note that \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best} + L_{0}-1] = A'[i\mathinner {\ldotp \ldotp }i+L_{0}-1]\) holds by the assumption and we verified \(A'[j_{best} + L_{0} \mathinner {\ldotp \ldotp }j_{best}+L_{1}-1] = A'[i+L_{0}\mathinner {\ldotp \ldotp }i+L_{1}-1]\) letter by letter. Invariant (15c) holds as neither j nor j _{ best } was changed. As for (15e), it is the termination condition of the while loop, so it holds upon termination.
Concerning the amortised cost: i, j, j _{ best } do not change, so Δp=−ΔL, i.e., it is nonpositive. On the other hand, we make ΔL successful letter-by-letter comparisons and perhaps one unsuccessful one (we ignore the costs of checking whether L=k, as they are at most as high as the costs of the letter-by-letter comparisons). So the cost of comparisons is at most ΔL+1. Hence the amortised cost is at most −ΔL+ΔL+1=1, as claimed. □
Answering Short Consistency Checks Using j _{ best }
When we get new values of i, j and k, we need to update j _{ best } and L. It turns out that as soon as we update j _{ best } and L so that they satisfy (15a)–(15c), answering the short consistency check is easy: we first make letter-by-letter comparisons using LetterbyLetter to ensure that also (15e) holds, i.e., that L is maximal. Then we check the length of the common prefix of \(A'[j \mathinner {\ldotp \ldotp }j + k-1]\) and \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best} + k-1]\) by a near short consistency check. If it is less than L, then the answer to the short consistency check is no. If it is at least L, then we set j _{ best } to j (as it is as good as j _{ best }), run the letter-by-letter comparison again to check whether L reaches k, and answer accordingly. It is easy to verify that the amortised cost of this procedure is constant and that all of (15a)–(15e) hold afterwards. Details are given in Algorithm 8 and the lemmata below.
Lemma 13
Assume that (15a)–(15c) are satisfied. Then CommonShortConsistencyCheck correctly answers the short consistency check, its amortised cost is at most 6 and all (15a)–(15e) hold after CommonShortConsistencyCheck.
Proof
Regarding the cost, the amortised cost of LetterbyLetter is 1 by Lemma 12, setting j _{ best } to j can only lower the potential, and NearShortConsistencyCheck is answered in constant time using suffix trees.
We now show that after CommonShortConsistencyCheck all (15a)–(15e) hold. By assumption (15a)–(15c) hold initially. By Lemma 12, after the first LetterbyLetter they still hold and additionally (15e) holds. Let ℓ denote the length of the common prefix computed by NearShortConsistencyCheck. Suppose that ℓ<L, in particular j≠j _{ best }. Then (15d) simply states that ℓ<L, which is the case. So suppose that ℓ≥L. Resetting j _{ best } to j may make (15e) invalid, but (15a)–(15c) are preserved: (15a) holds as we do not change L; (15b) holds as we know that \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best} + k - 1]\) has a common prefix of length L with both \(A'[j\mathinner {\ldotp \ldotp }j+k-1]\) and \(A'[i\mathinner {\ldotp \ldotp }i+k-1]\), and so also \(A'[j\mathinner {\ldotp \ldotp }j+k-1]\) and \(A'[i\mathinner {\ldotp \ldotp }i+k-1]\) have a common prefix of length L; (15c) holds trivially. By Lemma 12, (15a)–(15c) and (15e) hold after the second LetterbyLetter. Note that LetterbyLetter does not modify j and so (15d) trivially holds, as j=j _{ best }.
Concerning the correctness: if ℓ<L then j≠j _{ best } and from (15b) and (15d) we get that \(A'[j\mathinner {\ldotp \ldotp }j + L-1]\) and \(A'[i\mathinner {\ldotp \ldotp }i + L-1]\) are different. Since by (15a) we know that L≤k, also \(A'[j\mathinner {\ldotp \ldotp }j + k - 1]\) and \(A'[i\mathinner {\ldotp \ldotp }i + k - 1]\) are different. This justifies the no answer. If ℓ≥L then in the end j=j _{ best } and so by (15a)–(15b) and (15e) we know that \(A'[j\mathinner {\ldotp \ldotp }j+k-1] = A'[i\mathinner {\ldotp \ldotp }i+k-1]\) if and only if L=k, which is exactly the answer returned by the algorithm. □
It remains to show how to update j _{ best } and L.
Types of Short Consistency Checks
The way we update j _{ best } and L depends on why the short consistency check is made; we distinguish three situations in which AdjustLastSlope invokes short consistency check:
(Type 1): This is the first iteration of AdjustLastSlope and PinValueCheck did not return any index in this iteration.
(Type 2): This is not the first iteration of AdjustLastSlope and PinValueCheck did not return any index in this iteration.
(Type 3): PinValueCheck did return an index in this iteration.
We begin by showing how i, j, k and n change in each of those types of short consistency check.
Lemma 14
In Type 1 short consistency check it holds that Δi=Δj=0, Δk≥0 and Δn≥max(1,Δk); exactly 8Δn units of credit are issued.
In Type 2 short consistency check it holds that Δi=0, Δj<0 and Δk=Δn=0; exactly −8Δj units of credit are issued.
In Type 3 short consistency check it holds that Δi>0, Δj=Δi and −Δi≤Δk≤0, Δn≥0; exactly 8Δj+8Δn units of credit are issued.
Note in particular that when Δi and Δj are known, we can figure out which type of query this is: a Type 3 short consistency check is the unique one with Δi>0, a Type 2 the one with Δj<0, while a Type 1 has Δi=Δj=0.
Proof
Recall that we issue 8Δn+8|Δj| units of credit, which yields the claim on the amount of credit issued in each of the cases.
Type 1 short consistency check: since this is the first iteration of AdjustLastSlope, we read A′[n] and it is not equal to A′[A[n]]. In particular, since the last invocation of AdjustLastSlope we read at least one additional value of A′. Hence Δn≥1. As PinValueCheck did not return any index, we have not modified i and j since the last invocation of the short consistency check, so Δi=Δj=0. Concerning k, recall that the short consistency check is asked only on \(A'[i \mathinner {\ldotp \ldotp }\min(n,i + \log^{2}n-1)]\), i.e. k=min(n−i+1,log^{2} n). Hence, when k _{0} and n _{0} are the values of k and n when the previous short consistency check was asked, we have k _{0}=min(n _{0}−i+1,log^{2} n) (note that we can assume that logn and logn _{0} are the same, as we repeat the calculation as soon as ⌈logn⌉ increases). Then k≥k _{0} and Δk≤Δn, but there is no guarantee that Δk>0, i.e., k _{0}=k can happen when n _{0}−i+1>log^{2} n.
Type 2 short consistency check: in this case the short consistency check is asked in an iteration of AdjustLastSlope that is not the first one, and PinValueCheck did not return any index in this iteration, which means that A[i] is assigned the next candidate in line 14. Thus i and k are unchanged as compared to the previous short consistency check, while j is decreased; hence Δi=0, Δj<0 and Δk=0. Furthermore, we do not read any new value of A′, so Δn=0.
Type 3 short consistency check: in this case the short consistency check is run for the same slope, but the pin is moved, thus the new value i′ is larger than the old i. By our simplifying assumption we do not decrease the last slope, just place the new i′ on it, i.e. we set A[i′]=A[i]+(i′−i); in other words, we take the new j such that Δj=Δi. As n only increases, Δn≥0. Concerning k, recall again that k=min(n−i+1,log^{2} n), hence 0≥Δk≥−Δi. □
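The classification in Lemma 14 can be mirrored by a trivial dispatcher (illustrative Python; the actual algorithm knows the type from its control flow rather than from the deltas):

```python
def check_type(delta_i, delta_j):
    # Type 3 is the only one with Delta i > 0, Type 2 the only one with
    # Delta j < 0; otherwise Delta i = Delta j = 0 and it is Type 1.
    if delta_i > 0:
        return 3
    if delta_j < 0:
        return 2
    return 1
```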
In the following, we describe how to update j _{ best } and L in those three different cases so that (15a)–(15c) are preserved.
Type 1 Updates
In this case we do not need any update, as described in Algorithm 9.
Lemma 15
Suppose that we are to make Type 1 short consistency check and all (15a)–(15e) hold. Then (15a)–(15c) are preserved and the amortised cost is at most Δn.
Proof
Let us inspect the change of potential. By Lemma 14 we know that Δj=0 and Δk≤Δn, and we change neither j _{ best } nor L, so ΔL=Δj _{ best }=0. Hence
$$\Delta p = \Delta k - \Delta L + (\Delta j_{best} - \Delta j) = \Delta k \leq\Delta n . $$
Concerning the invariants: as L is unchanged and Δk≥0 by Lemma 14, invariant (15a) is preserved. Similarly, since we do not change j, j _{ best } or L, (15b)–(15c) are preserved. □
This allows calculating the whole cost of answering Type 1 short consistency check.
Corollary 3
For Type 1 short consistency checks, the amortised cost of Type1Updatej _{ best } and CommonShortConsistencyCheck is covered by the issued credit. Type1Updatej _{ best } followed by CommonShortConsistencyCheck preserves (15a)–(15e) and returns the correct answer to the short consistency check.
Proof
By Lemma 15 the update of j _{ best } and L has amortised cost at most Δn. By Lemma 13 the amortised cost of CommonShortConsistencyCheck is at most 6. On the other hand, by Lemma 14, at least 8Δn≥6+Δn units of credit are issued, which suffices to pay for the amortised cost.
Concerning the correctness, by Lemma 15 invariants (15a)–(15c) are satisfied after Type1Updatej _{ best }, which by Lemma 13 means that after CommonShortConsistencyCheck all (15a)–(15e) hold and the answer to the short consistency check is correct. □
Type 2 Updates
Since j is decreased, it might be that j and j _{ best } no longer satisfy (15c) (as j+log^{2} n<j _{ best }). In such a case we set j _{ best }←j and L←0, see Algorithm 10.
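In sketch form, Type2Updatej _{ best } is just this guard (illustrative Python; names and signature are ours):

```python
def type2_update(j, j_best, L, log2n):
    """After j decreased: if j_best drifted more than log^2 n ahead of j,
    invariant (15c) would fail, so restart the cached pair from scratch."""
    if j + log2n < j_best:
        return j, 0        # reset: j_best <- j, L <- 0
    return j_best, L       # nothing to do, (15a)-(15c) still hold
```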
Lemma 16
Assume that all (15a)–(15e) hold and we are to make a Type 2 short consistency check. Then after Type2Updatej _{ best } the invariants (15a)–(15c) are preserved. The amortised cost is at most 1−Δj.
Proof
Suppose that j+log^{2} n≥j _{ best }. Invariants (15a)–(15b) hold by assumption, as none of i, j _{ best }, L and k was modified. For (15c) note that j _{ best }≤j+log^{2} n holds by case assumption and j≤j _{ best } held by assumption even before the decrement of j, so it holds now as well.
The change of the potential: by Lemma 14, we know that Δi=Δk=Δn=0 and Δj<0. Since L and j _{ best } were not changed, we have
$$\Delta p = \Delta k - \Delta L + (\Delta j_{best} - \Delta j) = -\Delta j . $$
The cost is 1 for the comparison and so the amortised cost is 1−Δj.
If j+log^{2} n<j _{ best } then after setting j _{ best }←j and L←0 the invariants (15a)–(15c) trivially hold. The change of potential is
$$\Delta p = \Delta k - \Delta L + (\Delta j_{best} - \Delta j) . $$
By Lemma 14 we know that Δk=0. As j+log^{2} n<j _{ best } we obtain that Δj _{ best }<−log^{2} n. Since L was reset to 0 we have −ΔL=−(−L _{0})=L _{0}, where L _{0} was the previous value of L. We know that L _{0}≤k≤log^{2} n and so
$$\Delta p < \log^{2} n - \log^{2} n - \Delta j = -\Delta j . $$
There is an additional cost of 1 for the comparison of j and j _{ best } (we hide the cost of changing j _{ best } and L in it). Hence the amortised cost is at most 1−Δj. □
Corollary 4
For Type 2 short consistency checks, the amortised cost of Type2Updatej _{ best } and CommonShortConsistencyCheck is covered by the issued credit. Type2Updatej _{ best } followed by CommonShortConsistencyCheck preserves (15a)–(15e) and correctly answers the short consistency check.
Proof
By Lemma 16, the update of j _{ best } and L has amortised cost at most 1−Δj. By Lemma 13, the amortised cost of CommonShortConsistencyCheck is at most 6. On the other hand, by Lemma 14, −8Δj≥7−Δj units of credit are issued, which suffices to pay for the amortised cost.
Concerning the correctness, by Lemma 16 after Type2Updatej _{ best } invariants (15a)–(15c) hold and so by Lemma 13 after CommonShortConsistencyCheck all (15a)–(15e) hold and the answer to the short consistency check is correct. □
Type 3 Updates
It is left to show how to update j _{ best } and L in the Type 3 short consistency check, see Algorithm 11. In this case both j and i were increased by the same value Δj, see Lemma 14. This means that the new \(A'[j \mathinner {\ldotp \ldotp }j + k - 1]\) and \(A'[i \mathinner {\ldotp \ldotp }i + k - 1]\) are suffixes of the old ones. In particular, \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best} + L - 1]\) has nothing to do with the new \(A'[i \mathinner {\ldotp \ldotp }i + L - 1]\); still, if we also increase j _{ best } by Δj then the new \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best} + L - 1]\) is also a suffix of the old one. Unfortunately, as every array we consider is a suffix of the old one, we have to decrease L by Δj as well. If this turns L nonpositive then \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best} + L - 1]\) is empty and we reset j _{ best } to j and L to 0.
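The shift-and-shrink update just described amounts to the following sketch (illustrative Python; delta_j stands for the common increase Δj of i and j from Lemma 14):

```python
def type3_update(j, j_best, L, delta_j):
    """Shift j_best along with i and j, shrinking the verified prefix;
    if nothing of it survives, restart from j."""
    j_best += delta_j      # new window is a suffix of the old one
    L -= delta_j           # so the verified common prefix shrinks too
    if L <= 0:
        return j, 0        # reset: j_best <- j, L <- 0
    return j_best, L
```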
Lemma 17
Suppose that (15a)–(15e) hold and we are to make Type 3 short consistency check. Then Type3Updatej _{ best } preserves (15a)–(15c). The amortised cost is at most 1+Δj.
Proof
Consider the case in which j _{ best } and L are not reset. By Lemma 14 we get that k is decreased by at most Δj, while we decrease L by Δj, hence (15a) is preserved. Concerning (15b) let L′, i′ and \(j_{best}'\) be the previous values of L, i and j _{ best }. Then \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best}+L-1]\) is a suffix of \(A'[j_{best}' \mathinner {\ldotp \ldotp }j_{best}'+L'-1]\) and \(A'[i\mathinner {\ldotp \ldotp }i+L-1]\) is a suffix of \(A'[i' \mathinner {\ldotp \ldotp }i'+L'-1]\). Hence \(A'[j_{best} \mathinner {\ldotp \ldotp }j_{best}+L-1] = A'[i\mathinner {\ldotp \ldotp }i+L-1]\) follows from \(A'[j_{best}' \mathinner {\ldotp \ldotp }j_{best}'+L'-1] = A'[i'\mathinner {\ldotp \ldotp }i'+L'-1]\). So (15b) is preserved. For (15c) note that we incremented j and j _{ best } by the same value Δj, so (15c) is preserved.
Concerning the change of potential in this case,
$$\Delta p = \Delta k - \Delta L + (\Delta j_{best} - \Delta j) . $$
By Lemma 14 we have Δk≤0. We decrease L by Δj, so ΔL=−Δj, and increase j _{ best } by Δj, so Δj _{ best }=Δj. Hence
$$\Delta p = \Delta k + \Delta j + (\Delta j - \Delta j) = \Delta k + \Delta j \leq\Delta j . $$
The additional cost is 1 for the test, so the amortised cost is at most Δj+1.
Now consider the case in which after the decrement by Δj the value of L is nonpositive, i.e., we reset j _{ best } to j and L to 0. Then (15a)–(15b) hold trivially, as L=0, and (15c) holds because j=j _{ best }. Concerning the cost, we pay 1 for comparisons and the change of potential is:
$$\Delta p = \Delta k - \Delta L + (\Delta j_{best} - \Delta j) . $$
By Lemma 14 we have Δk≤0. Since decreasing L by Δj made it nonpositive and then we set it to 0, i.e., increased L, we have ΔL≥−Δj, i.e., −ΔL≤Δj. Lastly, j _{ best }−j is now equal to 0 and used to be nonnegative by (15c), so Δj _{ best }−Δj≤0. Hence
$$\Delta p \leq0 + \Delta j + 0 = \Delta j . $$
So the amortised cost is at most 1+Δj. □
Corollary 5
For Type 3 short consistency checks, the amortised cost of Type3Updatej _{ best } and CommonShortConsistencyCheck is covered by the issued credit. Type3Updatej _{ best } followed by CommonShortConsistencyCheck preserves (15a)–(15e) and returns the correct answer to the short consistency check.
Proof
By Lemma 17 the update of j _{ best } and L has amortised cost at most 1+Δj. By Lemma 13 the amortised cost of CommonShortConsistencyCheck is at most 6. On the other hand, by Lemma 14, at least 8Δj≥7+Δj units of credit are issued, which suffices to pay for the amortised cost.
Concerning the correctness, by Lemma 17 after Type3Updatej _{ best } invariants (15a)–(15c) hold and so by Lemma 13 after CommonShortConsistencyCheck all (15a)–(15e) hold and furthermore the answer to the short consistency check is correct. □
In the end, the short consistency check is performed as follows: depending on its type, we run one of Type1Updatej _{ best }, Type2Updatej _{ best } or Type3Updatej _{ best }. Afterwards we apply CommonShortConsistencyCheck. By Corollaries 3–5 the answer returned to the short consistency check is correct and the issued credit covers the whole cost. Since the issued credit is linear, we are done.
Running Time
LinearValidateπ′ runs in \(\mathcal{O} (n)\) time: constructing the suffix trees, performing the consistency checks and performing the pin value checks all take \(\mathcal{O} (n)\) time.
9 Remarks and Open Problems
While Validateπ produces the word w over the minimum alphabet such that π _{ w }=A online, this is not the case with Validateπ′ and LinearValidateπ′. At each time step both these algorithms can output a word over the minimum alphabet such that \(\pi'_{w}=A'\), but the letters assigned to positions on the last slope may yet change as further entries of A′ are read.
Since Validateπ′ and LinearValidateπ′ keep the function \(\pi[1\mathinner {\ldotp \ldotp }n+1]\) after reading \(A'[1 \mathinner {\ldotp \ldotp }n]\), virtually no changes are required to adapt them to g validation, where g[i]=π′[i−1]+1 is the function considered by Duval et al. [8], because \(A'[1 \mathinner {\ldotp \ldotp }n-1]\) can be obtained from \(g[1 \mathinner {\ldotp \ldotp }n]\). Running Validateπ′ or LinearValidateπ′ on such A′ gives \(A[1 \mathinner {\ldotp \ldotp }n]\) that is consistent with \(A'[1 \mathinner {\ldotp \ldotp }n-1]\) and \(g[1\mathinner {\ldotp \ldotp }n]\). A similar proof shows that \(A[1\mathinner {\ldotp \ldotp }n]\) and \(g[1 \mathinner {\ldotp \ldotp }n]\) require the same minimum size of the alphabet.
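The conversion from g to A′ mentioned above is a one-liner; a sketch (the 0-indexed Python list g_arr stands for g[1..n], i.e. g_arr[i-1]=g[i], and the helper name is ours):

```python
def g_to_a_prime(g_arr):
    """Recover A'[1..n-1] from g[1..n] using g[i] = pi'[i-1] + 1.
    Input and output are 0-indexed Python lists."""
    return [g_arr[i] - 1 for i in range(1, len(g_arr))]
```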
Two interesting questions remain: is it possible to remove the suffix trees and LCA queries from our algorithm without hindering its time complexity? We believe that deeper combinatorial insight might result in a positive answer.
References
Breslauer, D., Colussi, L., Toniolo, L.: On the comparison complexity of the string prefix-matching problem. J. Algorithms 29(1), 18–67 (1998)
Clément, J., Crochemore, M., Rindone, G.: Reverse engineering prefix tables. In: Proceedings of 26th STACS, pp. 289–300 (2009). http://drops.dagstuhl.de/opus/volltexte/2009/1825
Cole, R., Hariharan, R.: Dynamic LCA queries on trees. In: Proceedings of SODA ’99, pp. 235–244. Society for Industrial and Applied Mathematics, Philadelphia (1999)
Crochemore, M., Rytter, W.: Jewels of Stringology. World Scientific Publishing Company, Singapore (2002)
Crochemore, M., Hancart, C., Lecroq, T.: Algorithms on Strings. Cambridge University Press, Cambridge (2007)
Crochemore, M., Iliopoulos, C., Pissis, S., Tischler, G.: Cover array string reconstruction. In: CPM 2010. Lecture Notes in Computer Science, vol. 6129, pp. 251–259. Springer, Berlin (2010)
Dietzfelbinger, M., Karlin, A.R., Mehlhorn, K., auf der Heide, F.M., Rohnert, H., Tarjan, R.E.: Dynamic perfect hashing: upper and lower bounds. SIAM J. Comput. 23(4), 738–761 (1994)
Duval, J.P., Lecroq, T., Lefebvre, A.: Efficient validation and construction of Knuth–Morris–Pratt arrays. In: Conference in Honor of Donald E. Knuth (2007)
Duval, J.P., Lecroq, T., Lefebvre, A.: Efficient validation and construction of border arrays and validation of string matching automata. RAIRO Theor. Inform. Appl. 43(2), 281–297 (2009). doi:10.1051/ita:2008030
Farach, M.: Optimal suffix tree construction with large alphabets. In: Proceedings of FOCS ’97, pp. 137–143. IEEE Computer Society, Washington (1997)
Franěk, F., Gao, S., Lu, W., Ryan, P.J., Smyth, W.F., Sun, Y., Yang, L.: Verifying a border array in linear time. J. Comb. Math. Comb. Comput. 42, 223–236 (2002)
Fredman, M.L., Willard, D.E.: Trans-dichotomous algorithms for minimum spanning trees and shortest paths. J. Comput. Syst. Sci. 48(3), 533–551 (1994). doi:10.1016/S0022-0000(05)80064-9
Hancart, C.: On Simon’s string searching algorithm. Inf. Process. Lett. 47(2), 95–99 (1993)
I, T., Inenaga, S., Bannai, H., Takeda, M.: Counting parameterized border arrays for a binary alphabet. In: Proc. of the 3rd LATA, pp. 422–433 (2009). doi:10.1007/978-3-642-00982-2_36
I, T., Inenaga, S., Bannai, H., Takeda, M.: Verifying and enumerating parameterized border arrays. Theor. Comput. Sci. 412(50), 6959–6981 (2011). doi:10.1016/j.tcs.2011.09.008
Karp, R.M., Miller, R.E., Rosenberg, A.L.: Rapid identification of repeated patterns in strings, trees and arrays. In: STOC ’72: Proceedings of the Fourth Annual ACM Symposium on Theory of Computing, pp. 125–136. ACM, New York (1972). doi:10.1145/800152.804905
Knuth, D.E., Morris, J.H. Jr., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)
Matiyasevich, Y.: Real-time recognition of the inclusion relation. J. Sov. Math. 1, 64–70 (1973). Published (in Russian) in Zap. Nauc̆. Semin. POMI, 20, 104–114 (1971)
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976). doi:10.1145/321941.321946
Moore, D., Smyth, W.F., Miller, D.: Counting distinct strings. Algorithmica 23(1), 1–13 (1999). http://link.springer.de/link/service/journals/00453/bibs/23n1p1.html
Morris, J.H. Jr., Pratt, V.R.: A linear pattern-matching algorithm. Tech. Rep. 40, University of California, Berkeley (1970)
Pagh, R., Rodler, F.F.: Cuckoo hashing. J. Algorithms 51(2), 122–144 (2004). doi:10.1016/j.jalgor.2003.12.002
Simon, I.: String matching algorithms and automata. In: Results and Trends in Theoretical Computer Science. LNCS, vol. 812, pp. 386–395. Springer, Berlin (1994)
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995). doi:10.1007/BF01206331
Acknowledgements
This work was partially supported by Polish Ministry of Science and Higher Education under grants N N206 1723 33, 2007–2010; Łukasz Jeż was also partially supported by the Israeli Centers of Research Excellence (ICORE) program, Center No. 4/11.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Gawrychowski, P., Jeż, A. & Jeż, Ł. Validating the Knuth-Morris-Pratt Failure Function, Fast and Online. Theory Comput Syst 54, 337–372 (2014). https://doi.org/10.1007/s00224-013-9522-8