Algorithms and Complexity on Indexing Founder Graphs

We study the problem of matching a string in a labeled graph. Previous research has shown that unless the Orthogonal Vectors Hypothesis (OVH) is false, one cannot solve this problem in strongly sub-quadratic time, nor index the graph in polynomial time to answer queries efficiently (Equi et al. ICALP 2019, SOFSEM 2021). These conditional lower bounds cover even deterministic graphs with binary alphabet, but there naturally also exist graph classes that are easy to index: For example, Wheeler graphs (Gagie et al. Theor. Comp. Sci. 2017) cover graphs admitting a Burrows-Wheeler transform-based indexing scheme. However, it is NP-complete to recognize if a graph is a Wheeler graph (Gibney, Thankachan, ESA 2019). We propose an approach to alleviate the construction bottleneck of Wheeler graphs. Rather than starting from an arbitrary graph, we study graphs induced from multiple sequence alignments (MSAs). Elastic degenerate strings (Bernardini et al. SPIRE 2017, ICALP 2019) can be seen as such graphs, and we introduce here their generalization: elastic founder graphs. We first prove that even such induced graphs are hard to index under OVH. Then we introduce two subclasses, repeat-free and semi-repeat-free graphs, that are easy to index. We give a linear time algorithm to construct a repeat-free (non-elastic) founder graph from a gapless MSA, and (parameterized) near-linear time algorithms to construct a semi-repeat-free (repeat-free, respectively) elastic founder graph from a general MSA. Finally, we show that repeat-free founder graphs admit a reduction to Wheeler graphs in polynomial time.


Introduction
In string research, many different problems relate to the common question of how to handle a collection of strings. When such a collection contains very similar strings, it can be represented as some "high scoring" Multiple Sequence Alignment (MSA), i.e., as a matrix MSA [1..m, 1..n] whose m rows are the individual strings each of length n, which may include special "gap" symbols such that the columns represent the aligned positions. While it is NP-hard to find an optimal MSA even under the simplest score of maximizing the number of identity columns (i.e., longest common subsequence length) [32], the central role of MSA as a model of biological evolution has resulted in numerous heuristics to solve this problem in practice [14]. In this paper, we assume an MSA as an input.
A simple way to define the problem of finding a match for a given string in the MSA is to ask whether the string matches a substring of some row (ignoring gap symbols). This leads to the widely studied problem of indexing repetitive text collections, see, e.g., references [23,24,34,38,39,40,41]. These approaches, which reduce an MSA to plain text, yield algorithms with linear time complexities. However, the performance of these algorithms relies on the fact that a match is always found within an individual row of the MSA.
One feature worth considering is the possibility to allow a match to jump from any row to any other row of the MSA between consecutive columns. This property is usually referred to as recombination due to its connection to evolution. To solve this version of the problem, different approaches have to be used, and a possible alternative is a graph representation of the MSA. Figure 1a shows a simple solution, which consists in turning distinct characters of each column into nodes, and then adding the edges supported by row-wise connections. In this graph, a path whose concatenation of node labels matches a given string represents a match in the original MSA (ignoring gaps). Refinements of this approach are mostly used in bioinformatics [37], where recombination is a desired feature, and it is realized by the fact that the resulting graph encodes a super-set of the strings of the original MSA.
Aligning a sequence against a graph is not a trivial task. Only quadratic solutions are known [4,35,43], and this was recently proved to be a conditional lower bound for the problem [17]. Moreover, even attempting to index the graph to query the string faster presents significant difficulties. On one hand, indexes constructed in polynomial time still require quadratic-time queries in the worst case [49]. On the other hand, worst-case linear-time queries are possible, but this has the potential to make the index grow exponentially [48]. These might be the best results possible for general graphs and DAGs without any specific structural property, as the need for exponential indexing time to achieve sub-quadratic time queries constitutes another conditional lower bound for the problem [18].
Thus, if we want to achieve better performances, we have to make more assumptions on the structure of the input, so that the problem might become tractable. Following this line, a possible solution consists in identifying special classes of graphs that, while still able to represent any MSA, have a more limited amount of recombination, thus allowing for fast matching or fast indexing. This is the case for Elastic Degenerate Strings (EDSs) [5,9,10,11,28], which can represent an MSA as a sequence of sets of strings, in which a match can span consecutive sets, using any one string in each of these (see Figure 1b, graph in the center). The advantage of this structure is that it is possible to perform expected-case subquadratic time queries [9]. However, EDSs are still hard to index [25], and there is a lack of results on how to derive a "suitable" EDS from an MSA.
In this context, we propose a generalization of an EDS to what we call an Elastic Founder Graph (EFG). An EFG is a DAG that, as an EDS, represents an MSA as a sequence of sets of strings; each set is called a block, and each string inside a block is represented as a labeled node. The difference with EDSs is that the nodes of two consecutive blocks are not forced to be fully connected. This means that, while in an EDS a match can always pair any string of a set with any string of the next set, in an EFG it might be the case that only some of these pairings are allowed. Figure 1b illustrates these differences. Allowing for more selective connectivity between consecutive blocks also means that finding a match for a string in an EFG is harder than in an EDS. This is because EDSs are a special case of EFGs, hence the hardness results for the former carry to the latter. Specifically, a previous work [26] showed that, under the Orthogonal Vectors Hypothesis (OVH), no index for EDSs constructed in polynomial time can provide queries in time $O(|Q| + |T|^{\delta} |Q|^{\beta})$, where |T| is the number of sets of strings, |Q| is the length of the pattern and β < 1 or δ < 1. Nevertheless, in this work we present an even tighter quadratic lower bound for EFGs, proving that, under OVH, an index built in time $O(|E|^{\alpha})$ cannot provide queries in time $O(|Q| + |E|^{\delta} |Q|^{\beta})$, where |E| is the number of edges and β < 1 or δ < 1. Notice that |T| could even be o(|E|) (e.g. an EFG of two fully connected blocks), hence our lower bound more closely relates to the total size of an EFG. Additionally, the earlier lower bound [26] naturally applies only to indexing EDSs, and is obtained by performing many hypothetical fast queries; ours is derived by first proving a quadratic OVH-based lower bound for the online string matching problem in EFGs, and then using a general result [18] to simply translate this into an indexing lower bound.
Then, in order to break through these lower bounds, we identify two natural classes of EFGs, which respect what we call repeat-free and semi-repeat-free properties. The repeat-free property (Figure 1c) forces each string in each block to occur only once in the entire graph, and the semi-repeat-free property (Figure 1d) is a weaker form of this requirement. Thanks to these properties, we can more easily locate substrings of a query string in repeat-free EFGs and semi-repeat-free EFGs. In particular, (semi-)repeat-free EFGs and EDSs can be indexed in polynomial time for linear time string matching.
One might think that these time speedups come with a significant cost in terms of flexibility. Instead, the special structure of these EFGs does not hinder their expressive power. Indeed, we show that an MSA can be "optimally" segmented into blocks inducing a repeat-free or semi-repeat-free EFG. Clearly, this depends on how one chooses to define optimality. We consider three optimality notions: maximum number of blocks, minimum maximum block height, and minimum maximum block length. In Figure 1d, the first score is 3, second is 3, and the third is 5. The two latter notions stem from the earlier work on segmentations [13,42], now combined with the (semi-)repeat-free constraint. The first is the simplest optimality notion, now making sense combined with the (semi-)repeat-free constraint.
For each of these optimality notions, we give a polynomial-time dynamic programming algorithm that converts an MSA into an optimal (semi-)repeat-free EFG if one exists. For the first and the third notion combined with the semi-repeat-free constraint, we derive more involved solutions with almost optimal O(mn log m) and O(mn log m + n log log n) running time, respectively. Furthermore, we give an (optimal) O(mn) time solution for the special case of MSA without gap symbols. The algorithm for the special case uses a monotonicity property not holding with gaps. With general MSAs we delve into the combinatorial properties of repetitive string collections synchronized with gaps and show how to use string data structures in this setting. The techniques can be easily adapted for other notions of optimality.
Another class of graphs that admits efficient indexing are Wheeler graphs [22], which offer an alternative way to model an EFG and thus an MSA. However, it is NP-complete to recognize if a given graph is a Wheeler graph [26], and thus, to use the efficient algorithmic machinery around Wheeler graphs [2] one needs to limit the focus on indexable graphs that admit efficient construction. Indeed, we show that any EFG that respects the repeat-free property can be reduced to a Wheeler graph in polynomial time. Interestingly, we were not able to modify this reduction to cover the semi-repeat-free case, leaving it open if these two notions of graph indexability have indeed different expressive power, and whether there are more graph classes with distinctive properties in this context.

Figure 1. MSA rows: AGCGACTAGATAC, AGC-ACTAG-TAG, AGCGATTAGTTAC.
(a) A column-by-column segmentation of an MSA on the left, leading to the variation graph on the right.
(b) A different segmentation of the MSA, leading to the EDS in the center, and the EFG on the right. Notice that in an EDS every node is connected with all nodes to the right, while in an EFG edges are added only if their endpoints are consecutive in some row of the MSA (as in the case of variation graphs).
(c) A segmentation of the MSA that leads to a repeat-free EFG (i.e. no node label has another occurrence on some path of the EFG).
(d) A segmentation of the MSA that leads to a semi-repeat-free EFG (i.e. no node label has another occurrence on some path of the EFG, except as a prefix of another node in the same segment). An occurrence of query Q = CGACTAGTA in EFG is depicted in red. As can be seen, such query does not have an occurrence in a single row of the MSA. Notice that in all graphs (except the EDS) edges are added only between nodes that are observed as consecutive in some row of the MSA.

The paper is structured as follows. At the high level, we first focus on gapless MSAs, and then we extend the results to the general case. In more detail, Section 2 defines the founder graph concepts and explores some basic techniques. Section 3 gives the first indexing results using just classical data structures as a warm up. Section 4 covers linear time construction of repeat-free (non-elastic) founder graphs from gapless MSAs. Section 5 improves over the basic indexing results using succinct data structures. Section 6 gives the proof of conditional indexing hardness when moving from the gapless case to the general case of EFGs. Indexing results are generalized to (semi-)repeat-free EFGs in Section 7. Construction results are generalized to (semi-)repeat-free EFGs in Section 8. Connection to Wheeler graphs is considered in Section 9. Implementation is discussed in Section 10. Finally, future directions are discussed in Section 11.

Definitions and basic tools

Strings
We denote integer intervals by [i..j]. Let Σ = {1, . . . , σ} be an alphabet of size |Σ| = σ. A string T[1..n] is a sequence of symbols from Σ, i.e. $T \in \Sigma^n$, where $\Sigma^n$ denotes the set of strings of length n under the alphabet Σ. A suffix of string T[1..n] is T[i..n] for 1 ≤ i ≤ n, a prefix is T[1..i] for 1 ≤ i ≤ n, and a substring is T[i..j] for 1 ≤ i ≤ j ≤ n. The length of a string T is denoted |T|. The empty string is the string of length 0. In particular, substring T[i..j] where j < i is the empty string. The lexicographic order of two strings A and B is naturally defined by the order of the alphabet: A < B if and only if A[1..i] = B[1..i] and A[i + 1] < B[i + 1] for some i ≥ 0. If i + 1 > min(|A|, |B|), then the shorter one is regarded as smaller. However, we usually avoid this implicit comparison by adding end marker 0 to the strings. Concatenation of strings A and B is denoted AB.

Elastic founder graphs
As mentioned in the introduction, our goal is to compactly represent an MSA using an elastic founder graph. In this section we formalize these concepts.
A multiple sequence alignment MSA[1..m, 1..n] is a matrix with m strings drawn from Σ ∪ {-}, each of length n, as its rows. Here - ∉ Σ is the gap symbol. For a string $X \in (\Sigma \cup \{-\})^*$, we denote spell(X) the string resulting from removing the gap symbols from X. Let P be a partitioning of [1..n], that is, a sequence of subintervals $P = [x_1..y_1], [x_2..y_2], \ldots, [x_b..y_b]$, where $x_1 = 1$, $y_b = n$, and for all $j \ge 2$, $x_j = y_{j-1} + 1$. A segmentation S of MSA[1..m, 1..n] based on partitioning P is a sequence of b sets $S^k = \{\mathrm{spell}(\mathrm{MSA}[i, x_k..y_k]) \mid 1 \le i \le m\}$ for 1 ≤ k ≤ b; in addition, we require for a (proper) segmentation that $\mathrm{spell}(\mathrm{MSA}[i, x_k..y_k])$ is not an empty string for any i and k. We call set $S^k$ a block, while $\mathrm{MSA}[1..m, x_k..y_k]$ or just $[x_k..y_k]$ is called a segment. The length of block $S^k$ is $L(S^k) = y_k - x_k + 1$ and the height of block $S^k$ is $H(S^k) = |S^k|$. Segmentation naturally leads to the definition of a founder graph through the block graph concept:
Definition 1 (Block Graph). A block graph is a graph G = (V, E, ℓ) where $\ell : V \to \Sigma^+$ is a function that assigns a string label to every node and for which the following properties hold.
With gapless MSAs, block $S^k$ equals segment $\mathrm{MSA}[1..m, x_k..y_k]$, and in that case the founder graph is a block graph induced by segmentation S. The idea is to have a graph in which the nodes represent the strings in S while the edges retain the information of how such strings can be recombined to spell any sequence in the original MSA. With general MSAs with gaps, we consider the following extension, with an analogy to EDSs [9]:
Definition 2 (Elastic block and founder graphs). We call a block graph elastic if its third condition is relaxed in the sense that each $V^i$ can contain non-empty variable-length strings. An elastic founder graph (EFG) is an elastic block graph G(S) = (V, E, ℓ) induced by a segmentation S: its nodes are labeled by the strings of the blocks and, as illustrated in Figure 1, its edges connect nodes of consecutive blocks that are observed as consecutive in some row of the MSA.
By definition, (elastic) founder and block graphs are acyclic. By convention, we interpret the direction of the edges as going from left to right. Consider a path P in G(S) between any two nodes. The label ℓ(P) of P is the concatenation of labels of the nodes in the path. Let Q be a query string. We say that Q occurs in G(S) if Q is a substring of ℓ(P) for some path P of G(S). Figure 1 illustrates such a query.
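To make the construction concrete, the following is a minimal Python sketch of how blocks and edges can be derived from an MSA and a partitioning. The names `spell` (mirroring the text) and `build_efg`, and the representation of a node as a (block index, label) pair, are assumptions made for illustration. The rows are those read from Figure 1, and the partitioning [1..4], [5..8], [9..13] is just one possible choice.

```python
def spell(s):
    """Remove gap symbols from a piece of an MSA row."""
    return s.replace("-", "")


def build_efg(msa, segments):
    """Return (blocks, edges) for 1-based inclusive segments [(x_1, y_1), ...].

    A node is the pair (block index, label); an edge joins two nodes of
    consecutive blocks whenever some row spells both labels in a row.
    """
    blocks = []
    edges = set()
    previous = None
    for k, (x, y) in enumerate(segments):
        labels = [spell(row[x - 1:y]) for row in msa]
        if any(label == "" for label in labels):
            raise ValueError("not a proper segmentation: empty block string")
        blocks.append(sorted(set(labels)))
        if previous is not None:
            for left, right in zip(previous, labels):
                edges.add(((k - 1, left), (k, right)))
        previous = labels
    return blocks, edges


msa = ["AGCGACTAGATAC", "AGC-ACTAG-TAG", "AGCGATTAGTTAC"]
blocks, edges = build_efg(msa, [(1, 4), (5, 8), (9, 13)])
for block in blocks:
    print(block)
print(len(edges), "edges")
```

With this partitioning the query CGACTAGTA is spelled by the path AGCG, ACTA, GTAG, which combines the first and second rows.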
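As we later learn, some further properties on founder graphs are needed for supporting fast queries. We call G(S) repeat-free if each ℓ(v) for v ∈ V occurs in G(S) only as a prefix of paths starting with v. We also consider a variant that is relevant due to variable-length strings in the blocks: we call G(S) semi-repeat-free if each ℓ(v) for v ∈ V occurs in G(S) only as a prefix of paths starting with w ∈ V, where w is from the same block as v.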
These definitions also apply to general elastic block graphs and to elastic degenerate strings as their special case.
We note that not all MSAs admit a segmentation leading to a (semi-)repeat-free EFG, e.g. an alignment with rows -A and AA. However, our algorithms detect such cases, thus one can build an EFG consisting of just one block with the rows of the MSA (with gaps removed). Such EFGs can be indexed using standard string data structures to support efficient queries.

Basic tools
A trie [16] of a set of strings is a rooted directed tree with outgoing edges of each node labeled by distinct characters such that there is a root to leaf path spelling each string in the set; the shared part of the root to leaf paths to two different leaves spell the common prefix of the corresponding strings. Such a trie can be computed in O(N log σ) time, where N is the total length of the strings, and it supports string queries that require O(q log σ) time, where q is the length of the queried string.
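As an illustration of the trie described above, here is a minimal Python sketch. The class name `Trie` and the end-of-string marker key are assumptions for this example. Nested dictionaries give expected constant time per character rather than the O(log σ) bound stated in the text, which refers to ordered child lists.

```python
class Trie:
    """Trie over a set of strings, stored as nested dictionaries."""

    END = "$end"  # multi-character key, so it cannot clash with a symbol

    def __init__(self, strings=()):
        self.root = {}
        for s in strings:
            self.insert(s)

    def insert(self, s):
        node = self.root
        for ch in s:
            node = node.setdefault(ch, {})
        node[self.END] = True

    def _walk(self, q):
        """Follow q from the root; return the node reached or None."""
        node = self.root
        for ch in q:
            if ch not in node:
                return None
            node = node[ch]
        return node

    def contains(self, q):
        node = self._walk(q)
        return node is not None and self.END in node

    def has_prefix(self, q):
        return self._walk(q) is not None


trie = Trie(["ACTA", "ATTA", "AGC"])
print(trie.contains("ACTA"), trie.contains("ACT"), trie.has_prefix("ACT"))
```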
In a compact trie the maximal non-branching paths of a trie become edges labeled with the concatenation of labels on the path. The suffix tree is the compact trie of all suffixes of string T0, that is, of T with end marker 0 appended. In this case, the edge labels are substrings of T and can be represented in constant space as an interval. Such a tree takes linear space and can be constructed in linear time [20] so that when reading the leaves from left to right, the suffixes are listed in their lexicographic order. The leaves hence form the suffix array [36] of string T, which is an array SA[1..n + 1] such that SA[i] = j if T'[j..n + 1] is the i-th smallest suffix of string T' = T0, where $T \in \{1, 2, \ldots, \sigma\}^n$. A generalized suffix tree or array is one built on a set of strings. In this case, string T above is the concatenation of the strings with symbol 0 between each.
Let Q[1..m] be a query string. If Q occurs in T, then the locus or implicit node of Q in the suffix tree of T is (v, k) such that Q = XY, where X is the path spelled from the root to the parent of v and Y is the prefix of length k of the edge from the parent of v to v. The leaves of the subtree rooted at v are then all the suffixes sharing the common prefix Q. Let the left- and right-most leaves in that subtree be the c-th and d-th smallest suffixes of T0. Then Q occurs in T exactly at the starting positions stored in SA[c..d] and does not occur elsewhere. We use heavily this connection of suffix tree node v and suffix array interval SA[c..d]. Moreover, there are succinct data structures to do this mapping in both directions in constant time [46].
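The connection between a query and its suffix array interval can be sketched as follows. This is a naive illustration, not the linear-time construction cited above: `suffix_array` sorts suffixes directly, `sa_interval` uses two binary searches, indices are 0-based, and the returned interval is half-open, all of which are choices made for this example only.

```python
def suffix_array(text):
    """Suffix array of text + end marker, by direct sorting (illustration only)."""
    t = text + "\0"  # chr(0) plays the role of the end marker 0
    return sorted(range(len(t)), key=lambda i: t[i:])


def sa_interval(text, sa, q):
    """Return (c, d) such that sa[c:d] lists all start positions of q in text."""
    t = text + "\0"
    m = len(q)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= q
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] < q:
            lo = mid + 1
        else:
            hi = mid
    c = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > q
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] <= q:
            lo = mid + 1
        else:
            hi = mid
    return c, lo


text = "AGCGACTAGATAC"
sa = suffix_array(text)
c, d = sa_interval(text, sa, "TA")
print(sorted(sa[c:d]))  # 0-based start positions of "TA"
```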
Let aX and X be paths spelled from the root of a suffix tree to nodes v and w, respectively. Then one can store a suffix link from v to w. Implicit suffix links for implicit nodes are defined analogously, but they are not stored explicitly. In many algorithms, one can simulate implicit suffix links through explicit suffix links, as the work amortizes to a constant per step.
An Aho-Corasick automaton [1] is a trie of a set of strings with additional pointers (fail-links). While scanning a query string, these pointers (and some shortcut links on them) make it possible to identify all the positions in the query at which a match for any of the strings occurs. Construction of the automaton takes the same time as that of the trie. Queries take O(q log σ + occ) time, where occ is the number of matches.
A fully-functional bidirectional BWT index [6] expands the steps to allow contracting symbols from the left or from the right. That is, substring Q[l..r] can be modified into Q[l + 1..r] (left contraction) or to Q[l..r − 1] (right contraction) and the corresponding interval pair can be updated in constant time.
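The automaton can be sketched in Python as below. This is a textbook-style illustration under assumed names (`build_automaton`, `find_all`); output lists are merged along fail-links at construction time instead of using separate shortcut links, and matches are reported as 0-based start positions.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, fail and output tables for a set of non-empty patterns."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(p)
    queue = deque(goto[0].values())  # depth-1 states keep fail-link 0
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit matches ending here
    return goto, fail, out


def find_all(query, automaton):
    """Return (start, pattern) for every pattern occurrence in query."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(query):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            matches.append((i - len(p) + 1, p))
    return matches


automaton = build_automaton(["AGC", "AGCG", "ACTA", "GTAG"])
print(find_all("CGACTAGTA", automaton))
```

In the setting of this paper the patterns would be node labels, so a reported match is a candidate position at which the query spans a complete node.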
Among the auxiliary structures used in BWT-based indexes, we explicitly use the rank and select structures: String B[1..n] from binary alphabet is called a bitvector. Operation rank(B, i) returns the number of 1s in B[1..i]. Operation select(B, j) returns the index i containing the j-th 1 in B. Both queries can be answered in constant time using an index requiring o(n) bits in addition to the bitvector itself [31].
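The semantics of rank and select can be illustrated with a small Python sketch. The class `BitVector` is hypothetical and stores plain prefix sums and a position list, so it answers both queries in constant time but uses far more than the o(n) extra bits of the succinct structures cited above. Indices are 1-based to match the text.

```python
class BitVector:
    """Bitvector with rank and select, using plain (non-succinct) tables."""

    def __init__(self, bits):
        self.bits = list(bits)
        self.prefix = [0]  # prefix[i] = number of 1s in B[1..i]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)
        # ones[j - 1] = 1-based index of the j-th 1
        self.ones = [i + 1 for i, b in enumerate(self.bits) if b]

    def rank(self, i):
        """Number of 1s in B[1..i]."""
        return self.prefix[i]

    def select(self, j):
        """Index of the j-th 1 in B."""
        return self.ones[j - 1]


B = BitVector([1, 0, 1, 1, 0, 0, 1])
print(B.rank(4), B.select(3))
```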
We summarize a result that we use later.

Deterministic index construction for integer alphabets
We now describe how to replace the randomized linear time construction of the bidirectional BWT index with a deterministic one, so that left extensions and right contractions are supported in O(log σ) time. We need only a single-directional subset of the index of Belazzougui and Cunial [6], consisting of an index on the BWT, augmented with balanced parentheses representations of the topologies of the suffix tree of T and of the suffix link tree of the reverse of T, such that the nodes corresponding to maximal repeats in both topologies are marked [6]. Belazzougui et al. showed how to construct the BWT of a string in O(n) deterministic time and O(n log σ) bits of space [8]. Their algorithm can be used to construct both the BWT of T and the BWT of the reverse of T in O(n) time. We can build the bidirectional BWT index [7,47] by indexing both BWTs as wavelet trees [27]. The bidirectional index can then be used to construct both of the required tree topologies in O(n log σ) time using bidirectional extension operations and the counter-based topology construction method of Belazzougui et al. [8]. See the supplement of a paper on variable order Markov models by Cunial et al. [15] for more details on the construction of the tree topologies. The topologies are then indexed for various navigational operations required by the contraction operation described in [6].
This index enables left extensions in O(log σ) time using the BWT of T, and right contractions in constant time using the succinct tree topologies as shown by Belazzougui and Cunial [6]. We summarize the result in the lemma below. We note that it may be possible to improve the time of left extensions to O(log log σ) by replacing monotone minimal perfect hash functions by slower yet deterministic linear time constructible data structures in all constructions leading to Theorem 6.7 of Belazzougui et al. [8].

Indexable repeat-free founder graphs
We now consider non-elastic founder graphs induced from gapless MSAs, and later turn back to the general case. We show that there exists a family of founder graphs that admit a polynomial time constructible index structure supporting fast string matching. First, a trivial observation: the input multiple alignment is a founder graph for the segmentation consisting of only one segment. Such a founder graph (set of sequences) can be indexed in linear time to support linear time string matching [8]. Now, the question is, are there other segmentations that allow the resulting founder graph to be indexed in polynomial time? We show that this is the case.
Proposition 1. Repeat-free founder graphs can be indexed in polynomial time to support polynomial time string queries.
To prove the proposition, we construct such an index and show how queries can be answered efficiently. Our first solution uses just classical data structures, and works as a warm-up. Later we improve this solution using succinct data structures, and while doing so we exploit the connections to the derivations in this section.
Let P(v) be the set of all paths starting from node v and ending in a sink node. Let P(v, i) = {ℓ(L)[i..] | L ∈ P(v)} be the set of suffix path labels for 1 ≤ i ≤ |ℓ(v)|, and consider sorting $P = \bigcup_{v \in V, 1 \le i \le |\ell(v)|} P(v, i)$ in lexicographic order. Then one can binary search any query string Q in P to find out if it occurs in G(S) or not. The problem with this approach is that P is of exponential size.
However, if we know that G(S) is repeat-free, we know that the lexicographic order of ℓ(L)[i..], L ∈ P(v), is fully determined by its prefix ℓ(v)[i..|ℓ(v)|]ℓ(w), where w is the node following v on the path L, except against other suffix path labels starting with ℓ(v)[i..|ℓ(v)|]ℓ(w). From now on, let P(v, i) denote the set of suffix path labels cut in this manner. Now the corresponding set $P = \bigcup_{v \in V, 1 \le i \le |\ell(v)|} P(v, i)$ is no longer of exponential size. Consider again binary searching a string Q in sorted P. If Q occurs in P then it occurs in G(S). If not, Q has to have some ℓ(v) for v ∈ V as its substring in order to occur in G(S).
To figure out if Q contains ℓ(v) for some v ∈ V as its substring, we build an Aho-Corasick automaton [1] for the set of node labels {ℓ(v) | v ∈ V}. To verify such a potential match, we need several tries [16].
Assume now we have located (using the Aho-Corasick automaton) a node v ∈ V such that Q[i..j] = ℓ(v). The verification first extends this candidate to the left with a trie search, continuing from the node corresponding to the leaf we reached in the trie. If the search succeeds after reading Q[1], we have found a path in G(S) spelling Q[1..j]. We repeat the analogous procedure with Q[j..m] starting from trie F(v). That is, we can verify a candidate occurrence of Q in G(S) in O(|Q| log σ) time, as the search in the tries takes O(log σ) time per step.
We are now ready to specify a theorem that reformulates Proposition 1 in detailed form. For the correctness of the verification, assume that Q occurs in G(S) but the verification starting from a candidate match Q[i..j] = ℓ(v) fails. That is, we found another occurrence of ℓ(v) in G, which is a contradiction with the repeat-free property. Hence, the verification starting from an arbitrary candidate match is sufficient. With preprocessing time O(N log σ) we can build the Aho-Corasick automaton [1]. The tries can likewise be built in the preprocessing phase. The search for a candidate match and the following verification take O(|Q| log σ) time.
We are left with the case of short queries not spanning a complete node label. To avoid the costly binary search in sorted P, we instead construct the unidirectional BWT index [8] for the concatenation $C = \prod_{i \in \{1,2,\ldots,b\}} \prod_{v \in V^i, (v,w) \in E} \ell(v)\ell(w)0$. Concatenation C is thus a string of length O(|E|L) from alphabet {0, 1, 2, . . . , σ}. The unidirectional BWT index for C can be constructed in O(|C|) time, so that in O(|Q|) time, one can find out if Q occurs in C [8]. This query equals that of binary search in P.
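The role of the concatenation C for short queries can be illustrated with a few lines of Python. This sketch assumes edges are given as pairs of node labels (which identify nodes uniquely in a repeat-free graph) and uses Python's built-in substring test as a stand-in for the BWT index; `edge_concatenation` and `occurs_short` are hypothetical names, and the two-block graph is a toy example.

```python
def edge_concatenation(edges, sep="\0"):
    """Concatenate l(v) l(w) 0 over all edges (v, w), given as label pairs."""
    return "".join(lv + lw + sep for lv, lw in sorted(edges))


def occurs_short(query, c):
    """Substring test on C; a BWT index would answer this in O(|Q|) time."""
    return query in c


# Toy repeat-free founder graph with blocks {AC, AG} and {TT, TC}.
edges = {("AC", "TT"), ("AG", "TC")}
C = edge_concatenation(edges)
print(occurs_short("CT", C), occurs_short("CTC", C))
```

The separator keeps a query from matching across two unrelated edges, which is why CTC is rejected here: it would need an edge from AC to TC.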
The above result can also be applied to degenerate strings [3]. These are a special case of elastic degenerate strings with equal length strings inside each block, and can thus be seen as fully connected block graphs.
Corollary 1. The results of Theorem 1 hold for a repeat-free degenerate string a.k.a. a fully connected repeat-free founder graph.
Observe that N < |C| ≤ 2mn, where C is the concatenation in the proof above (whose length was bounded by O(L|E|)), and m and n are the number of rows and number of columns, respectively, in the multiple sequence alignment from where the founder graph is induced. That is, the index construction algorithms of the above theorems can be seen to take time almost linear in the (original) input size, namely, O(mn log σ) time. We study succinct variants of these indexes in Sect. 5, and also improve the construction and query times to linear as a side product.

Construction of repeat-free founder graphs
Now that we know how to index repeat-free founder graphs, we turn our attention to the construction of such graphs from a given MSA. For this purpose, we will adapt the dynamic programming segmentation algorithms for founders [13,42].
The idea is as follows. Let S be a segmentation of MSA[1..m, 1..n]. We say S is valid (or repeat-free) if it induces a repeat-free founder graph G(S) = (V, E). A segment in S is valid (or repeat-free) if S is valid. We build such a valid S by considering valid segmentations of prefixes of the MSA from left to right, looking at shorter valid segmentations appended with a valid new segment.

Characterization lemma
Given a segmentation S and founder graph G(S) = (V, E) induced by S, we can ensure that it is valid by checking if, for all v ∈ V, ℓ(v) occurs in the rows of the MSA only in the interval of the block containing v. Let $P = [x_1..y_1], [x_2..y_2], \ldots, [x_b..y_b]$ be the partitioning corresponding to a segmentation S inducing a block graph G = (V, E). The segmentation S is valid if and only if, for all blocks $V^i \subseteq V$ and all $v \in V^i$, ℓ(v) occurs in the rows of the MSA only at columns $[x_i..y_i]$.
Proof. To see that this is a necessary condition for the validity of S, notice that each row of the MSA can be read through G, so if ℓ(v) occurs elsewhere than inside the block, then these extra occurrences make S invalid. To see that this is a sufficient condition for the validity of S, we observe the following:
a) For every edge (v, w) ∈ E, the concatenation ℓ(v)ℓ(w) is a substring of some row of the input MSA, since edges are induced by rows.
b) Let U be the label of a path in G that is not a substring of any row of the input MSA. Then any substring of U either occurs in some row of the input MSA or it includes ℓ(u), for some node u on the path, as its substring.
c) Thus, any substring of a path in G either is a substring of some row of the input MSA, or it includes ℓ(u) of case b) as its substring.
d) Let α be a substring of a path of G that includes ℓ(u) as its substring. If ℓ(z) = α for some z ∈ V, then ℓ(u) appears at least twice in the MSA. Substring α makes S invalid only if ℓ(u) does.
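The characterization suggests a direct, brute-force validity test for gapless MSAs, sketched below in Python. The function names are hypothetical, segments are 1-based and inclusive as in the text, and the quadratic scan is only meant to illustrate the condition, not the efficient preprocessing developed later.

```python
def is_valid_segment(msa, x, y):
    """Check that no string of segment [x..y] occurs in any row of the
    gapless MSA at a start column other than x."""
    length = y - x + 1
    labels = {row[x - 1:y] for row in msa}
    n = len(msa[0])
    for row in msa:
        for j in range(n - length + 1):  # j is a 0-based start column
            if j != x - 1 and row[j:j + length] in labels:
                return False
    return True


def is_valid_segmentation(msa, segments):
    """A segmentation is valid when every one of its segments is."""
    return all(is_valid_segment(msa, x, y) for x, y in segments)


print(is_valid_segmentation(["ACTT", "AGTC"], [(1, 2), (3, 4)]))
print(is_valid_segmentation(["ACAC", "AGAG"], [(1, 2), (3, 4)]))
```

In the second call the label AC of the first block reappears at column 3, so the segmentation is rejected.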

From characterization to a segmentation
Among the valid segmentations, we wish to select an optimal segmentation under some goodness criteria. We consider three score functions for the valid segmentations, one maximizing the number of blocks, one minimizing the maximum height of a block, and one minimizing the maximum length of a block. The latter two have been studied earlier without the repeat-free constraint, and non-trivial linear time solutions have been found [13,42], while the first score function makes sense only with this new constraint.
Let s(j') be the score of an optimal scoring segmentation $S^1, S^2, \ldots, S^b$ of prefix MSA[1..m, 1..j'] for a selected scoring scheme. Then

$$s(j) = \bigoplus_{j' \,:\, 0 \le j' < j,\ \mathrm{MSA}[1..m, j'+1..j] \text{ is a repeat-free segment}} w(s(j'), j', j) \qquad (1)$$

gives the score of an optimal scoring repeat-free segmentation $S^1, S^2, \ldots, S^{b+1}$ of MSA[1..m, 1..j], where ⊕ is an operator depending on the scoring scheme and w(x, j', j) is a function on the score x of the segmentation $S^1, S^2, \ldots, S^b$ and on the last block $S^{b+1}$ corresponding to MSA[1..m, j' + 1..j]. For initialization, set s(0) = 0. Moreover, when there is no valid segmentation for some j, s(j) = ∞. Finally, to fix this recurrence so that s(n) equals the minimum of maximum length of blocks over valid segmentations of MSA[1..m, 1..n], set ⊕ = min and w(x, j', j) = max(x, j − j'). For initialization, set s(0) = 0. Moreover, when there is no valid segmentation for some j, set s(j) = ∞.
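The recurrence for minimizing the maximum block length can be evaluated directly, as in the following Python sketch. The validity test is passed in as a predicate `valid(x, y)` on 1-based inclusive segments; this interface, the function name, and the table of valid segments (which corresponds to a toy gapless MSA with rows ACTT and AGTC) are assumptions for illustration. The double loop takes quadratically many predicate calls and is not the efficient algorithm of the paper.

```python
import math


def min_max_block_length(n, valid):
    """Evaluate s(j) = min over valid last segments [j'+1..j] of
    max(s(j'), j - j'), with s(0) = 0 and s(j) = inf if nothing is valid."""
    s = [math.inf] * (n + 1)
    s[0] = 0
    for j in range(1, n + 1):
        for jp in range(j):
            if s[jp] < math.inf and valid(jp + 1, j):
                s[j] = min(s[j], max(s[jp], j - jp))
    return s[n]


# Hypothetical table of valid segments for a 4-column alignment.
valid_segments = {(1, 1), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)}
print(min_max_block_length(4, lambda x, y: (x, y) in valid_segments))
```

Replacing min by max and max(x, j − j') by x + 1 (with −∞ marking the absence of a valid segmentation) would give the variant maximizing the number of blocks.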
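To derive efficient dynamic programming recurrences for these scoring functions, we separate the computation into the preprocessing phase and into the main computation. In the preprocessing phase, we compute values v(j) and f(j), 1 ≤ j ≤ n, defined as follows. Value v(j) is the largest integer such that segment MSA[1..m, v(j) + 1..j] is valid, and value f(j) is the smallest integer such that segment MSA[1..m, j + 1..f(j)] is valid. Note that v(j) may not be defined for small j and f(j) may not be defined for large j (short blocks may not be repeat-free).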
Assuming values v(j) have been preprocessed, we can simplify recurrence (1) into

$$s(j) = \bigoplus_{j' \,:\, 0 \le j' \le v(j)} w(s(j'), j', j)$$

by observing that left-extensions of valid segments are also valid. We use this equation later for deriving a linear time solution for minimizing the maximum length of a block score.
With values f (j) we can use an analogous observation that right-extensions of valid segments are also valid. This observation directly yields forward-propagation dynamic programming solutions for maximizing the number of blocks score and for minimizing the maximum length of a block score. These are given in Algorithms 1 and 2. We leave it for future work to derive similar result for minimizing the maximum height of a block score.
Input: Right-extensions (j, f (j)) sorted from smallest to largest order by second component: . . , f (j n ) are not defined. Output: Score of an optimal repeat-free segmentation maximizing the number of blocks.
Correctness of Algorithms 1 and 2 follow from the fact that when computing the score at column j, all earlier segmentations that are safe to be extended with a new segment Input: Right-extensions (j, f (j)) sorted from smallest to largest order by second component: . . , f (j n ) are not defined. Output: Score of an optimal semi-repeat-free segmentation minimizing the maximum segment length. Initialize one-dimensional search trees T and I with keys 0, 1, 2, . . . , 2n, with all keys associated with values ∞; x ← x + 1; end minmaxlength(j) ← min(T .RangeMin(j + 1, ∞), I.RangeMin(−∞, j) + j); end return minmaxlength(n); Algorithm 2: An O(n log n) time algorithm for finding an optimal semi-repeat-free segmentation minimizing the maximum segment length. Minimization over an empty set is assumed to return ∞. Operation Upgrade(k, v) sets key k to value v if the previous value is larger. Operation RangeMin(a, b) returns the smallest value associated with keys in range [a..b]. Both operations can be supported in O(log n) time with standard balanced search trees. ending at j are considered. We formalize this argument for Algorithm 2, as the proof for the other is analogous and easier. Assume by induction that minmaxlength(j ) is the score max i:1≤i≤b L(S i ) of an optimal semi-repeat-free segmentation S 1 , S 2 , . . . , S b of MSA[1..m, 1..j ], for j < j. Each minmaxlength(j ) is added to the data structures when the corresponding segmentation can be considered to be appended with segment S b+1 corresponding to MSA[1..m, j + 1..j], for j ≥ f (j ), so that the result is a semi-repeat-free segmentation. 
The minimum values from the data structures equal the definition of segmentation score max_{i:1≤i≤b+1} L(S^i). To see this, we have two cases to consider: a) for j' such that minmaxlength(j') > j − j', the score of the segmentation ending at j' extended with [j' + 1..j] is minmaxlength(j'), and b) for j' such that minmaxlength(j') ≤ j − j', the score of the segmentation ending at j' extended with [j' + 1..j] is j − j'. The query intervals guarantee that the minima are returned for the cases a) and b), with the latter adjusted by +j so that it gives j − j' for minimum −j' in tree I. Initialization guarantees that the score of the first segment is correctly computed. Traceback from minmaxlength(n) gives an optimal semi-repeat-free segmentation.
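The two-tree argument above can be sketched in code. This is a reconstruction under stated assumptions, not the paper's Algorithm 2 verbatim: both trees are keyed by j' + minmaxlength(j'), tree T stores minmaxlength(j') and tree I stores −j', an array-backed segment tree (`MinTree`, a hypothetical helper) stands in for the balanced search trees, and f is passed as a list with `None` for undefined values.

```python
import math


class MinTree:
    """Array-backed segment tree over keys 0..size-1 (stand-in for a balanced search tree)."""

    def __init__(self, size):
        self.size = size
        self.vals = [math.inf] * (2 * size)

    def upgrade(self, k, v):
        """Set key k to value v if the previous value is larger."""
        i = k + self.size
        while i >= 1 and self.vals[i] > v:
            self.vals[i] = v
            i //= 2

    def range_min(self, a, b):
        """Smallest value among keys in [a..b]; inf for an empty range."""
        res = math.inf
        lo = max(a, 0) + self.size
        hi = min(b, self.size - 1) + self.size + 1
        while lo < hi:
            if lo & 1:
                res = min(res, self.vals[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                res = min(res, self.vals[hi])
            lo //= 2
            hi //= 2
        return res


def min_max_length(n, f):
    """Minimum over valid segmentations of the maximum segment length (inf if none)."""
    T = MinTree(2 * n + 1)  # case a): value minmaxlength(j')
    I = MinTree(2 * n + 1)  # case b): value -j'
    mml = [math.inf] * (n + 1)
    mml[0] = 0
    pairs = sorted((f[j], j) for j in range(n) if f[j] is not None)
    x = 0
    for j in range(1, n + 1):
        while x < len(pairs) and pairs[x][0] <= j:
            jp = pairs[x][1]
            if mml[jp] < math.inf:
                key = jp + mml[jp]  # key > j exactly when mml[jp] > j - jp
                T.upgrade(key, mml[jp])
                I.upgrade(key, -jp)
            x += 1
        mml[j] = min(T.range_min(j + 1, 2 * n), I.range_min(0, j) + j)
    return mml[n]
```

With this keying, a key larger than j corresponds to case a) and is answered from T, while a key at most j corresponds to case b) and is answered from I after adding j.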

Preprocessing
We can do the preprocessing for values v(j) and f(j) in O(mn) time. The idea is to build a BWT index on the MSA rows, and then search all rows backward from right to left in parallel (with everything reversed for the latter values). Once we reach a column j' where all suffixes have altogether exactly m occurrences (the union of their BWT intervals is of size m), MSA[1..m, j'..n] is a valid segment. Then we can drop the last column (do right-contract on all rows) and continue left-extensions until finding the largest j' such that MSA[1..m, j'..n − 1] is a valid segment. Continuing this way, we can find for each column j the value v(j) = j' − 1. The bottleneck of the approach is the computation of the size of the union of intervals, but we can avoid a trivial computation by exploiting the repeat-free property and the order in which these intervals are computed.
Proof. We consider values v(j) as the other case is symmetric. Let us build the bidirectional BWT index [8] of MSA rows concatenated into one long string with some separator symbols added between rows. We will run several phases in synchronization over this BWT index, but we explain them first as if they were run independently.

Faster algorithm for minimizing the maximum block length
Recall Eq. (2). Let us consider the score w(x, j', j) = max(s(j'), j − j') with the operator of Eq. (2) set to min, that is, minimizing the maximum block length over valid segmentations. Algorithm 2 solved this problem in near-linear time, but now we improve this to linear time using values v(j) instead of f(j). The basic observation is that v(J) ≤ v(J + 1) ≤ ··· ≤ v(n), for some J > 0, and hence the range where the minimum is taken grows as j grows.
Cazaux et al. [13] considered a similar recurrence and gave a linear time solution for it. In what follows we modify that technique to work with valid ranges.
Proof. By the definition of x(.), for any j ∈ [1.
• Case where B = max(A, B) and C = max(C, D). We can assume that B > A (in the other case, we take A = max (A, B)) and as A > C, we have B > C + 1 and thus max(A + 1, B) > max(C + 1, D) which is impossible because max(A + 1, B) < max(C + 1, D). Proof. We need just to compare k = max(j − j , s(j )) and s(j ) where j is in arg min j ∈[j +1..v(j)] s(j ). If k is smaller than s(j ), k is smaller than all the s(j ) with j ∈ [j + 1..v(j)] and thus for all max(j − j , s(j )). Hence we have j = max arg min j ∈[j ..v(j)] max(j − j , s(j )).
Otherwise, s(j ) ≥ k and as k ≥ j − j , max(j − j , s(j )) ≥ k. In this case j = max arg min j ∈[j ..v(j)] max(j − j , s(j )). By using the constant time semi-dynamic range maximum query by Cazaux et al. [13] on the array s(.), we can obtain in constant time j and thus check the equality in constant time. to v(j) with the equality of Lemma 6 until one is true and thus corresponds to x(j). Finally, we add s(j) = max(j − x(j), s(x(j))) to the constant time semi-dynamic range maximum query and continue with j + 1.

Succinct index for repeat-free founder graphs
Recall the indexing solutions of Sect. 3 and the definitions from Sect. 2.
We now show that the explicit tries and the Aho-Corasick automaton can be replaced by some auxiliary data structures associated with the Burrows-Wheeler transform (BWT) of the concatenation C = ∏_{i∈{1,2,...,b}} ∏_{v∈V^i, (v,w)∈E} ℓ(v)ℓ(w)0.
Consider interval SA[i..k] in the suffix array of C corresponding to suffixes having ℓ(v) as prefix for some v ∈ V. From the repeat-free property it follows that this interval can be split into two subintervals, SA[i.
We are now ready to present the search algorithm that uses only the BWT of C and some small auxiliary data structures. We associate two bitvectors B and E to the BWT of C as follows. We set B
We can now strictly improve Theorem 1 and Corollary 1 as follows.
Proof. As we expand the search interval in BWT, it is evident that we still find all occurrences for short patterns that span at most two nodes, like in the proof of Theorem 1. We need to show that a) the expansions do not yield spurious occurrences for such short patterns and b) the expansions yield exactly the occurrences for long patterns that we earlier found with the Aho-Corasick and tries approach. In case b), notice that after an expansion step we are indeed in an interval SA[i..k] where all suffixes match ℓ(v) and which thus corresponds to a node v ∈ V. The suffix of the query processed before reaching interval SA[i..k] must be at least of length |ℓ(v)|. That is, to mimic the Aho-Corasick approach, we should continue with the trie R(v). This is identical to taking a backward step from BWT[i..k], and continuing therein to follow the rest of this implicit trie. To conclude case b), we still need to show that we reach all the same nodes as when using Aho-Corasick, and that the search in the other direction with L(v) can be avoided. These follow from case a), as we see.
In case a), before doing the first expansion, the search is identical to the original algorithm in the proof of Theorem 1. After the expansion, all matches to be found are those of case b). That is, no spurious matches are reported. Finally, no search interval can include two distinct node labels, so the search reaches the only relevant node label, where the Aho-Corasick and trie search simulation takes place. We reach all such nodes that can yield a full match for the query, as the proof of Theorem 1 shows that it is sufficient to follow an arbitrary candidate match.
Figure 2: Gadgets G_be, G_0 and G_1. Each gadget is organized into three rows, each row encoding a different partitioning of the strings bbbb, eeee, 0000, 1111. This ensures that, when combining these gadgets in Figure 3, edges can be controlled to go within the same row, or to the row below.
As we only need a standard backward step, we can use a unidirectional BWT index constructable in deterministic O(L|E|) time supporting a backward step in constant time [8].

Conditional hardness of indexing EFGs
We now turn our attention to general MSAs and elastic founder graphs induced from their segmentation. With non-elastic founder graphs, we have seen that the repeat-free property makes them indexable, but for now we have no proof that such a property is necessary. For elastic founder graphs, we are able to derive such a conditional lower bound.
Namely, we show a reduction from Orthogonal Vectors (OV) to the problem of matching a query string in an EFG, continuing the line of research conducted on many related (degenerate) string problems [3,17,25,29]. The OV problem is to find out if there exist x ∈ X and y ∈ Y such that x · y = 0, given two sets X and Y of n binary vectors each. We construct string Q using X and graph G using Y. Then, we show that Q has a match in G if and only if X and Y form a "yes"-instance of OV. We condition our results on the following OV hypothesis, which is implied by the Strong Exponential Time Hypothesis [30].
Definition 5 (Orthogonal Vectors Hypothesis (OVH) [50]). Let X, Y be the two sets of an OV instance, each containing n binary vectors of length d. For any constant ε > 0, no algorithm can solve OV in time O(poly(d) n^{2−ε}).
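For reference, the OV problem itself has a straightforward quadratic-time solution, which is the baseline that OVH conjectures cannot be improved to strongly sub-quadratic time. The sketch below is only an illustration; the function name `has_orthogonal_pair` and the list-of-lists encoding of vectors are assumptions made here.

```python
def has_orthogonal_pair(X, Y):
    """Brute-force OV check in O(n^2 d) time.

    X and Y are assumed to be lists of equal-length 0/1 vectors.
    Returns True iff some x in X and y in Y satisfy x . y = 0.
    """
    return any(
        all(a * b == 0 for a, b in zip(x, y))
        for x in X
        for y in Y
    )
```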

Query string
We build string Q by combining string gadgets Q_1, ..., Q_n, one for each vector in X, plus some additional characters. To build string Q_i, first we place four b characters, then we scan vector x_i ∈ X from left to right. For each entry of x_i, we place sub-string Q_{i,h} consisting of four 0 characters if x_i[h] = 0, or four 1 characters if x_i[h] = 1. Finally, we place four e characters. For example, vector x_i = 101 results in string Q_i = bbbb 1111 0000 1111 eeee. The full string Q is then the concatenation Q = bbbb Q_1 Q_2 ... Q_n eeee. The reason behind these specific quantities will be clear when discussing the structure of the graph.
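The construction of the query string can be sketched directly from the description above. The function name `build_query` and the list-of-lists encoding of X are assumptions made for illustration.

```python
def build_query(X):
    """Build Q = bbbb Q_1 Q_2 ... Q_n eeee from a list X of 0/1 vectors."""
    parts = ["bbbb"]
    for x in X:
        # Q_i: four b's, then four copies of each entry of x_i, then four e's.
        parts.append("bbbb" + "".join(str(bit) * 4 for bit in x) + "eeee")
    parts.append("eeee")
    return "".join(parts)
```

For n vectors of length d, the resulting string has 4(d + 2)n + 8 characters, which matches the |Q| = O(nd) bound used later in the reduction.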

Elastic founder graph
We build graph G by combining three different sub-graphs: G_L, G_M, G_R (for left, middle and right). Our final goal is to build a graph structured in three logical "rows". We denote the three rows of G_M as G_{M1}, G_{M2}, G_{M3}, respectively. The first and the third rows of G, along with sub-graphs G_L and G_R (introduced to allow slack), can match any vector. The second row matches only sub-patterns encoding vectors that are orthogonal to the vectors of set Y. The key is to structure the graph such that the pattern is forced to utilize the second row to obtain a full match. We present the full structure of the graph in Figure 3, which shows the graph built on top of vector set {100, 011, 010}. In particular, G_M consists of n gadgets G_M^j, one for each vector y_j ∈ Y. The key elements of these sub-graphs are gadgets G_be, G_0 and G_1 (see Figure 2), which allow stacking together multiple instances of strings b^4, e^4, 1^4, 0^4. The overall structure mimics the one in [17], except for the new idea from Figure 2.

Detailed structure of the graph.
Sub-graph G_L (Figure 3a) consists of a starting segment with a single node labeled b^4, followed by n − 1 sub-graphs G_L^1, ..., G_L^{n−1}, in this order. Each G_L^i has d + 2 segments, and is obtained as follows. First, we place a segment containing only one node with label b^4, then we place d other segments, each one containing two nodes with labels 1^4 and 0^4. Finally, we place a segment containing two nodes with labels b^4 and e^4.
The nodes in each segment are connected to all nodes in the next segment, with the exception of the last segment of each G_L^i: in this case, the node with label 1^4 and the one with label 0^4 are connected only to the e^4-node of the next (and last) segment of such G_L^i. Sub-graph G_R (Figure 3c) is similar to sub-graph G_L, and it consists of n − 1 parts G_R^1, ..., G_R^{n−1}, followed by a segment with a single node labeled e^4. Part G_R^i has d + 2 segments, and is constructed almost identically to G_L^i. The differences are that, in the first segment of G_R^i, we place two nodes labeled b^4 and e^4, while in the last segment we place only one node, which we label e^4.
As in G_L, the nodes in each segment are connected to all nodes in the next segment, with the exception of the first segment of each G_R^i: in this case, the node labeled e^4 has no outgoing edge.
Sub-graph G_M (Figure 3b) implements the main logic of the reduction, and it uses three building blocks, G_be, G_0 and G_1, which are organized in three rows, as shown in Figure 2. Sub-graph G_M has n parts, G_M^1, ..., G_M^n, one for each of the vectors y_1, ..., y_n in set Y. Each G_M^j is constructed, from left to right, as follows. First, we place a G_be gadget. Then, we scan vector y_j from left to right and, for each position h ∈ {1, ..., d}, we place a G_0 gadget if the h-th entry is y_j[h] = 0, or a G_1 gadget if y_j[h] = 1. Finally, we place another G_be gadget.
For the edges, we first consider each gadget G_M^j separately. Let G_h and G_{h+1} be the gadgets encoding y_j[h] and y_j[h + 1], respectively. We fully connect the nodes of G_h to the nodes of G_{h+1} row by row, respecting the structure of the segments. Then we connect, row by row, the b-nodes of the left G_be to the leftmost G_h, which encodes y_j[1], and the nodes of the rightmost G_h, which encodes y_j[d], to the e-nodes of the right G_be, again row by row. We repeat the same placement of the edges for every pair of gadgets G_h, G_{h+1}, 1 ≤ h ≤ d − 1; this construction is shown in Figure 3b.
To conclude the construction of G_M, we need to connect all the G_M^j gadgets together. Consider the right G_be of gadget G_M^j, and the left G_be of gadget G_M^{j+1}. The edges connecting these two gadgets are depicted in Figure 3b, which shows that following a path we can either remain in the same row or move to the row below, but we cannot move to the row above. Moreover, sub-pattern b^8 can be matched only in the first and second rows, while sub-pattern e^8 can be matched only in the second and third rows.
In proving the correctness of the reduction, we will use G_{M1}, G_{M2} and G_{M3} to refer to the sub-graphs of G_M consisting of only the nodes and edges of the first, second and third row, respectively. Formally, for t ∈ {1, 2, 3}, V_{Mt} ⊂ V is the set of nodes placed in the t-th row of each G_be, G_0 or G_1 gadget belonging to sub-graph G_M, and . We will use the notation G_{M2}^j to refer to the nodes belonging to both G_M^j and G_{M2}, excluding the ones in G_{M1} and G_{M3}, and the edges connecting them.
Final graph G is obtained by combining sub-graphs G_L, G_M and G_R. To this end, we connect the nodes in the last segment of G_L with the b-nodes in the first and second row of the left G_be gadget of G_M^1. Finally, we connect the e-nodes in the second and third row of the right G_be gadget of G_M^n with both the b^4-node and the e^4-node in the first segment of G_R. Figures 3a, 3b and 3c can be visualized together, in this order, as one big picture of final graph G. In Figures 3a and 3c we also included the adjacent segment of G_M to show the connection.

OVH conditional hardness
The proof of correctness is similar to the one in [17], but with adaptations to the elastic founder graph. We prove three lemmas concerning G_{M2}, which are key for the correctness. The first lemma is a straightforward consequence of the structure of G_{M2} and the fact that it is directed.
(⇒) By Lemma 7, we can focus on the d distinct and consecutive nodes of G_{M2}^j that match Q_i. In particular we know that each sub-string Q_{i,h} matches in the second row of either the h-th gadget G_0 or the h-th gadget G_1. Consider vectors x_i ∈ X and y_j ∈ Y. If Q_{i,h} = 1^4 has a match in G_{M2}^j it means that its h-th gadget is a G_0, and hence y_j[h] = 0, implying
(⇐) Consider vectors x_i ∈ X and y_j ∈ Y that are such that x_i · y_j = 0. For h = 1, 2, ..., d, if y_j[h] = 0 then the h-th gadget of G_{M2}^j is a G_0 gadget, and Q_{i,h} can surely match it. If y_j[h] = 1 it must hold that x_i[h] = 0, since x_i · y_j = 0. Thus Q_{i,h} = 0^4, and it can have a match in the h-th gadget of G_{M2}^j, no matter if it is a G_0 or G_1 gadget. Finally, sub-strings b^4 and e^4 can have a match in the G_be gadgets at the beginning and end of G_{M2}^j, respectively. All characters of Q_i now have a matching node, and the definition of the edges allows visiting all such nodes via a matching path starting in the left G_be gadget of G_{M2}^j and ending in the right G_be gadget of G_{M2}^j.
Our first lower bound is on matching a query string in an EFG without indexing.
Proof. First, notice that the reduction that we presented for query string Q and EFG G is correct. Indeed, Lemma 9 guarantees that Q has a match in G if and only if a sub-string Q_i has a match in G_M, and this holds, by Lemma 8, if and only if x_i · y_j = 0. Thus, string Q has a match in G if and only if there exist vectors x_i ∈ X and y_j ∈ Y which are orthogonal.
The reduction requires linear time and space in the size O(nd) of the OV problem, and this is because of the construction of string Q and graph G. On one hand, when we define string Q, we place a constant number of characters for each entry of each vector, thus |Q| = O(nd). On the other hand, sub-graphs G_L, G_{M1}, G_{M2}, G_{M3} and G_R all consist of O(n) structures, each one containing O(d) nodes, and a constant number of edges for each node, for an overall size of O(nd).
Hence, given two sets of vectors X and Y, we can perform our reduction obtaining string
We obtain the indexing lower bound by proving that the above reduction is a linear independent-components (lic) reduction, as defined in [18, Definition 3].
Theorem 7. For any α, β, δ > 0 such that β + δ < 2, there is no algorithm preprocessing an EFG G = (V, E, ℓ) in time O(|E|^α) such that for any query string Q we can find a match for Q in G in time O(|Q| + |E|^δ |Q|^β), unless OVH is false. This holds even if restricted to an alphabet of size 4.
Proof. It is enough to notice that the reduction from OV that we presented is a lic reduction. Namely, (1) the reduction is correct and can be performed in linear time and space O(nd) (recall the proof of Theorem 6), and (2) query string Q is defined using only vector set X and is independent of vector set Y, while elastic founder graph G is built using only vector set Y and is independent of vector set X. Hence, Corollary 1 in [18] can be applied, proving our thesis.

Indexing EFGs
Let us now consider how to extend the indexing results to the general case of MSAs with gaps. The idea is that gaps are only used in the segmentation algorithm to define the valid ranges, and that is the only place where special attention needs to be taken; elsewhere, whenever a substring from MSA rows is read, gaps are treated as empty strings. That is, A-GC-TA- becomes AGCTA.
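Reading a gap-oblivious substring from an MSA row can be sketched as below. The name `spell` follows the notation used later in the text; the 1-indexed inclusive column arguments and the use of "-" as the gap symbol are assumptions of this illustration.

```python
GAP = "-"  # assumed gap symbol


def spell(row, start=1, end=None):
    """Gap-free string read from columns start..end (1-indexed, inclusive) of one MSA row."""
    if end is None:
        end = len(row)
    return row[start - 1:end].replace(GAP, "")
```

For instance, `spell("A-GC-TA-")` gives the string AGCTA from the example above.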
As we later see, segmentation becomes more difficult with gaps, and we need to consider a relaxed variant of prefix-free property for obtaining efficient algorithms. Recall that in a semi-repeat-free EFG a node label can appear as a prefix of another node label inside the same block.

Repeat-free case
As the reader can check, the indexing solutions in Sections 3 and 5 work verbatim with the repeat-free elastic founder graphs; the property that the strings inside a block have equal length is not exploited in the algorithms.

Semi-repeat-free case
The case of semi-repeat-free elastic founder graphs is slightly more complex, and we need to combine and extend the previous solutions. The following lemma is the key property needed for the solution.
Lemma 10. Consider a semi-repeat-free EFG G = (V, E, ℓ). String ℓ(v)ℓ(w), where (v, w) ∈ E, can only appear in G as a prefix of paths starting with v.
Proof. Assume for contradiction that ℓ(v)ℓ(w) is a prefix of a path starting inside the label of some v' ∈ V, v' ≠ v. Then ℓ(v) is a prefix of such a path, and this is only possible if v' is in the same block as v and ℓ(v) is a proper prefix of ℓ(v'): otherwise G would not be semi-repeat-free. Then |ℓ(v)| < |ℓ(v')| and ℓ(w) has an occurrence in a path starting inside the label ℓ(v'). This contradicts the fact that G is semi-repeat-free.
Consider the suffix tree of the concatenation D = ∏_{v,w,u : (v,w)∈E, (w,u)∈E} (ℓ(v)ℓ(w)ℓ(u))^{−1} 0. For suffixes of type α ℓ(w)^{−1} ℓ(v)^{−1} 0, where α is a (possibly empty) string, we store node identifier v in the corresponding leaf of the tree. Clearly, queries spanning fewer than three nodes can be located from this suffix tree.
Consider a longer query Q[1..q] whose suffix spans at least three nodes in the graph. We search it backwards in the suffix tree until reaching a locus after which we cannot proceed with Q[i], but could continue with 0. Then we know that Q[i + 1..q] matches a path starting with ℓ(v)ℓ(w) in G. Due to Lemma 10, any leaf in the subtree rooted at the current locus in the suffix tree (which is spelling Q[i + 1..q]^{−1}) stores v. Since we cannot know in advance if Q is a longer query, we have stored identifiers v only when this case applies. Once we have identified v, we only need to check if we can read Q[1..i] following a path to the left from v, which is exactly what we did in Sect. 3 using tries. The semi-repeat-free property guarantees that no node label can be a suffix of another node label (even inside the same block), and hence the leaves of the tries R(v) correspond to exactly one row each. The left-extensions are hence not branching, as the search always narrows down to one row (leaf), before continuing on the next trie (see Sect. 3).
Proof. Each node label ℓ(v) is added to D at most 3H^2 times, as H is an upper bound for the number of edges from and to v. The length of D is then bounded by O(NH^2). Construction of the suffix tree on D can be done in linear time [20]. In polynomial time, the nodes of the suffix tree can be preprocessed with perfect hash functions, such that following a downward path takes constant time per step.
We note that the index can be modified to report only matches that are (gap-oblivious) substrings of the MSA rows: Short patterns spanning only one edge are already such. Longer patterns can have only one occurrence in G, and it suffices to verify them with a regular string index on the MSA. Such a modified scheme makes the approach functionally equivalent to a wide range of indexes designed for repetitive collections [23,24,34,38,39,40,41] and shares the benefit of the alignment-based indexes of Na et al. [38,39,40,41] in reporting the aligned matches only once, whereas e.g. the r-index [24] needs to report all occurrences.
Using compressed suffix trees, different space-time tradeoffs can be achieved. In Section 9, we develop an alternative compressed indexing scheme for the repeat-free case using Wheeler graphs.

Construction of (semi-)repeat-free EFGs
Now that we have seen that (semi-)repeat-free EFGs are easy to index, it remains to consider their construction. First, we observe that the algorithms in Sect. 4 do not work verbatim: Theorem 4 is based on Eq. (2), but now this recurrence is no longer valid, as a left-extension of a valid block may not be a valid block. A counterexample is shown in Table 1. On the other hand, Algorithms 1 and 2 use the right-extension property of a valid block, and this holds even with general MSAs.
.j'] is (semi-)repeat-free for all j' such that f(j) < j' ≤ n.
Table 1: Semi-repeat-free segment and its extension (to the left) into a non-valid segment. Here the distinct strings X, Y, and Z, |X| = |Y|, do not appear elsewhere in MSA, except Z is a prefix of Y. The longer segment is non-valid, because X is not a prefix but a suffix of AX. Reversing the definition does not help, as the same segment contains AZ as prefix of AY.
Node labels correspond to the string spelled from the root to the node. We assume ACA, AGA, and GCA only appear in the region of the MSA visualized, while GC and A appear also elsewhere.
However, these algorithms assume we have precomputed for each j the smallest integer f (j) such that MSA[1..m, j +1..f (j)] is a (semi-)repeat-free segment. The earlier sliding windows preprocessing algorithm for values f (j) inherently assumes the values are monotonic (as is the case with gapless MSAs), but this does not hold in the general case: Consider again Table 1. Let j be the column just before the longer segment of MSA. Then f (j) > |X| + 1 and f (j + 1) = |X|.
In order to be able to use Algorithms 1 and 2, we derive a new preprocessing algorithm for values f (j) that does not assume monotonicity.

Preprocessing for the non-monotonic case
As an internal part of the algorithm we need an efficient data structure to maintain a dynamic set of non-overlapping intervals. Let I be a set of integer intervals, i.e.,
Proof. Consider a balanced binary search tree with leaves corresponding to intervals of I, sorted by values a_i for [a_i..b_i] ∈ I. Leaf i stores a_i as key value and b_i − a_i + 1 as span value. Each internal node v stores the maximum key and the sum of span values of the leaves in its subtree. Assuming the data structure has been maintained with these values, answering query span([a..b]) can be done as follows: Locate the O(log |I|) internal nodes that form a non-overlapping cover on the keys in the range [a..b] (by searching keys a and b and picking the subtrees bypassed and within the interval). Return the sum of span values stored in those nodes.
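The span query can be illustrated with a simpler stand-in for the balanced search tree. The sketch below swaps in a Fenwick tree over a fixed universe of positions (for example, suffix array positions), which supports the same insert, delete and span operations in logarithmic time in the universe size; the class name `IntervalSpans` and its interface are assumptions made here, not the paper's data structure.

```python
class IntervalSpans:
    """Dynamic set of non-overlapping intervals over positions 0..universe-1.

    span(a, b) returns the total length of stored intervals whose start
    lies in [a..b], mirroring the key/span values of the search tree.
    """

    def __init__(self, universe):
        self.n = universe
        self.tree = [0] * (universe + 1)  # Fenwick tree, 1-indexed internally

    def _add(self, pos, delta):
        i = pos + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def _prefix(self, pos):
        """Sum of span values of intervals starting in [0..pos]."""
        i = pos + 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def insert(self, a, b):
        self._add(a, b - a + 1)

    def delete(self, a, b):
        self._add(a, -(b - a + 1))

    def span(self, a, b):
        return self._prefix(b) - self._prefix(a - 1)
```

The caller is responsible for keeping the stored intervals non-overlapping, as the surrounding algorithm does; the structure itself does not check this.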
It remains to consider how the values can be maintained during insertions and deletions, and during the resulting rebalancing operations. On each such operation, there are O(log |I|) internal nodes affected on the upward path from the leaf to the root. It is sufficient to consider one such affected node v assuming its left child ℓ and right child r have already been updated accordingly. We set key(v) = key(r) and span(v) = span(ℓ) + span(r).
Proof. Figure 4 illustrates the algorithm to be described in the following. Consider the generalized suffix tree of {spell(MSA[i, 1..n]) | 1 ≤ i ≤ m}. For each j, locate the subset W of (implicit) suffix tree nodes corresponding to {spell(MSA[i, j + 1..n]) | 1 ≤ i ≤ m}; these are the colored nodes in Fig. 4. If the number of leaves covered by the subtrees rooted at W is greater than m, f(j) remains undefined.
Otherwise, we know that f(j) ≤ n, and our aim is to decrease the right boundary, starting with n, until we have reached column f(j). We do this one row at a time, recording values f_i(j) such that in the end f(j) = max_i f_i(j). We initialize f_i(j) = n for all i. Suffix tree nodes corresponding to rows whose values f_i(j) are final are stored in set F, initially empty. Suffix tree nodes corresponding to rows whose values f_i(j) are redundant (to be detailed later) are stored in set R, initially empty.
To start this process, a) pick an (implicit) suffix tree node v ∈ W corresponding to spell(MSA[i, j + 1..f_i(j)]) for some i. Let w be the parent of v. It corresponds to string spell(MSA[i, j + 1..j']) for some j'. Then b) consider replacing v with w in W. If the number of leaves covered by the subtrees rooted at W ∪ F ∪ R is still m, then this replacement is safe, and we can set f_i(j) = j'. Safe replacements are shown as black nodes in Fig. 4, while the gray nodes are unsafe replacements. We also move from W to R all w' ∈ W that are located in the subtree rooted at w (not including w), as these nodes are now redundant; we will consider later how to compute their values f_i(j). Otherwise, instead of replacement, move v from W to F, as we have found the minimum valid range with regard to row i: . This string is spelled by reading the path from the root to w and then reading one symbol on the edge (w, v). This string is thus the shortest string having the same occurrences as spell(MSA[i, j + 1..f_i(j)]), and we can safely assign as final value f_i(j) = j' + k. Repeat these steps a) and b) until W is empty. At that point, decreasing the right boundary is no longer possible on any row. However, we have only computed f_i(j) for i such that there is v ∈ F corresponding to spell(MSA[i, j + 1..f_i(j)]). We also need to compute values f_{i'}(j) for i' such that there is v' ∈ R corresponding to spell(MSA[i', j + 1..f_{i'}(j)]). Note that each v' ∈ R was made redundant by another node, which, in turn, may itself have been made redundant. We can store these relationships as a forest of trees.
The root of each tree corresponds to some v ∈ F and the rest of the nodes are from R. Now, consider a root v ∈ F of one of the trees, corresponding to spell(MSA[i, j + 1..f_i(j)]), and a node v' ∈ R of the same tree. We can assign f_{i'}(j) = j' for the smallest j' such that |spell(MSA[i', j + 1..j'])| = |spell(MSA[i, j + 1..f_i(j)])|. E.g. for row i = 2 in Fig. 4, we have f_2(j) = j + 3, as |spell(AC − A)| = 3 = |ACA|. Then we can set f(j) = max_i f_i(j).
To achieve the claimed running time, we use backward searching on the unidirectional BWT index [8] on the concatenation of strings {spell(MSA[i, 1..n]) | 1 ≤ i ≤ m} (with special markers added between) to find all the suffix array intervals corresponding to sets {spell(MSA[i, j + 1..n]) | 1 ≤ i ≤ m} for all j. This takes O(mn) time. To find the largest j for which the union of suffix array intervals is of size m, we can sort the intervals at each column and compute the size of the union by a simple scanning. This takes O(mn log m) time overall. Now we need to show that the process of reducing the right boundary for a fixed column can be done in O(m log m) time. Mapping from a suffix array interval to the (compressed) suffix tree node takes constant time [46]. Steps a) and b) are repeated at most m times at any column j: Either some row i gets completed, or at least one row becomes redundant. In both cases, size of W decreases at each step. The most time consuming part in this process is to compute the number of leaves in the union of subtrees. We can do this in O(m log m) time, by mapping the nodes back to suffix array intervals, and then computing the size of the union of intervals as above. However, we can only afford to do this at the first step of the process. For the rest of the steps we use Lemma 11: To be able to use the lemma, we need to ensure only non-overlapping intervals are stored in the data structure. Thus, at the first step we remove duplicates and intervals that are nested in another one in O(m log m) time, and store the remaining intervals to the structure of Lemma 11. While doing so, we move the suffix tree nodes corresponding to these removed intervals to the set R of redundant nodes. This is safe, as initially the union of the intervals is of size m (no extra occurrences in the intervals), and hence the steps a) and b) would anyway move the suffix tree nodes corresponding to those intervals to R at some point. 
Consider now step a) with w being parent of suffix tree node v. Let [a..b] be the suffix array interval corresponding to w. We can query span([a..b]) from the data structure, and if the answer is m, we remove the intervals in the query range, and insert [a..b] in their place.
It remains to consider how to find the first non-gap symbol MSA[i, j' + k] at row i after MSA[i, j'], and how to find the smallest j' such that |spell(MSA[i', j + 1..j'])| = x given x. These can be done in constant time after O(mn) time preprocessing for rank and select queries on bitvectors marking the locations of the gap symbols.
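Both queries reduce to rank and select over the non-gap positions of a row. The sketch below uses plain Python lists in place of succinct rank/select bitvectors, so it shows the logic only and not the space bounds; the function names and the 1-indexed column convention are assumptions made here.

```python
def preprocess_row(row, gap="-"):
    """Return (rank, select) tables for one MSA row.

    rank[c]   = number of non-gap symbols in columns 1..c (rank[0] = 0).
    select[k] = column of the k-th non-gap symbol (select[0] is a dummy 0).
    """
    rank = [0]
    select = [0]
    for col, ch in enumerate(row, start=1):
        if ch != gap:
            select.append(col)
        rank.append(len(select) - 1)
    return rank, select


def smallest_end(rank, select, j, x):
    """Smallest column j' such that columns j+1..j' spell exactly x symbols, or None."""
    if x < 1:
        return None
    target = rank[j] + x
    return select[target] if target < len(select) else None
```

Calling `smallest_end` with x = 1 gives the column of the first non-gap symbol after column j, which covers the first of the two queries.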

Prefix-free EFGs
While Algorithms 1 and 2 would work also for the prefix-free case, it appears difficult to modify the preprocessing for the same.
Instead of separate preprocessing and dynamic programming to compute the score of an optimal segmentation, we proceed directly with the main recurrence. We consider only minimizing the maximum block length score, as for this score we can derive a non-trivial parameterized solution.
Let $e(j)$ be the score of a minimum-scoring repeat-free segmentation $S_1, S_2, \ldots, S_b$ of prefix MSA[1..m, 1..j], where the score is defined as $\max_{i: 1 \le i \le b} L(S_i)$. Then
$$e(j) = \min_{j' :\ 0 \le j' < j,\ [j'+1..j] \text{ is a repeat-free segment}} \max(j - j',\, e(j')),$$
with $e(0)$ initialized to 0. When there is no valid segmentation for some $j$, $e(j) = j + 1$.
Recall that at any step of the preprocessing algorithm of Sect. 4.3, we had a non-nested set $I = \{[i_a..i'_a]\}_{a \in \{1,2,\ldots,m\}}$ of intervals. We exploited the non-nestedness in the use of bitvectors M (marking suffixes of the current column), B (BWT interval beginning), and E (BWT interval ending) to detect if I contains only BWT intervals of suffixes of the current column of the MSA. This gave us the linear time algorithm. Now that the intervals can become nested, these bitvectors no longer work as intended. Instead, we resort to a generic method to check nestedness and, when no nestedness is detected, to compute the size of the union of the distinct intervals in I. If no nestedness is detected and the size of the union is m, we know that the range under consideration is valid. This can be done in O(m log m) time, e.g., by sorting the interval endpoints and scanning them while maintaining the number of active intervals. If there is more than one active interval at any point, the range is not valid. Otherwise, the size of the union of intervals is just the sum of their lengths, and the range is valid if this sum equals m. This nestedness check and the computation of the union of intervals is repeated at each column.
Proof. The unidirectional BWT index can be constructed in O(mn) time and each left-extension takes constant time [8]. We can start comparing $\max(j - j', e(j'))$ from $j' = j - 1$, decreasing $j'$ by one at each step and maintaining $e(j)$ as the minimum value so far. Once the value $j - j'$ grows bigger than the current $e(j)$, we know that the value of $e(j)$ can no longer decrease. This means we can decrease $j'$ exactly $e(j)$ times. At each decrease of $j'$, we do m left-extensions (one for each row), and then check the validity by computing the union of search intervals in O(m log m) time. This gives the claimed running time. Traceback from $e(n)$ gives an optimal repeat-free segmentation.
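The scan described in the proof and the generic interval check can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `intervals_valid` and `min_max_block_segmentation` are hypothetical, and the predicate `is_valid(jp, j)`, which stands for "columns jp+1..j form a repeat-free segment", is assumed to be supplied by the caller (in the text it is realized with left-extensions on the BWT index). The sketch marks infeasible prefixes with infinity rather than with the sentinel j + 1.

```python
import math


def intervals_valid(intervals, m):
    """Generic check: the distinct intervals must be pairwise disjoint and
    their union must cover exactly m positions.
    intervals: inclusive (start, end) pairs, one per row (assumed layout)."""
    distinct = sorted(set(intervals))         # O(m log m): sort by start, then end
    total, prev_end = 0, None
    for start, end in distinct:
        if prev_end is not None and start <= prev_end:
            return False                      # two intervals active at once
        total += end - start + 1              # disjoint, so lengths just add up
        prev_end = end
    return total == m


def min_max_block_segmentation(n, is_valid):
    """e[j] = min over valid jp < j of max(j - jp, e[jp]).
    Returns (e[n], list of 1-based inclusive column ranges)."""
    e = [math.inf] * (n + 1)                  # inf marks "no valid segmentation"
    back = [None] * (n + 1)
    e[0] = 0
    for j in range(1, n + 1):
        for jp in range(j - 1, -1, -1):       # grow the last block leftwards
            if j - jp >= e[j]:
                break                         # longer last blocks cannot improve e[j]
            if is_valid(jp, j):
                cand = max(j - jp, e[jp])
                if cand < e[j]:
                    e[j], back[j] = cand, jp
    segments, j = [], n
    while j > 0 and back[j] is not None:      # traceback from e[n]
        segments.append((back[j] + 1, j))
        j = back[j]
    return e[n], segments[::-1]
```

With a toy predicate that accepts every segment of length at least 2, `min_max_block_segmentation(5, lambda jp, j: j - jp >= 2)` yields score 3 with segments `[(1, 3), (4, 5)]`.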

Connection to Wheeler graphs
Wheeler graphs, also known as Wheeler automata, are a class of labeled graphs that admit an efficient index for path queries [22]. We now give an alternative way to index repeat-free elastic block graphs by transforming the graph into an equivalent Wheeler automaton.
We view a block graph as a nondeterministic finite automaton (NFA) by adding a new initial state (the source node) with edges to the starts of the strings of the first block, and by expanding each string of each block into a path of states. To conform with automata notions, we define that the label of an edge is the label of the destination node.
We denote the repeat-free NFA with F. First we determinize it with the standard subset construction for the reachable subsets of states. The states of the DFA are subsets of states of the NFA such that there is an edge from subset $S_1$ to subset $S_2$ with label c iff $S_2$ is the set of states at the destinations of edges labeled with c from $S_1$. We only represent the subsets of states reachable from the subset containing only the initial state. We call the deterministic graph G. See Figures 5 and 6 for an example.
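The subset construction restricted to reachable subsets can be sketched as follows. This is an illustrative Python sketch: the function name `determinize` and the dictionary-based encoding of the NFA (`succ` for successors, `label` for node labels) are assumptions made for the example. Following the convention above, an edge carries the label of its destination node.

```python
from collections import deque


def determinize(succ, label, initial):
    """Subset construction over the subsets reachable from {initial}.
    succ:  dict NFA state -> iterable of successor states
    label: dict NFA state -> character (an edge carries its destination's label)
    Returns dict: DFA state (frozenset of NFA states) -> {character: DFA state}."""
    start = frozenset([initial])
    dfa = {}
    queue = deque([start])
    while queue:
        subset = queue.popleft()
        if subset in dfa:
            continue                          # already expanded
        targets = {}
        for u in subset:
            for w in succ.get(u, ()):
                # group the destinations by the character that leads to them
                targets.setdefault(label[w], set()).add(w)
        dfa[subset] = {c: frozenset(ws) for c, ws in targets.items()}
        queue.extend(dfa[subset].values())
    return dfa
```

For a first block with strings AC and AG (states 1, 2 and 3, 4 after the initial state 0), the two states labeled A are merged into the single DFA state {1, 3}.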
A DFA is indexable as a Wheeler graph if there exists an order < on the nodes such that if u < v, then every incoming path label to u is colexicographically smaller than every incoming path label to v (recall that the colexicographic order of strings is the lexicographic order of the reverses of the strings). The repeat-free property guarantees that the nodes at the ends of the blocks can be ordered among themselves by picking an arbitrary incoming path as the sorting key.
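As a small reminder of what colexicographic order means in practice, the following Python sketch sorts strings by their reverses. The helper names `colex_key` and `colex_sorted` are hypothetical and only serve the illustration.

```python
def colex_key(s):
    """Colexicographic order is lexicographic order on the reversed string."""
    return s[::-1]


def colex_sorted(strings):
    """Return the strings sorted colexicographically."""
    return sorted(strings, key=colex_key)
```

For example, `colex_sorted(["AC", "CA", "GA", "A"])` returns `["A", "CA", "GA", "AC"]`, since the reversed strings are ordered A, AC, AG, CA.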
To make sure that the rest of the nodes are sortable, we modify the graph so that the incoming paths to any node that is not at the end of a block do not branch backward before they reach the end of the previous block. This is done by turning each block into a set of disjoint trees, where the roots of the trees are the ending nodes of the previous block, in a way that preserves the language of the automaton. The roots may have multiple incoming edges from the leaves of the previous tree. See Figure 7 for an example. The formal definition of the transformation and the proof of sortability are in Sect. 9.1. We denote the transformed graph with $G'$ and obtain the following result:
Lemma 13. The number of nodes in $G'$ is at most O(NW), where W is the maximum number of strings in a block of F and N is the total number of nodes in F.
The Wheeler order < of the transformed graph can be found by running the XBWT sorting algorithm on a spanning tree of the graph, as shown by Alanko et al. [2]. Finally, we can find the minimum equivalent Wheeler graph by running the general Wheeler graph minimization algorithm of Alanko et al. [2].
With the input graph now converted into a Wheeler graph, one can deploy succinct data structures supporting fast pattern matching [22, Lemma 4], leading to the following result:
Corollary 3. A repeat-free founder/block graph G or a repeat-free elastic degenerate string can be indexed in O(NH) time into a Wheeler-graph-based data structure occupying O(NH log |Σ|) bits of space, where N is the total number of characters in the node labels of G, H is the height of G (maximum number of strings in a block of G), and Σ is the alphabet. Later, using the data structure, one can find out in O(|Q| log |Σ|) time if a given query string Q occurs in G.

Wheeler graph details and proofs
Suppose we have the NFA F corresponding to a repeat-free EFG, and let G be the determinization of F defined above. Denote with P(v) the set of path labels from the initial state to v. Since the graph is a DFA, all sets P(v) are disjoint. Denote with $P_{\min}(v)$ and $P_{\max}(v)$ the colexicographic minimum and maximum of P(v), respectively. We denote the colexicographic order relation with ≺.
We say that a node v is atomic if for all path labels α in the graph, we have α ∈ P(v) iff $P_{\min}(v) \preceq \alpha \preceq P_{\max}(v)$. A DFA is Wheeler if and only if all its nodes are atomic [2]. In this case, the Wheeler order < of nodes is defined so that v < u if α ≺ β for some strings α ∈ P(v) and β ∈ P(u) (this is well-defined when all nodes are atomic).
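On small acyclic examples, atomicity can be checked by brute force. The following Python sketch is only an illustration under assumptions: it takes "path labels in the graph" to mean labels of paths starting at the initial state, it enumerates all such paths (exponential in general), and the names `path_labels` and `all_atomic` as well as the dictionary encoding are hypothetical.

```python
def path_labels(succ, label, initial):
    """Map each node of an acyclic DFA to the set P(v) of labels of paths
    from the initial state (an edge carries its destination's label)."""
    labels = {}
    stack = [(initial, "")]
    while stack:
        v, s = stack.pop()
        labels.setdefault(v, set()).add(s)
        for w in succ.get(v, ()):
            stack.append((w, s + label[w]))
    return labels


def all_atomic(succ, label, initial):
    """True iff every P(v) is a contiguous range in the colexicographic
    order of all path labels from the initial state."""
    P = path_labels(succ, label, initial)
    entries = [(s[::-1], v) for v, strings in P.items() for s in strings]
    entries.sort(key=lambda entry: entry[0])  # colex = lexicographic on reverses
    seen, prev = set(), object()
    for _, v in entries:
        if v != prev:
            if v in seen:
                return False                  # P(v) is interrupted by another node
            seen.add(v)
            prev = v
    return True
```

In the test data, a node reached by both AG and CG is not atomic when a different node is reached by BG, because BG falls colexicographically between AG and CG. After that node is split into one copy per incoming path, all nodes are atomic.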
We will expand the graph so that all its nodes become atomic. To achieve this, we process the nodes in a topological order. If the current node v is at the end of a block, we do nothing. Otherwise, we apply the following transformation. Suppose the in-neighbors of v are $u_1, \ldots, u_k$ and the out-neighbors of v are $w_1, \ldots, w_l$. We delete v, add k new nodes $v_1, \ldots, v_k$ and add the sets of edges $\{(u_i, v_i) \mid 1 \le i \le k\}$ and $\{(v_i, w_j) \mid 1 \le i \le k \text{ and } 1 \le j \le l\}$. In other words, we distribute the in-edges of v and duplicate the out-edges. The automaton remains deterministic after this. Figure 7 shows an example of the expansion.
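The expansion step can be sketched in Python as follows. This is a hedged illustration, not the implementation from the repository: the function name `expand`, the dictionary encoding, and the naming of copies as pairs `(v, i)` are assumptions. The sketch also skips nodes with at most one in-neighbor, since splitting such a node would only rename it (and the initial state has no in-neighbors at all).

```python
from graphlib import TopologicalSorter


def expand(succ, label, block_ends):
    """Split every node that is not at the end of a block into one copy per
    in-neighbor, duplicating its out-edges.
    succ:  dict node -> set of out-neighbors (every node must be a key)
    label: dict node -> character
    Returns the expanded (succ, label); the inputs are not modified."""
    succ = {v: set(ws) for v, ws in succ.items()}
    label = dict(label)
    pred = {v: set() for v in succ}
    for v, ws in succ.items():
        for w in ws:
            pred[w].add(v)
    # TopologicalSorter expects a mapping node -> predecessors.
    order = list(TopologicalSorter(pred).static_order())
    for v in order:
        ins = pred[v]
        if v in block_ends or len(ins) <= 1:
            continue                          # nothing to distribute
        outs = succ.pop(v)
        del pred[v]
        ch = label.pop(v)
        for w in outs:
            pred[w].discard(v)
        for i, u in enumerate(sorted(ins, key=repr)):
            vi = (v, i)                       # hypothetical name of the i-th copy
            label[vi] = ch
            succ[u].discard(v)
            succ[u].add(vi)                   # distribute the in-edges
            pred[vi] = {u}
            succ[vi] = set(outs)              # duplicate the out-edges
            for w in outs:
                pred[w].add(vi)
    return succ, label
```

Because nodes are visited in topological order, the in-neighbors of a node are already split when the node is processed, so each copy ends up with exactly one incoming path from the end of the previous block.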
We denote the resulting graph after all transformations with $G'$. First we note that by construction, the language of $G'$ is the same as the language of G: any path from the initial state in G can be mapped to a corresponding path with the same label in $G'$ and vice versa. Next, we show that the graph is Wheeler-sortable:
Lemma 14. Every node v in $G'$ is atomic.
Proof. If v is in the first block of $G'$, then |P(v)| = 1, so v is trivially atomic. Suppose v is not in the first block. Then all strings in P(v) are of the form αβγ, with |α| ≥ 0, |β| > 0 and |γ| ≥ 0, such that β occurs in the NFA F only once (repeat-free property) and γ is a prefix of some string of a block. By the construction of $G'$, strings β and γ are the same for all strings in P(v). Consider a path label δ in $G'$ such that $P_{\min}(v) \preceq \delta \preceq P_{\max}(v)$. Then δ must be suffixed by βγ, so it follows that δ ∈ P(v), since (1) all occurrences of β lead to the same node u at the end of the previous block in $G'$ and (2) all paths from u with the label γ must lead to v, or otherwise $G'$ is not deterministic, a contradiction.
Next, we show that the transformation from F to $G'$ increases the number of nodes by at most a polynomial amount.
Lemma 13. The number of nodes in $G'$ is at most O(NW), where W is the maximum number of strings in a block of F and N is the total number of nodes in F.
Proof. Each non-source node v of $G'$ can be associated with a pair (u, α), where u is the node of $G'$ reached by walking from v back to the end of the previous block, and α is the label of the path from u to v. If v is in the first block, we set u to the added source node. If v is at the end of a block, we set u = v and set α to be the empty string. These pairs are all distinct because $G'$ is deterministic.
We bound the number of nodes in $G'$ by bounding the number of possible distinct pairs (u, α). Each end-of-block node v of $G'$ is paired only with prefixes of strings from the next block. Let end(b) be the set of nodes of F that are at the end of block b and let f(b) be the set of nodes at block b in F. Then the total number of possible pairs with nonempty α is at most $\sum_{b=0}^{B-1} |\mathrm{end}(b)| \cdot |f(b+1)|$, where B is the number of blocks, with block zero corresponding to the initial state. The number of pairs with empty α is

Implementation
We implemented construction and indexing of (semi-)repeat-free (elastic) founder graphs. The implementation is available at https://github.com/algbio/founderblockgraphs. Some proof-of-concept experiments can be found in the conference version of the paper [33].

Discussion
One characterization of our solution is that we compact those vertical repeats in the MSA that are not horizontal repeats. This can be seen as a positional extension of variable-order de Bruijn graphs. Also, our solution is parameter-free, unlike de Bruijn approaches, which always need some threshold k, even in the variable-order case.
The founder graph concept could also be generalized so that it is not directly induced from a segmentation. One could consider cyclic graphs having the same repeat-free property. This could be an interesting direction in defining parameter-free de Bruijn graphs. This paper only scratches the surface of a new family of pangenome representations. There are myriad options for optimizing among the valid segmentations [13,42]. We studied some of these here, but left open how to, e.g., minimize the maximum number of distinct strings in a segment (i.e., the height of the graph) [42], or how to control the overexpressiveness of the graph. For the former, our ongoing work gives an efficient solution [44].
Other open problems include strengthening the conditional indexing lower bound to cover non-elastic founder graphs, and improving the running time for constructing (semi-)repeat-free elastic founder graphs. For the latter, our ongoing work improves the preprocessing time to linear, and consequently shows that a semi-repeat-free segmentation maximizing the number of blocks can be computed in linear time [45].
We focused here on the theoretical aspects of indexable founder graphs. Our preliminary experiments [33] show that the approach works well in practice on multiple sequence alignments without gaps. In our future work, we will focus on making the approach practical also in the general case.