Linear time minimum segmentation enables scalable founder reconstruction
Abstract
Background
We study a preprocessing routine relevant in pangenomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pangenomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold L and a set \({\mathcal {R}} = \{R_1, \ldots , R_m\}\) of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1, n] into a set P of disjoint segments such that each segment \([a,b] \in P\) has length at least L and the number \(d(a,b)=\vert \{R_i[a,b] : 1\le i \le m\} \vert\) of distinct substrings at segment [a, b] is minimized over \([a,b] \in P\). The distinct substrings in the segments represent founder blocks that can be concatenated to form \(\max \{ d(a,b) : [a,b] \in P \}\) founder sequences representing the original \({\mathcal {R}}\) such that crossovers happen only at segment boundaries.
Results
We give an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier \(O(mn^2)\)-time solution.
Conclusions
Our improvement makes it possible to apply the formulation to inputs of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence of its practicality. The implementation is available at https://github.com/tsnorri/foundersequences.
Keywords
Pangenome indexing · Founder reconstruction · Dynamic programming · Positional Burrows–Wheeler transform · Range minimum query
Abbreviations
pBWT: positional Burrows–Wheeler transform
LCP: longest common prefix
RMQ: range maximum query
MIBP: maximum intersection between two partitions
Background
A key problem in pangenomics is to develop a sufficiently small, efficiently queryable, but still descriptive representation of the variation common to the subject under study [1]. For example, when studying the human population, one would like to take all publicly available variation datasets (e.g. [2, 3, 4]) into account. Many approaches encode the variation as a graph [5, 6, 7, 8, 9, 10] and then one can encode the different haplotypes as paths in this graph [11]. An alternative has been proposed [12] based on a compressed indexing scheme for a multiple alignment of all the haplotypes [13, 14, 15, 16, 17]. In either approach, scalability is hampered by the encoding of all the haplotypes.
We propose looking for a smaller set of representative haplotype sequences to make the above pangenomic representations scalable.
Finding such a set of representative haplotype sequences that retains the original contiguities as well as possible is known as the founder sequence reconstruction problem [18]. In this problem, one seeks a set of d founders such that the original m haplotypes can be mapped to the founders with a minimum number of crossovers. Here a crossover means a position where one needs to jump from one founder to another to continue matching the content of the haplotype in question. Unfortunately, this problem is \({\textsf {NP}}\)-hard even to approximate within a constant factor [19].
For founder reconstruction to be scalable to the pangenomic setting, one needs an algorithm whose running time is nearly linear in the input size. With this in mind, we study a relaxation of founder reconstruction that is known to be polynomial-time solvable: namely, when all the crossovers are required to happen at the same locations, one obtains a minimum segmentation problem specific to founder reconstruction [18]. A dynamic programming algorithm solves this problem in \(O(n^2m)\) time [18], where m is the number of haplotypes and n is the length of each of them.
In this paper, we improve the running time of solving the minimum segmentation problem of founder reconstruction to O(mn) (linear in the input size).
We also implement the new algorithm, as well as a further heuristic that aims to minimize crossovers over the segment boundaries (yielded by the optimal solution to the minimum segmentation problem). In our experiments, we show that the approach is practical on a human-genome-scale setting. Namely, we apply the implementation to a multiple alignment representing 5009 haplotypes of human chromosome 6, and the result is 130 founder sequences with an average distance between two crossovers of 9624 bases. Preserving such long contiguities in just 2.5% of the original input space is promising for the accuracy and scalability of the short-read alignment and variant calling motivating our study.
The main technique behind the improvement is the use of the positional Burrows–Wheeler transform (pBWT) [20], and more specifically its extension to larger alphabets [21]. While the original dynamic programming solution uses O(nm) time to look for the best preceding segment boundary for each column of the input, we observe that at most m values in the pBWT determine the segment boundaries at which the number of distinct founder substrings changes. Minima over the already computed dynamic programming values between consecutive such interesting segment boundaries give the requested result. However, it turns out that we can maintain the minima directly in the internal structures of the pBWT (with some modifications) and have to store only the last L computed dynamic programming values, thus spending only \(O(m + L)\) additional space, where L is the input threshold on the length of each segment. The segmentation is then reconstructed by a standard backtracking approach in O(n) time using an array of length n.
A preliminary version of this work appeared in WABI 2018 [22].
Methods
Notation and problem statement
For a string \(s = c_1 c_2 \cdots c_n\), denote by \(\vert s \vert\) its length n. We write s[i] for the letter \(c_i\) of s and s[i, j] for the substring \(c_i c_{i + 1} \ldots c_j\). An analogous notation is used for arrays. For any numbers i and j, the set of integers \(\{x \in {\mathbb {Z}} : i \le x \le j\}\) (possibly empty) is denoted by [i, j].
The input for our problem is the set \({\mathcal {R}} = \{R_1,\ldots ,R_m\}\) of strings of length n, called recombinants. A set \({\mathcal {F}} = \{F_1,\ldots ,F_d\}\) of strings of length n is called a founder set of \({\mathcal {R}}\) if for each string \(R_i \in {\mathcal {R}}\), there exists a partition \(P_i\) of the segment [1, n] into disjoint subsegments such that, for each \([a,b] \in P_i\), the string \(R_i[a,b]\) is equal to \(F_j[a,b]\) for some \(j \in [1,d]\). The partition \(P_i\) together with the mapping of the segments \([a, b] \in P_i\) to substrings \(F_j[a, b]\) is called a parse of \(R_i\) in terms of \({\mathcal {F}}\), and a set of parses for all \(R_i \in {\mathcal {R}}\) is called a parse of \({\mathcal {R}}\) in terms of \({\mathcal {F}}\). The integers a and \(b + 1\), for \([a,b] \in P_i\), are called crossover points; thus, in particular, 1 and \(n + 1\) are always crossover points.
It follows from the definition that, in practice, it makes sense to consider founder sets only for prealigned recombinants. Throughout the paper we implicitly assume that this is the case; although all our algorithms clearly work in the unaligned setting too, the produced results would hardly make any sense.
We consider the problem of finding a “good” founder set \({\mathcal {F}}\) and a “good” corresponding parse of \({\mathcal {R}}\) according to a reasonable measure of goodness. Ukkonen [18] pointed out that such measures may contradict each other: for instance, a minimum founder set obviously has size \(d = \max _{j\in [1,n]} \vert \{R_1[j], \ldots , R_m[j]\} \vert\), but parses corresponding to such a set may have unnaturally many crossover points; conversely, \({\mathcal {R}}\) is a founder set of itself and the only crossover points of its trivial parse are 1 and \(n + 1\), but the size m of this founder set is in most cases unacceptably large. Following Ukkonen’s approach, we consider compromise parameterized solutions. The minimum founder set problem is, given a bound L and a set of recombinants \({\mathcal {R}}\), to find a smallest founder set \({\mathcal {F}}\) of \({\mathcal {R}}\) such that there exists a parse of \({\mathcal {R}}\) in terms of \({\mathcal {F}}\) in which the distance between any two crossover points is at least L (the crossover points may belong to parses of different recombinants, i.e., for \([a, b] \in P_i\) and \([a', b'] \in P_j\), where \(P_i\) and \(P_j\) are parses of \(R_i\) and \(R_j\), we have either \(a = a'\) or \(\vert a - a' \vert \ge L\)).
The minimum founder set problem and the minimum segmentation problem are connected: any segmentation S with segments of length at least L induces in an obvious way a founder set of size \(\max \{ \vert {\mathcal {R}}[j,k] \vert : {\mathcal {R}}[j,k] \in S \}\) and a parse in which all crossover points are located at segment boundaries (and, hence, at distance at least L from each other); conversely, if \({\mathcal {F}}\) is a founder set of \({\mathcal {R}}\) and \(\{j_1, \ldots , j_p\}\) is the sorted set of all crossover points in a parse of \({\mathcal {R}}\) such that \(j_q - j_{q-1} \ge L\) for \(q \in [2,p]\), then \(S = \{{\mathcal {R}}[j_{q-1},j_q{-}1] : q \in [2,p]\}\) is a segmentation of \({\mathcal {R}}\) with segments of length at least L and \(\max \{ \vert {\mathcal {R}}[j,k] \vert : {\mathcal {R}}[j,k] \in S \} \le \vert {\mathcal {F}} \vert\).
Our main result is an algorithm that solves the minimum segmentation problem in O(mn) time (linear in the input size). The solution normally does not uniquely define a founder set of \({\mathcal {R}}\): for instance, if the built segmentation of \({\mathcal {R}} = \{baaaa, baaab, babab\}\) is \(S = \{{\mathcal {R}}[1,1], {\mathcal {R}}[2,3], {\mathcal {R}}[4,5]\}\), then the possible founder sets induced by S are \({\mathcal {F}}_1 = \{baaab, babaa\}\) and \({\mathcal {F}}_2 = \{baaaa, babab\}\). In other words, to construct a founder set, one concatenates fragments of recombinants corresponding to the found segments in a certain order. We return to this ordering problem in the section describing experiments and now focus on the details of the segmentation problem.
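The founder-set size induced by a segmentation can be checked mechanically. The following sketch (the function name is illustrative) computes \(\max \{ \vert {\mathcal {R}}[a,b] \vert : [a,b] \in S \}\) for the toy instance just given:

```python
def founder_count(recombinants, segmentation):
    # Size of the founder set induced by a segmentation: the maximum
    # number of distinct substrings over all segments (1-based, inclusive).
    return max(len({r[a - 1:b] for r in recombinants})
               for a, b in segmentation)

R = ["baaaa", "baaab", "babab"]
S = [(1, 1), (2, 3), (4, 5)]
print(founder_count(R, S))   # 2: both founder sets in the example have two members
```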
Hereafter, we assume that the input alphabet \(\Sigma\) is the set \([0,\vert \Sigma \vert {-}1]\) of size O(m), which is a natural assumption considering that the typical alphabet size is 4 in our problem. It is sometimes convenient to view the set \({\mathcal {R}} = \{R_1, \ldots , R_m\}\) as a matrix with m rows and n columns. We say that an algorithm processing the recombinants \({\mathcal {R}}\) is streaming if it reads the input from left to right “columnwise”, for each k from 1 to n, and outputs an answer for each set of recombinants \(\{R_1[1,k], \ldots , R_m[1,k]\}\) immediately after reading the “column” \(\{R_1[k], \ldots , R_m[k]\}\). The main result of the paper is the following theorem.
Theorem 1
Given a bound L and recombinants \({\mathcal {R}} = \{R_1, \ldots , R_m\}\), each having length n, there is an algorithm that computes (1) in a streaming fashion in O(mn) time and \(O(m + L)\) space. Using an additional array of length n, one can also find in O(n) time a segmentation on which (1) is attained, thus solving the minimum segmentation problem.
Minimum segmentation problem
 i.
the sorted sequence \(j_{k,1}, \ldots , j_{k,r_k}\)
 ii.
the numbers \(\vert {\mathcal {R}}[j_{k,h+1},k] \vert\), for \(h \in [0,r_k]\)
 iii.
the values \(\min \{ M(j) :j_{k,h} \le j < j_{k,h+1} \}\), for \(h \in [0,r_k].\)
To reconstruct a segmentation corresponding to the found solution M(n), we build along with the values M(k) an array of size n whose kth element, for each \(k \in [1,n]\), stores 0 if \(M(k) = \vert {\mathcal {R}}[1,k] \vert\), and stores a number \(j \in [1,k{-}L]\) such that \(M(k) = \max \{M(j), \vert {\mathcal {R}}[j{+}1,k] \vert \}\) otherwise; then, the segmentation can be reconstructed from the array in an obvious way in O(n) time. In order to maintain the array, our algorithm computes, for each \(k \in [1,n]\), along with the values \(\min \{ M(j) : j_{k,h} \le j < j_{k,h+1} \}\), for \(h \in [0,r_k]\), positions j on which these minima are attained (see below). Further details are straightforward and, hence, omitted.
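The backtracking step can be sketched as follows (the array name `back` and its values are hypothetical; `back[0]` is a dummy so indices match the 1-based text):

```python
def reconstruct(back):
    # back[k] == 0 means [1, k] is a single segment; otherwise back[k] == j
    # means the last segment of an optimal segmentation of [1, k] is [j+1, k].
    segments = []
    k = len(back) - 1          # k == n
    while k > 0:
        j = back[k]
        segments.append((j + 1, k))
        k = j
    segments.reverse()
    return segments

print(reconstruct([0, 0, 0, 0, 2, 2]))   # [(1, 2), (3, 5)]
```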
Positional Burrows–Wheeler transform
 1.
\(a_k[1,m]\) is a permutation of [1, m] such that \(R'_{a_k[1],k} \le \cdots \le R'_{a_k[m],k}\) lexicographically;
 2.
\(d_k[i]\), for \(i \in [1,m]\), is an integer such that \(R_{a_k[i]}[d_k[i], k]\) is the longest common suffix of \(R_{a_k[i]}[1,k]\) and \(R_{a_k[i-1]}[1,k]\), and \(d_k[i] = k + 1\) if either this suffix is empty or \(i = 1\).
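For reference, both arrays can be computed directly from the definition in quadratic time. This naive sketch (names hypothetical) sorts the reversed prefixes \(R'_{i,k}\) and derives \(d_k\) from adjacent longest common prefixes:

```python
def pbwt_arrays_naive(R, k):
    # a_k and d_k from the definition: sort the length-k prefixes read
    # right to left (the strings R'_{i,k}); a is 0-based, d holds 1-based
    # positions as in the text (d[0] plays the role of d_k[1] = k + 1).
    m = len(R)
    rev = [R[i][:k][::-1] for i in range(m)]       # R'_{i,k}
    a = sorted(range(m), key=lambda i: rev[i])     # stable sort keeps ties in order
    d = [k + 1] * m
    for x in range(1, m):
        s, t = rev[a[x - 1]], rev[a[x]]
        lcp = 0                 # LCP of reversed prefixes = common suffix length
        while lcp < k and s[lcp] == t[lcp]:
            lcp += 1
        d[x] = k + 1 - lcp
    return a, d

print(pbwt_arrays_naive(["ab", "aa", "ba"], 2))   # ([1, 2, 0], [3, 2, 3])
```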
Example 1
Lemma 2
The arrays \(a_k[1,m]\) and \(d_k[1,m]\) can be computed from \(a_{k-1}[1,m]\) and \(d_{k-1}[1,m]\) in O(m) time, assuming the input alphabet is \([0, \vert \Sigma \vert {-}1]\) with \(\vert \Sigma \vert = O(m)\).
Proof
Given \(a_{k-1}\) and \(d_{k-1}\), we are to show that Algorithm 1 correctly computes \(a_k\) and \(d_k\). Since, for any \(i, j \in [1,m]\), we have \(R'_{i,k} \le R'_{j,k}\) iff either \(R_i[k] < R_j[k]\), or \(R_i[k] = R_j[k]\) and \(R'_{i,k-1} \le R'_{j,k-1}\) lexicographically, it is easy to see that the array \(a_k\) can be deduced from \(a_{k-1}\) by radix sorting the sequence of pairs \(\{ (R_{a_{k-1}[i]}[k], R'_{a_{k-1}[i],k-1}) \}_{i=1}^m\). Further, since, by definition of \(a_{k-1}\), the second components of the pairs are already in sorted order, it remains to sort the first components by counting sort. Accordingly, in Algorithm 1, the first loop counts occurrences of letters in the sequence \(\{R_i[k]\}_{i=1}^m\) using an auxiliary array \(C[0,\vert \Sigma \vert ]\); as is standard in counting sort, the second loop modifies the array C so that, for each letter \(b \in [0,\vert \Sigma \vert {-}1]\), \(C[b] + 1\) is the first index of the “bucket” that will contain all \(a_{k-1}[i]\) such that \(R_{a_{k-1}[i]}[k] = b\); finally, the third loop fills the buckets incrementing the indices \(C[b] \leftarrow C[b] + 1\), for \(b = R_{a_{k-1}[i]}[k]\), and performing the assignments \(a_k[C[b]] \leftarrow a_{k-1}[i]\), for \(i = 1,\ldots ,m\). Thus, the array \(a_k\) is computed correctly. All is done in \(O(m + \vert \Sigma \vert )\) time, which is O(m) since the input alphabet is \([0,\vert \Sigma \vert {-}1]\) and \(\vert \Sigma \vert = O(m)\).
The last three lines of the algorithm are responsible for computing \(d_k\). Denote the length of the longest common prefix of any strings \(s_1\) and \(s_2\) by \(\mathtt {LCP} (s_1, s_2)\). The computation of \(d_k\) relies on the following well-known fact: given a sequence of strings \(s_1, \ldots , s_r\) such that \(s_1 \le \cdots \le s_r\) lexicographically, one has \(\mathtt {LCP} (s_1, s_r) = \min \{\mathtt {LCP} (s_{i-1}, s_{i}) : 1 < i \le r\}\). Suppose that the last loop of the algorithm, which iterates through all i from 1 to m, assigns \(a_k[i'] \leftarrow a_{k-1}[i]\), for a given \(i \in [1,m]\) and some \(i' = C[b]\). Let j be the maximum integer such that \(j < i\) and \(R_{a_{k-1}[j]}[k] = R_{a_{k-1}[i]}[k]\) (if any). The definition of \(a_k\) implies that \(a_k[i' - 1] = a_{k-1}[j]\) if such j exists. Hence, \(\mathtt {LCP} (R'_{a_k[i' - 1],k}, R'_{a_k[i'],k}) = 1 + \min \{\mathtt {LCP} (R'_{a_{k-1}[\ell - 1], k-1}, R'_{a_{k-1}[\ell ], k-1}) : j{<}\ell {\le }i\}\) if such number j does exist, and \(\mathtt {LCP} (R'_{a_k[i' - 1],k}, R'_{a_k[i'],k}) = 0\) otherwise. Therefore, since \(d_k[i']\) equals \(k + 1 - \mathtt {LCP} (R'_{a_k[i'],k}, R'_{a_k[i'-1],k})\), we have either \(d_k[i'] = \max \{d_{k-1}[\ell ] : j < \ell \le i\}\) or \(d_k[i'] = k + 1\) according to whether the required j exists. To find j, we simply maintain an auxiliary array \(P[0, \vert \Sigma \vert {-}1]\) such that on the ith loop iteration, for any letter \(b \in [0,\vert \Sigma \vert {-}1]\), P[b] stores the position of the last seen b in the sequence \(R_{a_{k-1}[1]}[k], R_{a_{k-1}[2]}[k], \ldots , R_{a_{k-1}[i-1]}[k]\), or \(P[b] = 0\) if b occurs for the first time. Thus, \(d_k\) is computed correctly.
In order to calculate the maximums \(\max \{d_{k-1}[\ell ] : P[b] < \ell \le i\}\) in O(1) time, we build a range maximum query (RMQ) data structure on the array \(d_{k-1}[1,m]\) in O(m) time [23]. Therefore, the running time of Algorithm 1 is O(m). \(\square\)
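The update described in this proof can be sketched as follows, with letters assumed to be integers in \([0, \vert \Sigma \vert {-}1]\) and with the range maxima over \(d_{k-1}\) computed by direct scanning in place of the RMQ structure (so the sketch loses the worst-case O(m) bound; names are illustrative):

```python
def pbwt_step(R, k, a_prev, d_prev, sigma):
    # One column update: compute a_k, d_k from a_{k-1}, d_{k-1} by counting
    # sort on column k (1-based).  R is a list of equal-length integer lists
    # over the alphabet [0, sigma-1].
    m = len(R)
    col = [R[i][k - 1] for i in range(m)]
    C = [0] * (sigma + 1)
    for c in col:                       # count letter occurrences
        C[c + 1] += 1
    for b in range(sigma):              # prefix sums: bucket start positions
        C[b + 1] += C[b]
    a = [0] * m
    d = [0] * m
    P = [-1] * sigma                    # last i with letter b, in a_prev order
    for i in range(m):
        b = col[a_prev[i]]
        pos = C[b]
        C[b] += 1
        a[pos] = a_prev[i]
        # d_k[pos] = k + 1 on the first occurrence of b; otherwise the
        # maximum of d_{k-1} over positions strictly after P[b], up to i
        d[pos] = k + 1 if P[b] < 0 else max(d_prev[P[b] + 1:i + 1])
        P[b] = i
    return a, d

# three recombinants over {0, 1}, processed column by column
R = [[0, 1], [0, 0], [1, 0]]
a, d = list(range(3)), [1] * 3          # a_0 = identity, d_0[i] = 1
for k in (1, 2):
    a, d = pbwt_step(R, k, a, d, 2)
print(a, d)                             # [1, 2, 0] [3, 2, 3]
```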
In practice the bottleneck of the algorithm is the RMQ data structure which, although it answers queries in O(1) time, has a noticeable constant factor hidden in the big-O of the construction time. We could naively compute the maximums by scanning the ranges \(d_{k-1}[P[b]{+}1,i]\) from left to right, but such an algorithm works in quadratic time since the same ranges of \(d_{k-1}\) might be processed many times in the worst case. Our key idea is to store the work done by a simple scanning algorithm so as to reuse it in future queries. We store this information right in the arrays \(a_{k-1}\) and \(d_{k-1}\), rewriting them; in particular, since \(a_{k-1}\) is accessed sequentially from left to right in the last loop, the range \(a_{k-1}[1,i]\) is free to use after the ith iteration.
More precisely, after the ith iteration of the last loop, the subarrays \(a_{k-1}[1,i]\) and \(d_{k-1}[1,i]\) are modified so that the following invariant holds: for any \(j \in [1,i]\), \(j < a_{k-1}[j] \le i + 1\) and \(d_{k-1}[j] = \max \{d'_{k-1}[\ell ] : j \le \ell < a_{k-1}[j]\}\), where \(d'_{k-1}\) denotes the original array \(d_{k-1}\) before modifications; note that the invariant holds if one simply puts \(a_{k-1}[j] = j + 1\) without altering \(d_{k-1}[j]\). Then, to compute \(\max \{d'_{k-1}[\ell ] : j\le \ell \le i\}\), we do not have to scan all elements but can “jump” through the chain \(j, a_{k-1}[j], a_{k-1}[a_{k-1}[j]], \ldots , i\) and use the maximums precomputed in \(d_{k-1}[j], d_{k-1}[a_{k-1}[j]], d_{k-1}[a_{k-1}[a_{k-1}[j]]], \ldots , d_{k-1}[i]\); after this, we redirect the “jump pointers” in \(a_{k-1}\) to \(i + 1\) and update the maximums in \(d_{k-1}\) accordingly. This idea is implemented in Algorithm 2. Notice the new line \(a_{k-1}[i] \leftarrow i + 1\) in the main loop (it is commented), which erases \(a_{k-1}[i]\) and makes it a part of the “jump table”. The correctness of the algorithm is clear. But it is not even immediate that the algorithm works in \(O(m\log m)\) time. The next lemma states that the bound is actually even better, \(O(m\log \vert \Sigma \vert )\).
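The jump-pointer bookkeeping can be sketched in isolation (0-based indices; the invariant is seeded with \(a[j] = j + 1\), and the scaling refinement analyzed in the next lemma is omitted; `maxd` is the query name used in the proofs):

```python
def maxd(d, a, lo, hi):
    # Maximum of the *original* d over [lo, hi] (0-based, inclusive).
    # a holds jump pointers with the invariant: d[j] is the maximum of the
    # original array over [j, a[j]-1].  After a query every touched pointer
    # is redirected past hi, so later queries skip the scanned stretch.
    best = float('-inf')
    path = []
    j = lo
    while j <= hi:                 # follow the chain j, a[j], a[a[j]], ...
        path.append(j)
        best = max(best, d[j])
        j = a[j]
    suffix = float('-inf')
    for j in reversed(path):       # restore the invariant for the range [j, hi]
        suffix = max(suffix, d[j])
        d[j] = suffix
        a[j] = hi + 1
    return best

d = [3, 1, 4, 1, 5]
a = [j + 1 for j in range(5)]      # trivial pointers: d[j] covers [j, j]
print(maxd(d, a, 1, 3))            # 4, scanning positions 1..3
print(maxd(d, a, 0, 3))            # 4 again, now touching only positions 0 and 1
```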
Lemma 3
Algorithm 2 computes the arrays \(a_k[1,m]\) and \(d_k[1,m]\) from \(a_{k-1}[1,m]\) and \(d_{k-1}[1,m]\) in \(O(m\log \vert \Sigma \vert )\) time, assuming the input alphabet is \([0, \vert \Sigma \vert {-}1]\) with \(\vert \Sigma \vert = O(m)\).
Proof
Fix \(i \in [1,m]\). The ith iteration of the last loop in the algorithm computes the maximum in a range \(d'_{k-1}[i',i]\), where \(d'_{k-1}\) is the original array \(d_{k-1}\) before modifications and \(i' = P[b] + 1\) for some b and P. Let \(\ell _i = i - i'\). Denote \({\tilde{\ell }} = \frac{1}{m} \sum _{i=1}^m \ell _i\), the “average query length”. We are to prove that the running time of the algorithm is \(O(m\log {\tilde{\ell }})\), which implies the result since \(m{\tilde{\ell }} = \sum _{i=1}^m \ell _i\) and \(\sum _{i=1}^m \ell _i \le \vert \Sigma \vert m\). The latter inequality follows from the fact that the query ranges corresponding to the same symbol are non-overlapping.
For position j, denote by p(j) the number of non-scaling touches of j. We are to prove that \(P = \sum _{j=1}^m p(j) \le 2 m \log {\tilde{\ell }}\). Let \(q_h(j)\) denote the value of \(a_{k-1}[j] - j\) in the hth non-scaling touch of j, for \(h \in [1,p(j)]\). Suppose that this hth touch happens during the processing of a query \(\mathsf {maxd}(i - \ell _i, i)\). By the definition, \(j + q_h(j)\) follows j in the sequence of touched positions. Since the touch of j is non-scaling, we have \(i - j > i - a_{k-1}[j] = i - j - q_h(j) > 2^q\), where q is the largest integer such that \(i - j > 2^q\). Since \(i - j \le 2^{q+1}\), there holds \(q_h(j) < 2^q\). Since \(\mathsf {maxd}(i - \ell _i, i)\) assigns \(a_{k-1}[j] \leftarrow i + 1\), we have \(a_{k-1}[j] - j > i - j > 2^q\) after the query. In other words, we had \(a_{k-1}[j] - j = q_h(j) < 2^q\) before the query and have \(a_{k-1}[j] - j > 2^q\) after. This immediately implies that \(q_h(j) \ge 2^{h-1}\), for \(h \in [1,p(j)]\), and, therefore, every position can be touched in the non-scaling way at most \(O(\log m)\) times, implying \(P = O(m\log m)\). But we can deduce a stronger bound. Since the sum of all values \(a_{k-1}[j] - j\) for all positions j touched in a query \(\mathsf {maxd}(i - \ell _i, i)\) is equal to \(\ell _i\), we can bound the total sum of values \(q_h(j)\) by \(\sum _{j = 1}^m \sum _{h = 1}^{p(j)} q_h(j) \le \sum _{i=1}^m \ell _i = m{\tilde{\ell }}\). On the other hand, we have \(\sum _{j = 1}^m \sum _{h = 1}^{p(j)} q_h(j) \ge \sum _{j = 1}^m \sum _{h = 1}^{p(j)} 2^{h-1} = \sum _{j=1}^m 2^{p(j)} - m\). The well-known property of the convexity of the exponent is that the sum \(\sum _{j=1}^m 2^{p(j)}\) is minimized whenever all p(j) are equal, i.e., \(\sum _{j=1}^m 2^{p(j)} \ge \sum _{j=1}^m 2^{P / m}\).
Hence, once \(P > 2 m \log {\tilde{\ell }}\), we obtain \(\sum _{j = 1}^m \sum _{h = 1}^{p(j)} q_h(j) \ge \sum _{j = 1}^m 2^{P / m} - m > m{\tilde{\ell }}^2 - m\), which is larger than \(m{\tilde{\ell }}\) for \({\tilde{\ell }} \ge 2\) (for the case \({\tilde{\ell }} < 2\) the claim follows directly), contradicting \(\sum _{j = 1}^m \sum _{h = 1}^{p(j)} q_h(j) \le m{\tilde{\ell }}\). Thus, \(P = \sum _{j=1}^m p(j) \le 2 m \log {\tilde{\ell }}\).
It remains to consider scaling touches. The definition implies that each query \(\mathsf {maxd}(i{-}\ell _i, i)\) performs at most \(\log \ell _i\) scaling touches. Thus, it suffices to upper-bound \(\sum _{i=1}^m \log \ell _i\). Since the function \(\log\) is concave, the sum \(\sum _{i=1}^m \log \ell _i\) is maximized whenever all \(\ell _i\) are equal, i.e., \(\sum _{i=1}^m \log \ell _i \le \sum _{i=1}^m \log (\frac{1}{m} \sum _{j=1}^m \ell _j) = m\log {\tilde{\ell }}\), hence the result follows. \(\square\)
Modification of the pBWT
We are to modify the basic pBWT construction algorithm in order to compute the sequence \(j_{k,1}, \ldots , j_{k,r_k}\) of all positions \(j \in [1,k{-}L]\) in which \(\vert {\mathcal {R}}[j,k] \vert \ne \vert {\mathcal {R}}[j + 1,k] \vert\), and to calculate the numbers \(\vert {\mathcal {R}}[j_{k,h+1},k] \vert\) and \(\min \{ M(j) : j_{k,h} \le j < j_{k,h+1} \}\), for \(h \in [0,r_k]\) (assuming \(j_{k,0} = 0\) and \(j_{k,r_k+1} = k - L + 1\)); see the beginning of the section. As it follows from (3), these numbers are sufficient to calculate M(k), as defined in (2) and (3), in O(m) time. The following lemma reveals relations between the sequence \(j_{k,1}, \ldots , j_{k,r_k}\) and the array \(d_k\).
Lemma 4
Consider recombinants \({\mathcal {R}} = \{R_1, \ldots , R_m\}\), each having length n. For \(k \in [1,n]\) and \(j \in [1,k - 1]\), one has \(\vert {\mathcal {R}}[j,k] \vert \ne \vert {\mathcal {R}}[j + 1,k] \vert\) iff \(j = d_k[i] - 1\) for some \(i \in [1,m]\).
Proof
Suppose that \(\vert {\mathcal {R}}[j,k] \vert \ne \vert {\mathcal {R}}[j + 1,k] \vert\). It is easy to see that \(\vert {\mathcal {R}}[j,k] \vert > \vert {\mathcal {R}}[j + 1,k] \vert\), which implies that there are two indices h and \(h'\) such that \(R_h[j + 1, k] = R_{h'}[j + 1, k]\) and \(R_h[j] \ne R_{h'}[j]\). Denote by \(a_k^{-1}[h]\) the number x such that \(a_k[x] = h\). Without loss of generality, assume that \(a_k^{-1}[h] < a_k^{-1}[h']\). Then, there exists \(i \in [a_k^{-1}[h] + 1, a_k^{-1}[h']]\) such that \(R_{a_k[i - 1]}[j + 1, k] = R_{a_k[i]}[j + 1, k]\) and \(R_{a_k[i - 1]}[j] \ne R_{a_k[i]}[j]\). Hence, \(d_k[i] = j + 1\).
Suppose now that \(j \in [1, k - 1]\) and \(j = d_k[i] - 1\), for some \(i \in [1,m]\). Since \(j < k\) and \(d_k[1] = k + 1\), we have \(i > 1\). Then, by definition of \(d_k\), \(R_{a_k[i-1]}[j + 1, k] = R_{a_k[i]}[j + 1, k]\) and \(R_{a_k[i-1]}[j] \ne R_{a_k[i]}[j]\), i.e., \(R_{a_k[i]}[j + 1, k]\) can be “extended” to the left in two different ways, thus producing two distinct strings in the set \({\mathcal {R}}[j, k]\). Therefore, \(\vert {\mathcal {R}}[j, k] \vert > \vert {\mathcal {R}}[j + 1, k] \vert\). \(\square\)
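Lemma 4 can be checked by brute force on a toy input (helper names are illustrative; \(d_k\) is computed naively from its definition):

```python
def distinct(R, j, k):
    # |R[j, k]|: distinct substrings over columns j..k (1-based, inclusive)
    return len({r[j - 1:k] for r in R})

def d_array(R, k):
    # d_k from the definition: sorted reversed prefixes, adjacent LCPs
    rev = sorted(r[:k][::-1] for r in R)
    d = [k + 1]
    for s, t in zip(rev, rev[1:]):
        lcp = 0
        while lcp < k and s[lcp] == t[lcp]:
            lcp += 1
        d.append(k + 1 - lcp)
    return d

R = ["baaaa", "baaab", "babab"]   # toy input from the segmentation example
for k in range(1, 6):
    lhs = {j for j in range(1, k) if distinct(R, j, k) != distinct(R, j + 1, k)}
    rhs = {v - 1 for v in d_array(R, k) if 1 <= v - 1 <= k - 1}
    assert lhs == rhs             # the two characterizations coincide
```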

- \(s_k[1,r]\) contains all distinct elements from \(d_k[1,m]\) in increasing sorted order;
- \(e_k[1,m]\): for \(j \in [1,m]\), \(e_k[j]\) is equal to the unique index such that \(s_k[e_k[j]] = d_k[j]\);
- \(t_k[1,r]\): for \(j \in [1,r]\), \(t_k[j]\) is equal to the number of times \(s_k[j]\) occurs in \(d_k[1,m]\);
- \(u_k[1,r]\): for \(j \in [1,r]\), \(u_k[j] = \min \{M'(\ell ) : s_k[j{-}1]{-}1 \le \ell < s_k[j]{-}1\}\), assuming \(s_k[0] = 1\).
Example 2
In Example 1, where \(m = 6\), \(k = 7\), and \(\Sigma = \{a,c,t\}\), we have \(r = 4\), \(s_k = [3, 5, 7, 8]\), \(t_k = [2, 1, 1, 2]\), \(e_k = [4, 4, 2, 1, 3, 1]\). It is easy to see that the array \(s_k\) marks positions of the branching nodes in the trie from Fig. 1 in the increasing order (in the special case \(s_k[1] = 1\), \(s_k[1]\) does not mark any such node). Suppose that \(L = 3\), so that \(k - L = 4\). Then, \(u_k[1] = M(1)\), \(u_k[2] = \min \{M(2), M(3)\}\), \(u_k[3] = \min \{M(4), M'(5)\} = M(4)\) since \(M'(5) = +\infty\), and \(u_k[4] = M'(6) = +\infty\). The use of \(u_k\) is discussed in the sequel.
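These relations can be replayed mechanically: \(d_7\) itself is not repeated above, but it is recoverable from the identity \(d_k[i] = s_k[e_k[i]]\), giving \(d_7 = [8, 8, 5, 3, 7, 3]\); the arrays \(s_k\), \(e_k\), \(t_k\) then follow:

```python
d = [8, 8, 5, 3, 7, 3]                        # d_7, via d_k[i] = s_k[e_k[i]]
s = sorted(set(d))                            # s_k: distinct values, increasing
index = {v: i + 1 for i, v in enumerate(s)}   # 1-based position of each value in s
e = [index[v] for v in d]                     # e_k
t = [d.count(v) for v in s]                   # t_k
assert s == [3, 5, 7, 8] and e == [4, 4, 2, 1, 3, 1] and t == [2, 1, 1, 2]
```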
Lemma 5
Consider a set of recombinants \({\mathcal {R}} = \{R_1, \ldots , R_m\}\), each of which has length n. For \(k \in [1,n]\) and \(j \in [1,\vert s_k \vert ]\), one has \(\vert {\mathcal {R}}[s_k[j] - 1, k] \vert = t_k[j] + t_k[j + 1] + \cdots + t_k[\vert t_k \vert ]\).
Proof
Denote \(\ell = k - s_k[j] + 1\), so that \({\mathcal {R}}[s_k[j] - 1, k] = {\mathcal {R}}[k - \ell , k]\). Suppose that \(\ell = 0\). Note that \(R_{a_k[1]}[k] \le \cdots \le R_{a_k[m]}[k]\). Since \(d_k[i] = k + 1\) iff either \(i = 1\) or \(R_{a_k[i-1]}[k] \ne R_{a_k[i]}[k]\), it is easy to see that \(\vert {\mathcal {R}}[k,k] \vert\), the number of distinct letters \(R_i[k]\), is equal to the number of times \(k + 1 = s_k[\vert s_k \vert ]\) occurs in \(d_k\), i.e., \(t_k[\vert t_k \vert ]\).
Suppose that \(\ell > 0\). It suffices to show that \(\vert {\mathcal {R}}[k - \ell , k] \vert - \vert {\mathcal {R}}[k - \ell + 1, k] \vert = t_k[j]\). For \(i \in [1,m]\), denote by \(R'_i\) the string \(R_i[k] R_i[k - 1] \cdots R_i[k - \ell ]\). Fix \(w \in {\mathcal {R}}[k - \ell + 1, k]\). Since \(R'_{a_k[1]} \le \cdots \le R'_{a_k[m]}\) lexicographically, there are numbers h and \(h'\) such that \(R_{a_k[i]}[k - \ell + 1, k] = w\) iff \(i \in [h,h']\). Further, we have \(R_{a_k[h]}[k - \ell ] \le R_{a_k[h + 1]}[k - \ell ] \le \cdots \le R_{a_k[h']}[k - \ell ]\). Thus, by definition of \(d_k\), for \(i \in [h + 1, h']\), we have \(R_{a_k[i-1]}[k - \ell ] \ne R_{a_k[i]}[k - \ell ]\) iff \(d_k[i] = k - \ell + 1 = s_k[j]\). Note that \(d_k[h] > s_k[j]\). Therefore, the number of strings \(R_i[k - \ell , k]\) from \({\mathcal {R}}[k - \ell , k]\) having suffix w is equal to one plus the number of integers \(s_k[j]\) in the range \(d_k[h, h']\), which implies \(\vert {\mathcal {R}}[k - \ell , k] \vert - \vert {\mathcal {R}}[k - \ell + 1, k] \vert = t_k[j]\). \(\square\)
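As with Lemma 4, the identity of Lemma 5 can be verified by brute force on a toy input (helper names illustrative; the case \(s_k[j] = 1\), where \({\mathcal {R}}[s_k[j]-1, k]\) is undefined, is skipped):

```python
def distinct(R, j, k):
    # |R[j, k]| (1-based, inclusive columns)
    return len({r[j - 1:k] for r in R})

def d_array(R, k):
    # d_k computed naively from the definition
    rev = sorted(r[:k][::-1] for r in R)
    d = [k + 1]
    for s, t in zip(rev, rev[1:]):
        lcp = 0
        while lcp < k and s[lcp] == t[lcp]:
            lcp += 1
        d.append(k + 1 - lcp)
    return d

R = ["baaaa", "baaab", "babab"]      # toy input
for k in range(1, 6):
    d = d_array(R, k)
    s = sorted(set(d))               # s_k
    t = [d.count(v) for v in s]      # t_k
    for j, v in enumerate(s):
        if v >= 2:                   # skip the special case s_k[j] = 1
            assert distinct(R, v - 1, k) == sum(t[j:])
```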
It remains to describe how we maintain \(a_k, e_k, s_k, t_k, u_k\).
Lemma 6
Algorithm 3 computes the arrays \(a_k, e_k, s_k, t_k, u_k\) from the numbers \(M(k - L)\) and \(M(k - 1)\), and from the arrays \(a_{k-1}, e_{k-1}, s_{k-1}, t_{k-1}, u_{k-1}\) in O(m) time, assuming the input alphabet is \([0,\vert \Sigma \vert {-}1]\) with \(\vert \Sigma \vert = O(m)\).
Proof
Let us analyze Algorithm 3, which computes \(a_k, e_k, s_k, t_k, u_k\). By definition, \(d_{k-1}[i] = s_{k-1}[e_{k-1}[i]]\) for \(i \in [1,m]\). The first line of the algorithm initializes \(s_k\) so that \(d_{k-1}[i] = s_k[e_{k-1}[i]]\), for \(i \in [1,m]\), and \(s_k[\vert s_k \vert ] = k + 1\). Since after this initialization \(s_k\), obviously, is in sorted order, one has, for \(i, j \in [1,m]\), \(e_{k-1}[i] \le e_{k-1}[j]\) iff \(d_{k-1}[i] \le d_{k-1}[j]\) and, therefore, for \(\ell \in [i,j]\), one has \(d_{k-1}[\ell ] = \max \{d_{k-1}[\ell '] : i \le \ell ' \le j\}\) iff \(e_{k-1}[\ell ] = \max \{e_{k-1}[\ell '] : i \le \ell ' \le j\}\). Based on this observation, we fill \(e_k\) in lines 3–12 so that \(d_k[i] = s_k[e_k[i]]\), for \(i \in [1,m]\), using exactly the same approach as in Algorithm 1, where \(d_k\) is computed, but instead of the assignment \(d_k[C[b]] \leftarrow k + 1\), we have \(e_k[C[b]] \leftarrow \vert s_k \vert\) since \(s_k[\vert s_k \vert ] = k + 1\). Here we also compute \(a_k\) in the same way as in Algorithm 1.
The loop in line 13 fills \(t_k\) so that, for \(i \in [1,\vert s_k \vert ]\), \(t_k[i]\) is the number of occurrences of the integer i in \(e_k\) (\(t_k\) was zero initialized in line 3). Since, for \(i \in [1,m]\), we have \(d_k[i] = s_k[e_k[i]]\) at this point, \(t_k[i]\) is also the number of occurrences of the integer \(s_k[i]\) in \(d_k[1,m]\).
By definition, \(s_k\) must contain only elements from \(d_k\), but this is not necessarily the case in line 14. In order to fix \(s_k\) and \(t_k\), we simply remove all elements \(s_k[i]\) for which \(t_k[i] = 0\), moving all remaining elements of \(s_k\) and the nonzero elements of \(t_k\) to the left accordingly. Suppose that, for some h and i, we have \(e_k[h] = i\) and the number \(s_k[i]\) is moved to \(s_k[j]\), for some \(j < i\), as we fix \(s_k\). Then, \(e_k[h]\) must become j. We utilize an additional temporary array \(tmp[1,\vert s_k \vert ]\) to fix \(e_k\). The loop in lines 16–23 fixes \(s_k\) and \(t_k\) in an obvious way; once \(s_k[i]\) is moved to \(s_k[j]\) during this process, we assign \(tmp[i] = j\). Then, \(s_k\), \(t_k\), \(u_k\) (\(u_k\) is discussed below) are resized in line 24, and the loop in line 25 fixes \(e_k\) using tmp.
Recall that \([s_k[j - 1] - 1, s_k[j] - 2]\), for \(j \in [1,\vert s_k \vert ]\), is a system of disjoint segments covering \([0, k - 1]\) (assuming \(s_k[0] = 1\)). It is now easy to see that this system is obtained from the system \([s_{k-1}[j - 1] - 1, s_{k-1}[j] - 2]\), with \(j \in [1, \vert s_{k-1} \vert ]\) (assuming \(s_{k-1}[0] = 1\)), by adding the new segment \([k - 1, k - 1]\) and joining some segments together. The second line of the algorithm copies \(u_{k-1}\) into \(u_k\) and adds \(M'(k - 1)\) to the end of \(u_k\), so that, for \(j \in [1, \vert u_{k-1} \vert ]\), \(u_k[j]\) is equal to the minimum of \(M'(\ell )\) for all \(\ell\) from the segment \([s_{k-1}[j - 1] - 1, s_{k-1}[j] - 2]\) and \(u_k[\vert u_{k-1} \vert {+}1] = M'(k - 1)\) is the minimum in the segment \([k - 1, k - 1]\). (This is not completely correct since \(M'\) has changed as k was increased; namely, \(M'(k - L)\) was equal to \(+\infty\) but now is equal to \(M(k - L)\).) As we join segments removing some elements from \(s_k\) in the loop 16–23, the array \(u_k\) must be fixed accordingly: if \([s_k[j - 1] - 1, s_k[j] - 2]\) is obtained by joining \([s_{k-1}[h - 1] - 1, s_{k-1}[h] - 2]\), for \(j' \le h \le j''\), then \(u_k[j] = \min \{u_{k-1}[h] : j' \le h \le j''\}\). We perform such fixes in line 17, accumulating the latter minimum. We start accumulating a new minimum in line 20, assigning \(u_k[j + 1] \leftarrow u_{k-1}[i + 1]\). If at this point the ready minimum accumulated in \(u_k[j]\) corresponds to a segment containing the position \(k - L\), we have to fix \(u_k\) taking into account the new value \(M'(k - L) = M(k - L)\); we do this in line 21. To avoid accessing out of range elements in \(u_k\) and \(u_{k-1}\) in line 20, we add a “dummy” element to, respectively, \(u_k\) and \(u_{k-1}\) in line 15. \(\square\)
Besides all the arrays of length m, Algorithm 3 also requires access to \(M(k - L)\) and, possibly, to \(M(k - 1)\). During the computation of M(k) for \(k \in [1,n]\), we maintain the last L calculated numbers \(M(k - 1), M(k - 2), \ldots , M(k - L)\) in a circular array, so that the overall required space is \(O(m + L)\); when k is incremented, the array is modified in O(1) time in an obvious way. Thus, Lemma 6 implies Theorem 1.
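The circular array holding the last L values of M can be sketched as follows (class and method names are illustrative):

```python
class RecentM:
    # Circular array keeping only the last L values M(k-1), ..., M(k-L);
    # a sketch of the O(m + L) space bookkeeping described above.
    def __init__(self, L):
        self.L = L
        self.buf = [None] * L

    def store(self, k, value):     # record M(k), overwriting M(k - L)
        self.buf[k % self.L] = value

    def recall(self, j):           # fetch M(j); valid only for the last L indices
        return self.buf[j % self.L]

mvals = RecentM(3)
for k, v in [(1, 10), (2, 20), (3, 30), (4, 40)]:
    mvals.store(k, v)
print(mvals.recall(2), mvals.recall(3), mvals.recall(4))   # 20 30 40
```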
If, as in our case, one does not need \(s_k, t_k, u_k\) for all k, the arrays \(s_k\), \(t_k\), \(u_k\) can be modified in-place, i.e., \(s_k\), \(t_k\), \(u_k\) can be considered as aliases for \(s_{k-1}\), \(t_{k-1}\), \(u_{k-1}\), and yet the algorithm remains correct. Thus, we really need only 7 arrays in total: \(a_k\), \(a_{k-1}\), \(e_k\), \(e_{k-1}\), s, t, u, where s, t, u serve as \(s_k\), \(t_k\), \(u_k\), and the array tmp can be organized in place of \(a_{k-1}\) or \(e_{k-1}\). It is easy to maintain along with each value \(u_k[j]\) a corresponding position \(\ell\) such that \(u_k[j] = M'(\ell )\); these positions can then be used to restore the found segmentation of \({\mathcal {R}}\) using backtracking (see the beginning of the section). To compute \(e_k\), instead of using an RMQ data structure, one can adapt in an obvious way Algorithm 2, rewriting the arrays \(a_{k-1}\) and \(e_{k-1}\) during the computation, which is faster in practice but theoretically takes \(O(m\log \Sigma )\) time by Lemma 3. We do not discuss further details as they are straightforward.
From segmentation to founder set
Now we are given a segmentation \({\mathcal {S}}\) of \({\mathcal {R}}\) and we wish to produce a founder set \({\mathcal {F}}\) that obeys the segment boundaries. Recall that such a founder set corresponds to a parse \({\mathcal {P}}\) of \({\mathcal {R}}\) with respect to the segmentation \({\mathcal {S}}\). We conjecture that finding an optimal parse/founder set that minimizes the number of crossovers at segment boundaries is an NP-hard problem, but we have not been able to prove this claim. Therefore, we continue by proposing three natural strategies, of which the latter two have interesting theoretical properties. The first strategy is a naive baseline, the second is a greedy strategy, and the third is based on maximum weight perfect matching in a bipartite graph analogous to the one by Ukkonen [18]. The matching strategy provides an optimal solution for a special case, and the greedy strategy gives a 2-approximation for the same special case. We present all three strategies first for the special case and then describe how to reduce the general case to this special case (however, losing all optimality guarantees while doing so). We compare the naive baseline with the perfect matching strategy in our experiments.
Assume (for our special case) that each segment in \({\mathcal {S}}\) induces exactly M(n) distinct substrings in \({\mathcal {R}}\). Then the naive baseline strategy to produce a founder set is to concatenate the distinct substrings of segment 1 with the distinct substrings of segment 2 in random order, and to continue this process from left to right until M(n) founder sequences of length n are produced. For the latter two strategies, the idea is that instead of a random permutation, we aim to find a permutation that gives a concatenation order minimizing the number of crossovers at each segment boundary. For this purpose, it is sufficient to consider two consecutive segments [a, b] and \([b+1,c]\) as two partitions of the rows of \({\mathcal {R}}\). Namely, consider a distinct substring X of the segment [a, b] and the induced set \(A \subseteq \{1,2,\ldots ,m\}\) such that \(R_i[a,b]=X\) for all \(i\in A\). Analogously, consider a distinct substring Y of the segment \([b+1,c]\) and the induced set \(B \subseteq \{1,2,\ldots ,m\}\) such that \(R_i[b+1,c]=Y\) for all \(i\in B\). If the concatenation XY forms the content F[a, c] of some founder F, then this concatenation causes \(m-\vert A\cap B \vert\) crossovers. Hence, to minimize crossovers, one seeks to maximize the intersection between the two partitions, as studied next.
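As a toy illustration (the row contents below are hypothetical data of our own), the induced sets and the crossover count at a single boundary can be computed as follows:

```python
# Hypothetical toy data: m = 6 rows of R restricted to two consecutive
# segments [a, b] and [b+1, c].
m = 6
left  = ["AC", "AC", "GT", "GT", "AC", "TT"]   # R_i[a, b] for each row i
right = ["GG", "GG", "CA", "CA", "CA", "GG"]   # R_i[b+1, c] for each row i

def induced_set(column, substring):
    """Rows whose content at this segment equals the given substring."""
    return {i for i, s in enumerate(column) if s == substring}

A = induced_set(left, "AC")    # rows reading X = "AC" on the left segment
B = induced_set(right, "GG")   # rows reading Y = "GG" on the right segment
# If the concatenation "ACGG" becomes founder content F[a, c], then
# it causes m - |A ∩ B| crossovers at the boundary.
assert A == {0, 1, 4} and B == {0, 1, 5}
assert m - len(A & B) == 4
```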
Problem of maximum intersection between two partitions. Let a be an integer. Given two partitions \(E_1\) and \(E_2\) of \(\{1,\ldots ,a\}\) with \(\vert E_1 \vert = \vert E_2 \vert\), the problem of Maximum Intersection Between two Partitions (MIBP) is to find the bijection f from \(E_1\) to \(E_2\) which maximizes \(\sum _{x \in E_1} \vert x \cap f(x) \vert\).
Consider the bipartite graph between the elements of \(E_1\) and the elements of \(E_2\) in which, for \(x \in E_1\) and \(y \in E_2\), the weight of the edge (x, y) is \(w(x,y) = \vert x \cap y \vert\). A maximum weight perfect matching of this graph gives an optimal solution of MIBP, and hence this problem can be solved in polynomial time.
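On a small instance, MIBP can be checked by exhaustive search over all bijections (a stdlib-only sketch on a toy instance of our own; a real implementation would use a maximum weight perfect matching algorithm instead):

```python
# Stdlib-only sketch checking MIBP by brute force over all bijections
# between E1 and E2; only practical for tiny partitions.
from itertools import permutations

E1 = [{1, 2, 3}, {4, 5}, {6}]          # partition of {1,...,6}
E2 = [{1, 2}, {3, 4, 5}, {6}]          # second partition, |E1| = |E2|

def mibp_bruteforce(E1, E2):
    """Return the best total intersection and a bijection achieving it."""
    best, best_f = -1, None
    for perm in permutations(range(len(E2))):
        score = sum(len(E1[i] & E2[perm[i]]) for i in range(len(E1)))
        if score > best:
            best, best_f = score, perm
    return best, best_f

score, f = mibp_bruteforce(E1, E2)
assert score == 5 and f == (0, 1, 2)   # pairing blocks index-wise: 2 + 2 + 1
```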
We can define the greedy algorithm for MIBP as the greedy algorithm for the problem of maximum weight perfect matching in the above bipartite graph. As the greedy algorithm for maximum weight perfect matching is a \(\frac{1}{2}\)-approximation [24], we obtain the same approximation ratio for the greedy algorithm for MIBP.
Lemma 7
Let \(E_1\) and \(E_2\) be two partitions of \(\{1,\ldots ,a\}\) with \(\vert E_1 \vert = \vert E_2 \vert\). The greedy algorithm for MIBP on \(E_1\) and \(E_2\) can be executed in O(a) time.
Proof
Let E be a partition of \(\{1,\ldots ,a\}\) and \(\prec\) be a total order on E. We denote by \(G_E\) the array of elements of E of size a such that, for all i, \(G_E[i] = e_i\) where \(i \in e_i \in E\). Let \(x \in E_1\) and \(y \in E_2\). We have \(w(x,y) = \vert x \cap y \vert = \vert \{i \in \{1,\ldots ,a\} \mid i \in x \cap y\} \vert = \vert \{i \in \{1,\ldots ,a\} \mid G_{E_1}[i] = x \text { and } G_{E_2}[i] = y \} \vert\). It follows that the number of edges of non-zero weight is at most a. By using radix sort, we can compute in O(a) time the sorted array of elements of \(\{1,\ldots ,a\}\) following the order in which \(i<j\) iff \(G_{E_1}[i] \prec G_{E_1}[j]\), or \(G_{E_1}[i] = G_{E_1}[j]\) and \(G_{E_2}[i] \prec G_{E_2}[j]\). With this array, since \(w(x,y) \le a\) for all \(x \in E_1\) and \(y \in E_2\), we can compute (by further radix sort and renaming steps) in O(a) time the ordered list \([(x_1,y_1),\ldots ,(x_q,y_q)]\) such that \(w(x_1,y_1) \ge \cdots \ge w(x_q,y_q) > 0\) with \(q \le a\). By taking the elements in the order of this list, we can compute in O(a) time two arrays f and \(f^{-1}\) of size \(\vert E_1 \vert\) such that \(\{(i,f[i]) \mid i \in E_1 \}\) and \(\{(f^{-1}[i],i) \mid i \in E_2 \}\) represent the same solution of the greedy algorithm for MIBP. \(\square\)
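A Python sketch of the greedy matching of Lemma 7 (illustrative only; the helper name `greedy_mibp` is ours, and `Counter.most_common` stands in for the radix-sort steps that make the proof run in O(a) time):

```python
# Sketch of the greedy MIBP matching: count co-occurrence weights of
# block pairs, then scan the pairs in decreasing weight order and match
# any pair whose two blocks are both still unmatched.
from collections import Counter

def greedy_mibp(G1, G2):
    # G1[i] and G2[i] name the blocks of E1 resp. E2 containing element i.
    w = Counter(zip(G1, G2))                 # the <= a non-zero edge weights
    f, f_inv = {}, {}
    for (x, y), _ in w.most_common():        # edges by decreasing weight
        if x not in f and y not in f_inv:    # greedily match unmatched blocks
            f[x], f_inv[y] = y, x
    return f

# Elements 1..6: E1 = {{1,2,3},{4,5},{6}} named A,B,C; E2 named X,Y,Z.
G1 = ["A", "A", "A", "B", "B", "C"]
G2 = ["X", "X", "Y", "Y", "Y", "Z"]
assert greedy_mibp(G1, G2) == {"A": "X", "B": "Y", "C": "Z"}
```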
Optimal founder set for the special case. Now we can solve the MIBP problem independently for each pair of consecutive segments, resulting in the following theorems, where the first also follows directly from earlier constructions [18], and the latter follows from Lemma 7.
Theorem 8
([18]) Given a segmentation \({\mathcal {S}}\) of \({\mathcal {R}}\) such that each segment induces exactly K distinct substrings in \({\mathcal {R}}\), we can construct an optimal parse \({\mathcal {P}}\) of \({\mathcal {R}}\) (and hence the corresponding set of founders) in polynomial time.
Theorem 9
Given a segmentation \({\mathcal {S}}\) of \({\mathcal {R}}\) such that each segment induces exactly K distinct substrings in \({\mathcal {R}}\), we can construct a greedy parse \({\mathcal {P}}\) of \({\mathcal {R}}\) (and hence the corresponding set of founders) that has at most twice as many crossovers as the optimal parse, in \(O(\vert {\mathcal {S}} \vert \times m )\) time and \(O(\vert {\mathcal {S}} \vert \times m )\) space.
Results
We implemented the segmentation algorithm using Algorithm 2 to build the pBWT arrays and computed the minimum number of founders with the given value of L using the recursion in Eq. 3. This part of the implementation corresponds to Lemma 3, and thus the overall time complexity of the implemented approach is \(O(mn \log \Sigma )\). After computing the minimum number of founders, we use backtracking to determine the optimal segmentation. Since we use the pBWT arrays to determine the distinct substrings in each segment, as part of the first phase of building the arrays we also store samples and now update them to the segment boundary positions in parallel. We proceed to join adjacent segments from left to right until the number of distinct substrings in one segment would exceed the minimum number of founders, and finally we concatenate the substrings to generate founder sequences. The implementation outputs for each segment the distinct founder sequence fragments, and associates to each fragment the set of haplotypes containing that fragment as a substring at that location (these are easily deduced given the segmentation and the positional BWT structures). Our implementation uses integer vectors from the SDSL library [25].
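The left-to-right joining step can be sketched as follows (an illustrative sketch of our own, not the actual implementation, which works on the pBWT arrays rather than recomputing distinct substrings from scratch; the function name and toy data are ours):

```python
# Sketch: greedily join adjacent segments left to right while the joined
# segment still induces at most K distinct substrings, where K is the
# previously computed minimum number of founders.
def join_segments(R, starts, K):
    """starts: sorted start positions of the initial segments."""
    n = len(R[0])
    joined, start, prev = [], starts[0], starts[0]
    for b in starts[1:] + [n]:
        if len({r[start:b] for r in R}) > K:   # joining would exceed K
            joined.append((start, prev))       # close the current segment
            start = prev
        prev = b
    joined.append((start, n))
    return joined

R = ["AAAA", "AABB", "BBBB"]                   # toy haplotypes with K = 2
assert join_segments(R, [0, 1, 2, 3], 2) == [(0, 2), (2, 4)]
```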
As our goal is to produce reference sequences for aligning short reads, we wanted to find a good value of L to generate a segmentation suitable for this purpose. In particular, we wanted to have the length of most segments clearly above a typical read length, such that most reads could be aligned without hitting a recombination site.
We used the chromosome 6 variants from the phase 3 data of the 1000 Genomes Project [2] as the starting point. We converted the variant data to a multiple sequence alignment with vcf2multialign,^{1} which resulted in 5009 haplotype sequences of equal length (including the reference sequence) of approximately 171 million characters. In order to reduce the running time of our tool, we discarded columns of identical characters as they would not affect the number of recombination sites. This reduced each sequence to approximately 5.38 million characters.
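The column-filtering preprocessing step can be sketched as follows (the function name and toy sequences are our own):

```python
# Sketch of the preprocessing step: drop every column of the multiple
# sequence alignment in which all sequences carry the same character.
def drop_identical_columns(seqs):
    keep = [j for j in range(len(seqs[0]))
            if len({s[j] for s in seqs}) > 1]
    return ["".join(s[j] for j in keep) for s in seqs]

seqs = ["ACGT", "AGGT", "ATGT"]   # columns 0, 2 and 3 are constant
assert drop_identical_columns(seqs) == ["C", "G", "T"]
```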
Summarized results with 5009 input sequences
L  | Number of founders | Average segment length | Median number of recombinations | Average distance between recombinations
10 | 130  | 2395   | 15,794 | 9624
12 | 246  | 4910   | 11,716 | 14,025
14 | 331  | 6467   | 9759   | 17,126
16 | 462  | 9312   | 7801   | 21,860
18 | 766  | 14,383 | 5593   | 30,571
20 | 1057 | 20,151 | 4411   | 39,090
40 | 1513 | 30,228 | 3228   | 54,386
80 | 3093 | 67,994 | 1176   | 146,655
The results look promising: using 130 founders instead of 5009 haplotypes as the input to our pangenome indexing approach [12] will result in significant savings of resources; this solves the space bottleneck, and the preprocessing by founder reconstruction also saves time in the heavy indexing steps.
Our intention was to compare our tool to an implementation of Ukkonen’s algorithm [19]. However, initial testing with four input sequences showed that the latter implementation is not practical with a data set of this size.
Conclusions
As our experiments indicate that one can reduce 5009 haplotypes down to 130 founders with the average distance between two crossovers being 9624 bases, one can expect short read alignment and variant calling to become practical in such a pangenomic setting. We are investigating this with our tool PanVC [12], where one can simply replace its input multiple alignment with the one made of the founder sequences. With graph-based approaches, slightly more effort is required: input variations are encoded with respect to the reference, so one first needs to convert variants into a multiple alignment, apply the founder reconstruction algorithm, and finally convert the multiple alignment of founder sequences into a directed acyclic graph. The PanVC toolbox provides the required conversions. Alternatively, one can construct the pangenome graph using other methods and map the founder sequences afterwards to the paths of the graph: if the original haplotype sequences are already spelled as paths, each founder sequence is a concatenation of existing subpaths, and can hence be mapped to a continuous path without alignment (possibly requiring the addition of a few missing edges).
Finally, it will be interesting to see how much the contiguity of the founder sequences can still be improved with different formulations of the segmentation problem. We are investigating a variant with the number of founder sequences fixed.
Acknowledgements
We wish to thank the anonymous reviewers for useful suggestions that helped us to improve the presentation.
Authors' contributions
TN and VM initiated the study and observed that pBWT can be exploited to improve the earlier solution. BC and DK developed the details of the linear time algorithm and wrote major parts of the manuscript. TN implemented the algorithm and conducted the experiments. All authors read and approved the final manuscript.
Funding
This work was partially supported by the Academy of Finland (Grant 309048).
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
References
1. Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Brief Bioinform. 2018;19(1):118–35.
2. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526(7571):68–74.
3. Exome Aggregation Consortium. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536(7616):285–91.
4. The UK10K Consortium. The UK10K project identifies rare variants in health and disease. Nature. 2015;526(7571):82–90.
5. Schneeberger K, Hagmann J, Ossowski S, Warthmann N, Gesing S, Kohlbacher O, Weigel D. Simultaneous alignment of short reads against multiple genomes. Genome Biol. 2009;10:98.
6. Huang L, Popic V, Batzoglou S. Short read alignment with populations of genomes. Bioinformatics. 2013;29(13):361–70.
7. Sirén J, Välimäki N, Mäkinen V. Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans Comput Biol Bioinform. 2014;11(2):375–88.
8. Dilthey A, Cox C, Iqbal Z, Nelson MR, McVean G. Improved genome inference in the MHC using a population reference graph. Nat Genet. 2015;47:682–8.
9. Maciuca S, del Ojo Elias C, McVean G, Iqbal Z. A natural encoding of genetic variation in a Burrows–Wheeler transform to enable mapping and genome inference. In: Proceedings of the 16th international workshop on algorithms in bioinformatics, WABI 2016, Aarhus, Denmark, August 22–24, 2016. Lecture notes in computer science, vol. 9838; 2016. p. 222–33.
10. Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, Jones W, Garg S, Markello C, Lin MF, Paten B, Durbin R. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol. 2018;36:875. https://doi.org/10.1038/nbt.4227.
11. Sirén J, Garrison E, Novak AM, Paten B, Durbin R. Haplotype-aware graph indexes. In: 18th international workshop on algorithms in bioinformatics, WABI 2018, August 20–22, 2018, Helsinki, Finland. LIPIcs, vol. 113. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Wadern, Germany; 2018. p. 4:1–4:13.
12. Valenzuela D, Norri T, Välimäki N, Pitkänen E, Mäkinen V. Towards pan-genome read alignment to improve variation calling. BMC Genom. 2018;19(Suppl 2):87.
13. Mäkinen V, Navarro G, Sirén J, Välimäki N. Storage and retrieval of highly repetitive sequence collections. J Comput Biol. 2010;17(3):281–308.
14. Navarro G. Indexing highly repetitive collections. In: Proceedings of the 23rd international workshop on combinatorial algorithms (IWOCA). LNCS 7643; 2012. p. 274–9.
15. Wandelt S, Starlinger J, Bux M, Leser U. RCSI: scalable similarity search in thousand(s) of genomes. PVLDB. 2013;6(13):1534–45.
16. Ferrada H, Gagie T, Hirvola T, Puglisi SJ. Hybrid indexes for repetitive datasets. Philos Trans R Soc A. 2014;372(2016):20130137.
17. Gagie T, Puglisi SJ. Searching and indexing genomic databases via kernelization. Front Bioeng Biotechnol. 2015;3:12.
18. Ukkonen E. Finding founder sequences from a set of recombinants. In: Proceedings of the second international workshop on algorithms in bioinformatics, WABI 2002, Rome, Italy, September 17–21, 2002; 2002. p. 277–86.
19. Rastas P, Ukkonen E. Haplotype inference via hierarchical genotype parsing. In: Proceedings of the 7th international workshop on algorithms in bioinformatics, WABI 2007, Philadelphia, PA, USA, September 8–9, 2007; 2007. p. 85–97.
20. Durbin R. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics. 2014;30(9):1266–72.
21. Mäkinen V, Norri T. Applying the positional Burrows–Wheeler transform to all-pairs Hamming distance. Inf Process Lett. 2019;146:17–9. https://doi.org/10.1016/j.ipl.2019.02.003.
22. Norri T, Cazaux B, Kosolobov D, Mäkinen V. Minimum segmentation for pan-genomic founder reconstruction in linear time. In: 18th international workshop on algorithms in bioinformatics, WABI 2018, August 20–22, 2018, Helsinki, Finland. LIPIcs, vol. 113. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Wadern, Germany; 2018. p. 15:1–15:15.
23. Fischer J, Heun V. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J Comput. 2011;40(2):465–92. https://doi.org/10.1137/090779759.
24. Karp RM, Vazirani UV, Vazirani VV. An optimal algorithm for on-line bipartite matching. In: Proceedings of the twenty-second annual ACM symposium on theory of computing, STOC. ACM; 1990. p. 352–8.
25. Gog S, Beller T, Moffat A, Petri M. From theory to practice: plug and play with succinct data structures. In: 13th international symposium on experimental algorithms (SEA 2014); 2014. p. 326–37.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.