Abstract
Background
Genome-wide optical maps are ordered, high-resolution restriction maps that give the positions of restriction cut sites corresponding to one or more restriction enzymes. These genome-wide optical maps are assembled using an overlap-layout-consensus approach on raw optical map data, which are referred to as Rmaps. Due to the high error rate of Rmap data, finding the overlap between Rmaps remains challenging.
Results
We present Kohdista, which is an index-based algorithm for finding pairwise alignments between single molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching, and the application of modern index-based data structures. In particular, we combine the use of the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing that the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions.
Conclusion
We demonstrate that Kohdista is the only method capable of finding a significant number of high-quality pairwise Rmap alignments for large eukaryote organisms in reasonable time.
Background
There is a current resurgence in generating diverse types of data, to be used alone or in concert with short read data, in order to overcome the limitations of short read data. Data from an optical mapping system [1] is one such example and has itself become more practical with falling costs of high-throughput methods. For example, the current BioNano Genomics Irys System requires one week and $1000 USD to produce the Rmap data for an average-size eukaryote genome, whereas it required $100,000 and 6 months in 2009 [2]. These technological advances and the demonstrated utility of optical mapping in genome assembly [3,4,5,6,7] have driven several recent tool development efforts [8,9,10].
Genome-wide optical maps are ordered, high-resolution restriction maps that give the positions of restriction cut sites corresponding to one or more restriction enzymes. These genome-wide optical maps are assembled using an overlap-layout-consensus approach on raw optical map data, which are referred to as Rmaps. Hence, Rmaps are akin to reads in genome sequencing. In addition to the inaccuracies in the fragment sizes, there is the possibility of cut sites being spuriously added or deleted, which makes the problem of finding pairwise alignments between Rmaps challenging. To date, however, there is no efficient, non-proprietary method for finding pairwise alignments between Rmaps, which is the first step in assembling genome-wide maps.
Several existing methods are superficially applicable to Rmap pairwise alignment, but all of these programs either struggle to scale to even moderate-size genomes or require significant further adaptation to the problem. Several methods exhaustively evaluate all pairs of Rmaps using dynamic programming. One of these is the method of Valouev et al. [11], which is capable of solving the problem exactly but requires over 100,000 CPU hours to compute the alignments for rice [12]. The others are SOMA [13] and MalignerDP [10], which are designed only for semi-global alignments rather than the overlap alignments required for assembly.
Other methods reduce the number of map pairs to be individually considered by initially finding seed matches and then extending them through more intensive work. These include OMBlast [9], OPTIMA [8], and MalignerIX [10]. These, along with MalignerDP, were designed for a related alignment problem of aligning consensus data but cannot consistently find high-quality Rmap pairwise alignments in reasonable time, as we show later. This is unsurprising since these methods were designed for either already assembled optical maps or in silico digested sequence data as one of their inputs, both of which have a lower error rate than Rmap data. In addition, Muggli et al. [14] presented a method called Twin, which aligns assembled contigs to a genome-wide optical map. Twin differs from these previous methods in that it is unable to robustly find alignments between pairs of Rmaps due to the presence of added or missing cut sites.
In this paper, we present a fast, error-tolerant method for performing pairwise Rmap alignment that makes use of a novel FM-index based data structure. Although the FM-index can naturally be applied to short read alignment [15, 16], it is nontrivial to apply it to Rmap alignment. The difficulty arises from: (1) the abundance of missing or false cut sites; (2) the fragment sizes requiring inexact fragment-to-fragment matches (e.g. 1547 bp and 1503 bp represent the same fragment); and (3) the Rmap sequence alphabet consisting of all unique fragment sizes, making it extremely large (e.g., over 16,000 symbols for the goat genome). The latter two challenges render inefficient the standard FM-index backward search algorithm, which excels at exact matching over small alphabets since each step of the algorithm extends the search for a query string by a single character c. If the alphabet is small (say, the DNA alphabet) then a search for symbols of the alphabet other than c can be incorporated without much cost to the algorithm's efficiency. Yet, if the alphabet is large enough, this exhaustive search becomes impractical. The wavelet tree helps to remedy this problem. It allows efficiently answering queries of the form: find all symbols that allow extension of the backward search by a single character, where the symbol is within the range \([\alpha _1 \ldots \alpha _k]\) and where \(\alpha _1\) and \(\alpha _k\) are symbols in the alphabet such that \(\alpha _1\le \alpha _k\) [17]. In the case of optical mapping data, the alphabet is all fragment sizes. Thus, Muggli et al. [14] showed that constructing the FM-index and wavelet tree from this input can allow for sizing error to be accounted for by replacing each query in the FM-index backward search algorithm with a range query supported by the wavelet tree, i.e., if the fragment size in the query string is x, then the wavelet tree can support queries of the form: find all fragment sizes that allow extension of the backward search by a single fragment, where the fragment sizes in the range \([x - y, x + y]\) occur, and where y is a threshold on the error tolerance.
Muggli et al. [14] demonstrated that the addition of the wavelet tree can remedy the latter two challenges, i.e., sizing error and alphabet size, but the first and most notable challenge requires a more complex index-based data structure. The addition of the wavelet tree to the FM-index is not enough to allow for searches that are robust to inserted and deleted cut sites. To overcome the challenge of added or deleted cut sites while still accommodating the other two challenges, we develop Kohdista, an index-based Rmap alignment program that is capable of finding all pairwise alignments in large eukaryote organisms.
We first abstract the problem to that of approximate-path matching in a directed acyclic graph (DAG). The Kohdista method then indexes a set of Rmaps, represented as a DAG, using a modified form of the generalized compressed suffix array (GCSA), which is a variant of the FM-index developed by Sirén et al. [18]. Hence, the constructed DAG, which is stored using the GCSA, stores all Rmaps, along with all variations obtained by considering all speculative added and deleted cut sites. The GCSA stores the DAG in a manner such that paths in the DAG may be queried efficiently. If we contrast this with naïve automaton implementations, the GCSA has two advantages: it is space efficient, and it allows for efficient queries. Lastly, we demonstrate that the challenges posed by the inexact fragment sizes and the alphabet size can be overcome, specifically in the context of the GCSA, via careful use of a wavelet tree [17], and via using statistical criteria to control the quality of the discovered alignments.
Next, we point out some practical considerations concerning Kohdista. First, we note that Kohdista can be easily parallelized since once the GCSA is constructed from the Rmap data, it can be queried in parallel on as many threads as there are Rmaps to be queried. Next, in this paper, we focus on finding all pairwise alignments that satisfy some statistical constraints—whether they be global or local alignments. Partial alignments can be easily obtained by considering the prefix or suffix of the query Rmap and relaxing the statistical constraint.
We verify our approach on simulated E. coli Rmap data by showing that Kohdista achieves sensitivity and specificity similar to the method of Valouev et al. [12], and that with more permissive alignment acceptance criteria it aligns 90% of Rmap pairs simulated from overlapping genomic regions. We also show the utility of our approach on larger eukaryote genomes by demonstrating that existing published methods require more than 151 h of CPU time to find all pairwise alignments in the plum Rmap data, whereas Kohdista requires 31 h. Thus, we present the first fully indexed method capable of finding all match patterns in the pairwise Rmap alignment problem.
Preliminaries and definitions
Throughout we consider a string (or sequence) \(S = S[1 \ldots n] = S[1]S[2] \ldots S[n]\) of \(|S| = n\) symbols drawn from the alphabet \([1 \ldots \sigma ]\). For \(i=1, \ldots ,n\) we write \(S[i \ldots n]\) to denote the suffix of S of length \(n-i+1\), that is \(S[i \ldots n] = S[i]S[i+1] \ldots S[n]\), and \(S[1 \ldots i]\) to denote the prefix of S of length i. \(S[i \ldots j]\) is the substring \(S[i]S[i+1] \ldots S[j]\) of S that starts at position i and ends at j. Given a sequence S[1, n] over an alphabet \(\Sigma = \{1, \ldots ,\sigma \}\), a character \(c \in \Sigma\), and integers i, j, \({\textsf {rank}}_c(S,i)\) is the number of times that c appears in S[1, i], and \({\textsf {select}}_c(S,j)\) is the position of the jth occurrence of c in S. We remove S from the functions when it is implicit from the context.
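As a concrete illustration of these two operations, a naive Python sketch follows (a real index would answer both queries with succinct bit vectors; this stand-in only pins down the 1-based semantics used throughout):

```python
def rank(S, c, i):
    """Number of occurrences of symbol c in S[1..i] (1-based, inclusive)."""
    return S[:i].count(c)

def select(S, c, j):
    """1-based position of the j-th occurrence of c in S, or None if absent."""
    count = 0
    for pos, sym in enumerate(S, start=1):
        if sym == c:
            count += 1
            if count == j:
                return pos
    return None

S = "abracadabra"  # example string; 'a' occurs at positions 1, 4, 6, 8, 11
```

Note that the two operations are inverses in the sense that `rank(S, c, select(S, c, j)) == j`.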
Overview of optical mapping
From a computer science viewpoint, restriction mapping (by optical or other means) can be seen as a process that takes in two sequences: a genome \({\mathsf {A}}[1,n]\) and a restriction enzyme's recognition sequence \({\mathsf {B}}[1,b]\), and produces an array (sequence) of integers \({\textsf {R}}\), the genome restriction map, which we define as follows. First define the array of integers \({\textsf {C}}[1,m]\) where \({\textsf {C}}[i] = j\) if and only if \({\mathsf {A}}[j \ldots j+b-1] = {\mathsf {B}}\) is the ith occurrence of \({\mathsf {B}}\) in \({\mathsf {A}}\). Then \({\textsf {R}}[i] = ({\textsf {C}}[i]-{\textsf {C}}[i-1])\), with \({\textsf {R}}[1] = {\textsf {C}}[1]-1\). In words, \({\textsf {R}}\) contains the distance between occurrences of \({\mathsf {B}}\) in \({\mathsf {A}}\). For example, if we let \({\mathsf {B}}\) be act and \({\mathsf {A}}= {\texttt {atacttactggactactaaact}}\) then we would have \({\textsf {C}}= 3,7,12,15,20\) and \({\textsf {R}}= 2,4,5,3,5\). In reality, \({\textsf {R}}\) is a consensus sequence formed from millions of erroneous Rmap sequences. The optical mapping system produces millions of Rmaps for a single genome. It is performed on many cells of an organism, and for each cell there are thousands of Rmaps (each at least 250 Kbp in length in publicly available data). The Rmaps are then assembled to produce a genome-wide optical map. Like the final \({\textsf {R}}\) sequence, each Rmap is an array of lengths—or fragment sizes—between occurrences of \({\mathsf {B}}\) in \({\mathsf {A}}\).
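The construction of \({\textsf {C}}\) and \({\textsf {R}}\) from a genome and a recognition sequence can be sketched directly from the definitions above (a hypothetical helper for illustration, not part of any optical mapping pipeline):

```python
def restriction_map(A, B):
    """Return (C, R) per the definitions in the text:
    C lists the 1-based start positions of B in A (overlapping matches
    included); R lists the distances between consecutive cut sites,
    with R[1] = C[1] - 1."""
    b = len(B)
    C = [i + 1 for i in range(len(A) - b + 1) if A[i:i + b] == B]
    if not C:
        return C, []
    R = [C[0] - 1] + [C[i] - C[i - 1] for i in range(1, len(C))]
    return C, R
```

Running this on the worked example in the text reproduces \({\textsf {C}}= 3,7,12,15,20\) and \({\textsf {R}}= 2,4,5,3,5\).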
There are three types of errors that an Rmap (and hence, with lower magnitude and frequency, also the consensus map) can contain: (1) missing and false cuts, which are caused by an enzyme not cleaving at a specific site, or by random breaks in the DNA molecule, respectively; (2) missing fragments that are caused by desorption, where small (\(<1\) Kbp) fragments are lost and so not detected by the imaging system; and (3) inaccuracy in the fragment size due to varying fluorescent dye adhesion to the DNA and other limitations of the imaging process. Continuing again with the example above where \({\textsf {R}}= 2,4,5,3,5\) is the error-free Rmap: an example of an Rmap with the first type of error could be \({\textsf {R}}' = 6,5,3,5\) (the first cut site is missing so the fragment sizes 2 and 4 are summed to become 6 in \({\textsf {R}}'\)); an example of an Rmap with the second type of error would be \({\textsf {R}}'' = 2,4,3,5\) (the third fragment is missing); and lastly, the third type of error could be illustrated by \({\textsf {R}}''' = 2,4,7,3,5\) (the size of the third fragment is inaccurately given).
Frequency of errors
In the optical mapping system, there is a 20% probability that a cut site is missed and a 0.15% probability of a false break per Kbp, i.e., error type (1) occurs in a fragment. Popular restriction enzymes in optical mapping experiments recognize a 6 bp sequence, giving an expected cutting density of 1 per 4096 bp. At this cutting density, false breaks are less common than missing restriction sites (approx. \(0.25 \times 0.2 = 0.05\) missing sites vs. 0.0015 false sites per Kbp). The error in the fragment size is normally distributed with a mean of 0 bp and a variance of \(\ell \sigma ^2\), where \(\ell\) is equal to the fragment length and \(\sigma = 0.58\) Kbp [11].
Suffix arrays, BWT and backward search
The suffix array [19] \({\textsf {SA}}_{{\mathsf {X}}}\) (we drop subscripts when they are clear from the context) of a sequence \({\mathsf {X}}\) is an array \({\textsf {SA}}[1 \ldots n]\) which contains a permutation of the integers \([1 \ldots n]\) such that \({\mathsf {X}}[{\textsf {SA}}[1] \ldots n]< {\mathsf {X}}[{\textsf {SA}}[2] \ldots n]< \cdots < {\mathsf {X}}[{\textsf {SA}}[n] \ldots n]\). In other words, \({\textsf {SA}}[j] = i\) iff \({\mathsf {X}}[i \ldots n]\) is the \(j{\text{ th }}\) suffix of \({\mathsf {X}}\) in lexicographic order. For a sequence \({\mathsf {Y}}\), the \({\mathsf {Y}}\)-interval in the suffix array \({\textsf {SA}}_{{\mathsf {X}}}\) is the interval \({\textsf {SA}}[s \ldots e]\) that contains all suffixes having \({\mathsf {Y}}\) as a prefix. The \({\mathsf {Y}}\)-interval is a representation of the occurrences of \({\mathsf {Y}}\) in \({\mathsf {X}}\). For a character c and a sequence \({\mathsf {Y}}\), the computation of the \(c{\mathsf {Y}}\)-interval from the \({\mathsf {Y}}\)-interval is called a left extension.
The Burrows–Wheeler Transform \({\textsf {BWT}}[1 \ldots n]\) is a permutation of \({\mathsf {X}}\) such that \({\textsf {BWT}}[i] = {\mathsf {X}}[{\textsf {SA}}[i]-1]\) if \({\textsf {SA}}[i]>1\) and $ otherwise [20]. We also define \({\textsf {LF}}[i] = j\) iff \({\textsf {SA}}[j] = {\textsf {SA}}[i]-1\), except when \({\textsf {SA}}[i] = 1\), in which case \({\textsf {LF}}[i] = I\), where \({\textsf {SA}}[I] = n\). Ferragina and Manzini [21] linked \({\textsf {BWT}}\) and \({\textsf {SA}}\) in the following way. Let \({\textsf {C}}[c]\), for symbol c, be the number of symbols in \({\mathsf {X}}\) lexicographically smaller than c. The function \({\textsf {rank}}({\mathsf {X}},c,i)\), for sequence \({\mathsf {X}}\), symbol c, and integer i, returns the number of occurrences of c in \({\mathsf {X}}[1 \ldots i]\). It is well known that \({\textsf {LF}}[i] = {\textsf {C}}[{\textsf {BWT}}[i]] + {\textsf {rank}}({\textsf {BWT}},{\textsf {BWT}}[i],i)\). Furthermore, we can compute the left extension using \({\textsf {C}}\) and \({\textsf {rank}}\). If \({\textsf {SA}}[s \ldots e]\) is the \({\mathsf {Y}}\)-interval, then \({\textsf {SA}}[{\textsf {C}}[c]+{\textsf {rank}}({\textsf {BWT}},c,s),{\textsf {C}}[c]+{\textsf {rank}}({\textsf {BWT}},c,e)]\) is the \(c{\mathsf {Y}}\)-interval. This is called backward search, and a data structure supporting it is called an FM-index [21].
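A minimal, quadratic-time Python sketch may make the backward search interval arithmetic concrete (illustrative only: it builds the suffix array and BWT by brute force and appends a `$` sentinel, whereas a real FM-index uses compressed rank structures instead of explicit counting):

```python
def backward_search(X, Y):
    """Return the 1-based, inclusive SA interval (s, e) of the suffixes of
    X + '$' that have Y as a prefix, or None if Y does not occur in X."""
    T = X + '$'  # sentinel, lexicographically smallest symbol
    # Suffix array: 1-based starting positions, sorted lexicographically.
    SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
    # BWT[i] = T[SA[i]-1] (the symbol preceding each suffix; '$' wraps around).
    B = ''.join(T[i - 2] if i > 1 else '$' for i in SA)
    # C[c] = number of symbols in T lexicographically smaller than c.
    C = {c: sum(1 for x in T if x < c) for c in set(T)}
    s, e = 1, len(T)
    for c in reversed(Y):            # left extension, one symbol at a time
        if c not in C:
            return None
        # rank up to s-1 (exclusive of s) gives the first occurrence in [s, e]
        s = C[c] + B[:s - 1].count(c) + 1
        e = C[c] + B[:e].count(c)
        if s > e:
            return None
    return s, e
```

For example, searching for "ssi" in "mississippi" yields an interval of width two, matching its two occurrences.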
To support rank queries in backward search, a data structure called a wavelet tree can be used [17]. It occupies \(n\log \sigma + o(n\log \sigma )\) bits of space and supports \({\textsf {rank}}\) queries in \(O(\log \sigma )\) time. Wavelet trees also efficiently support a variety of more complex queries on the underlying string. We refer the reader to Gagie et al. [17] for a more thorough discussion of wavelet trees. One such query we will use in this paper is to return the set X of distinct symbols occurring in S[i, j], which takes \(O(|X|\log \sigma )\) time.
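The distinct-symbols query can be specified with a naive stand-in (a wavelet tree answers it in \(O(|X|\log \sigma )\) time; the linear scan below is only to fix the semantics, not to suggest an implementation):

```python
def distinct_in_range(S, i, j):
    """Sorted set X of distinct symbols occurring in S[i..j]
    (1-based, inclusive) — the query a wavelet tree supports in
    O(|X| log sigma) time; this naive stand-in takes O(j - i) time."""
    return sorted(set(S[i - 1:j]))
```

For example, on "abracadabra" the range \([3,7]\) covers "racad", whose distinct symbols are a, c, d, and r.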
The pairwise Rmap alignment problem
The pairwise Rmap alignment problem aims to align one Rmap (the query) \({\textsf {R}}_q\) against the set of all other Rmaps in the dataset (the target). We denote the target database as \({\textsf {R}}_1 \ldots {\textsf {R}}_n\), where each \({\textsf {R}}_i\) is a sequence of \(m_i\) fragment sizes, i.e., \({\textsf {R}}_i = [f_{i1}, \ldots , f_{im_i}]\). An alignment between two Rmaps is a relation between them comprising groups of zero or more consecutive fragment sizes in one Rmap associated with groups of zero or more consecutive fragments in the other. For example, given \({\textsf {R}}_i = [4, 5, 10, 9, 3]\) and \({\textsf {R}}_j = [10, 9, 11]\) one possible alignment is \(\{[4,5], [10]\}, \{ [10], [9]\}, \{[9], [11]\}, \{[3], []\}\). A group may contain more than one fragment (e.g. [4, 5]) when the restriction site delimiting the fragments is absent in the corresponding group of the other Rmap (e.g. [10]). This can occur if there is a false restriction site in one Rmap, or there is a missing restriction site in the other. Since we cannot tell from only two Rmaps which of these scenarios occurred, for the purpose of our remaining discussion it will be sufficient to consider only the scenario of missed (undigested) restriction sites.
Implementation
We now describe the algorithm behind Kohdista. Three main insights enable our index-based aligner for Rmap data: (1) abstraction of the alignment problem to a finite automaton; (2) use of the GCSA for storing and querying the automaton; and (3) modification of backward search to use a wavelet tree in specific ways to account for the Rmap error profile.
Finite automaton
Continuing with the example in the background section, we want to align \({\textsf {R}}' = 6,5,3,5\) to \({\textsf {R}}''' = 2,4,7,3,5\) and vice versa. To accomplish this we cast the Rmap alignment problem to that of matching paths in a finite automaton. A finite automaton is a directed, labeled graph that defines a language, or a specific set of sequences composed of vertex labels. A sequence is recognized by an automaton if it contains a matching path: a consecutive sequence of vertex labels equal to the sequence. We represent the target Rmaps as an automaton and the query as a path in this context.
Backbone
The automaton for our target Rmaps can be constructed as follows. First, we concatenate \({\textsf {R}}_1 \ldots {\textsf {R}}_n\) together into a single sequence with each Rmap separated by a special symbol which will not match any query symbol. Let \({\textsf {R}}^*\) denote this concatenated sequence. Hence, \({\textsf {R}}^* = [f_{11}, \ldots ,f_{1m_1}, \ldots , f_{n1}, \ldots ,f_{nm_n}]\). Then, construct an initial finite automaton \({\mathsf {A}}= (V, E)\) for \({\textsf {R}}^*\) by creating a set of vertices \(v^i_1 \ldots v^i_{m_i}\), one vertex per fragment for a total of \(|{\textsf {R}}^*|\) vertices, where each vertex is labeled with the length of its corresponding fragment. Edges are then added connecting vertices representing consecutive pairs of elements in \({\textsf {R}}^*\). Also, introduce to \({\mathsf {A}}\) a starting vertex \(v_1\) labeled with # and a final vertex \(v_f\) labeled with the character $. All other vertices in \({\mathsf {A}}\) are labeled with integral values. This initial set of vertices and edges is called the backbone. The backbone by itself is only sufficient for finding alignments with no missing cut sites in the query. The backbone of an automaton constructed for a set containing \({\textsf {R}}'\) and \({\textsf {R}}''\) would be #, 6, 5, 3, 5, 999, 2, 4, 3, 5, $, using 999 as an unmatchable value. Next, extra vertices (“skip vertices”) and extra edges are added to \({\mathsf {A}}\) to allow the automaton to accept all valid queries. Figure 1a illustrates the construction of \({\mathsf {A}}\) for a single Rmap with fragment sizes 2, 3, 4, 5, 6.
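The backbone construction described above can be sketched as follows (a hypothetical helper; the 999 sentinel and vertex ordering follow the example in the text):

```python
SEPARATOR = 999  # unmatchable sentinel between Rmaps, as in the text

def build_backbone(rmaps):
    """Vertex labels and edges of the backbone automaton for a list of
    Rmaps. Labels: '#' start vertex, '$' final vertex, and integers
    (fragment sizes or the separator) for all other vertices; edges
    connect consecutive vertices by index."""
    labels = ['#']
    for k, rmap in enumerate(rmaps):
        if k > 0:
            labels.append(SEPARATOR)
        labels.extend(rmap)
    labels.append('$')
    edges = [(i, i + 1) for i in range(len(labels) - 1)]
    return labels, edges
```

Applied to the two example Rmaps \({\textsf {R}}' = 6,5,3,5\) and \({\textsf {R}}'' = 2,4,3,5\), this reproduces the backbone #, 6, 5, 3, 5, 999, 2, 4, 3, 5, $.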
Skip vertices and skip edges
We introduce extra vertices labeled with compound fragments to allow missing cut sites (the first type of error) to be taken into account in querying the target Rmaps. We refer to these as skip vertices, as they provide alternative path segments which skip past two or more backbone vertices. Thus, we add a skip vertex to \({\mathsf {A}}\) for every pair of consecutive vertices in the backbone, as well as for every triple of consecutive vertices in the backbone, and label these vertices with the sum of the labels of the corresponding vertices. For example, the vertex labeled with 7 connecting 2 and 5 in Fig. 1a is an example of a skip vertex. Likewise, 5, 9, and 11 are other skip vertices. A skip vertex corresponding to a pair of vertices in the backbone corresponds to a single missing cut site and, similarly, a skip vertex corresponding to a triple of vertices in the backbone corresponds to two consecutive missing cut sites. The probability of more than two consecutive missing cut sites is negligible [11], and thus, we do not consider more than pairs or triples of vertices. Finally, we add skip edges which provide paths around vertices with small labels in the backbone. These allow desorption (the second type of error) to be taken into account in querying \({\textsf {R}}^*\).
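Assuming a backbone stored as a label list with consecutive edges, skip-vertex insertion for pairs and triples might look like the sketch below (the vertex numbering and edge bookkeeping are illustrative choices, not Kohdista's actual data layout; skip edges for desorption are omitted):

```python
def add_skip_vertices(labels, edges, separator=999):
    """Add a skip vertex for every pair and every triple of consecutive
    integer-labeled backbone vertices; each skip vertex is labeled with
    the compound fragment size (sum of the merged labels) and wired
    around the vertices it skips."""
    labels, edges = list(labels), list(edges)
    n = len(labels)
    for width in (2, 3):
        for i in range(1, n - width):  # windows of backbone vertices
            window = labels[i:i + width]
            if not all(isinstance(x, int) and x != separator for x in window):
                continue               # skip '#', '$' and the separator
            v = len(labels)
            labels.append(sum(window))        # compound-fragment label
            edges.append((i - 1, v))          # enter from the window's predecessor
            edges.append((v, i + width))      # exit to the window's successor
    return labels, edges

# Backbone for a single Rmap with fragments 2, 3, 4, 5, 6 (as in Fig. 1a)
backbone = ['#', 2, 3, 4, 5, 6, '$']
labels, edges = add_skip_vertices(backbone, [(i, i + 1) for i in range(6)])
```

On this example the pair skip vertices get labels 5, 7, 9, and 11, matching the skip vertices named in the text, and the triple skip vertices get labels 9, 12, and 15.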
Generalized compressed suffix array
We index the automaton with the GCSA for efficient storage and path querying. The GCSA is a generalization of the FM-index for automata. We will explain the GCSA by drawing on the definition of the more widely-known FM-index. As stated in the background section, the FM-index is based on the deep relationship between the \({\textsf {SA}}\) and the \({\textsf {BWT}}\) data structures of the input string \({\mathsf {X}}\). The \({\textsf {BWT}}\) of an input string is formed by sorting all characters of the string by the lexicographic order of the suffix immediately following each character. The main properties the FM-index exploits in order to perform queries efficiently are (a) \({\textsf {BWT}}[i] = {\mathsf {X}}[{\textsf {SA}}[i]-1]\); and (b) given that \({\textsf {SA}}[i] = j\), and \({\textsf {C}}[c]\) gives the position of the first suffix in \({\textsf {SA}}\) prefixed with character c, then using small auxiliary data structures we can quickly determine \(k = {\textsf {C}}[{\textsf {BWT}}[i]] + {\textsf {rank}}({\textsf {BWT}},{\textsf {BWT}}[i],i)\), such that \({\textsf {SA}}[k] = j-1\). The first of these properties is simply the definition of the \({\textsf {BWT}}\). The second holds because the symbols of \({\mathsf {X}}\) occur in the same order both in the single-character prefixes in the suffix array and in the \({\textsf {BWT}}\): given a set of sorted suffixes, prepending the same character onto each suffix does not change their order. Thus, if we consider all the suffixes in a range of \({\textsf {SA}}\) which are preceded by the same symbol c, that subset will appear in the same relative order in \({\textsf {SA}}\): as a contiguous subinterval of the interval that contains all the suffixes beginning with c.
Hence, by knowing the position of the interval in \({\textsf {SA}}\) corresponding to a symbol, and the \({\textsf {rank}}\) of an instance of that symbol, we can identify the \({\textsf {SA}}\) position beginning with that instance from its position in \({\textsf {BWT}}\). A rank data structure over the \({\textsf {BWT}}\) constitutes a sufficient compressed index of the suffix array needed for traversal.
To generalize the FM-index from strings to automata, we need to efficiently store the vertices and edges in a manner such that the FM-index properties still hold, allowing the GCSA to support queries efficiently. An FM-index's compressed suffix array for a string S encodes a relationship between each suffix of S and its left extension. Hence, this suffix array can be generalized to edges in a graph that represent a relationship between vertices. The compressed suffix array for a string is a special case where the vertices are labeled with the string's symbols in a non-branching path.
Prefixsorted automata
Just as backward search for strings is linked to suffix sorting, backward searching in the BWT of the automaton requires us to be able to sort the vertices (and a set of the paths) of the automaton in a particular way. This property is called prefix-sorted by Sirén et al. [18]. Let \(A = (V,E)\) be a finite automaton, let \(v_{|V|}\) denote its terminal vertex, and let \(v \in V\) be a vertex. We say v is prefix-sorted by prefix p(v) if the labels of all paths from v to \(v_{|V|}\) share a common prefix p(v), and no path from any other vertex \(u \ne v\) to \(v_{|V|}\) has p(v) as a prefix of its label. Automaton A is prefix-sorted if all vertices are prefix-sorted. See Fig. 1a for an example of a non-prefix-sorted automaton and a prefix-sorted automaton. A non-prefix-sorted automaton can be made prefix-sorted through a process of duplicating vertices and their incoming edges while dividing their outgoing edges between the new instances. We refer the reader to Sirén et al. [18] for a more thorough explanation of how to transform a non-prefix-sorted automaton into a prefix-sorted one.
Clearly, the prefixes p(v) allow us to sort the vertices of a prefix-sorted automaton into lexicographical order. Moreover, if we consider the list of outgoing edges (u, v), sorted by pairs (p(u), p(v)), they are also sorted by the sequences \(\ell (u)p(v)\), where \(\ell (u)\) denotes the label of vertex u. This dual sortedness property allows backward search to work over the list of vertex labels (sorted by p(v)) in the same way that it does for the symbols of a string ordered by their following suffixes in normal backward search for strings.
Each vertex has a set of one or more preceding vertices and therefore a set of predecessor labels in the automaton. These predecessor label sets are concatenated to form the \({\textsf {BWT}}\). The sets are concatenated in the order defined by the above-mentioned lexicographic ordering of the vertices. Each element in \({\textsf {BWT}}\) then denotes an edge in the automaton. Another bit vector, \({\textsf {F}}\), marks a ‘1’ for the first element of \({\textsf {BWT}}\) corresponding to a vertex and a ‘0’ for all subsequent elements in that set. Thus, the predecessor labels, and hence the associated edges, for a vertex with rank r are \({\textsf {BWT}}[{\textsf {select}}_1({\textsf {F}},r) \ldots {\textsf {select}}_1({\textsf {F}},r+1)]\). Another array, \({\textsf {M}}\), stores the outdegree of each vertex and allows the set of vertex ranks associated with a \({\textsf {BWT}}\) interval to be found using \({\textsf {rank}}()\) queries.
Exact matching: GCSA backward search
Exact matching with the GCSA is similar to the standard FM-index backward search algorithm. As outlined in the background section, FM-index backward search proceeds by finding a succession of lexicographic ranges that match progressively longer suffixes of the query string, starting from the rightmost symbol of the query. The search maintains two items—a lexicographic range and an index into the query string—and the property that the path prefix associated with the lexicographic range is equal to the suffix of the query marked by the query index. Initially, the query index is at the rightmost symbol and the range is [1…n] since every path prefix matches the empty suffix. The search continues using GCSA’s backward search step function, which takes as parameters the next symbol (to the left) in the query (i.e. fragment size in \({\textsf {R}}_q\)) and the current range, and returns a new range. The query index is advanced leftward after each backward search step. In theory, since the current range corresponds to a consecutive range in the \({\textsf {BWT}}\), the backward search could use \({\textsf {select}}()\) queries on the bit vector \({\textsf {F}}\) (see above) to determine all the edges adjacent to a given vertex, and then two FM-index \({\textsf {LF}}()\) queries are applied to the limits of the current range to obtain the new one. GCSA’s implementation uses one succinct bit vector per alphabet symbol to encode which symbols precede a given vertex instead of \({\textsf {F}}\). Finally, this new range, which corresponds to a set of edges, is mapped back to a set of vertices using \({\textsf {rank}}()\) on the \({\textsf {M}}\) bit vector.
Inexact matching: modified GCSA backward search
We modified GCSA backward search in the following ways. First, we modified the search process to combine consecutive fragments into compound fragments in the query Rmap in order to account for erroneous cut sites. Second, we added and used a wavelet tree in order to allow efficient retrieval of substitution candidates to account for sizing error. Lastly, we introduced backtracking to allow aligning Rmaps in the presence of multiple alternative size substitution candidates as well as alternative compound fragments for each point in the query. We now discuss these modifications in further detail below.
To accommodate possible false restriction sites that are present in the query Rmap, we generate compound fragments by summing pairs and triples of consecutive query fragment sizes. This summing of multiple consecutive query fragments is complementary to the skip vertices in the target automaton which accommodate false restriction sites in the target. We note for each query Rmap there will be multiple combinations of compound fragments generated.
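Enumerating the combinations of compound fragments for a query can be sketched recursively (the `max_run` cap of three follows the earlier observation that more than two consecutive missed cut sites are negligibly likely; the list-of-partitions representation is an illustrative choice, not Kohdista's internal encoding):

```python
def compound_fragments(fragments, max_run=3):
    """All ways to rewrite the query as compound fragments by summing runs
    of up to max_run consecutive fragment sizes (a missed cut site in the
    target merges adjacent query fragments). Returns a list of partitions."""
    if not fragments:
        return [[]]
    out = []
    for k in range(1, min(max_run, len(fragments)) + 1):
        head = sum(fragments[:k])                 # merge the first k fragments
        for rest in compound_fragments(fragments[k:], max_run):
            out.append([head] + rest)
    return out
```

For example, the query [2, 4, 7] yields the four combinations [2, 4, 7], [2, 11], [6, 7], and [13].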
Next, in order to accommodate possible sizing error in the Rmap data, we modified the backward search by adding and using a wavelet tree in our query of the GCSA. The original implementation of the GCSA does not construct or use the wavelet tree. Although it does consider alignments containing mismatches, it is limited to small alphabets (e.g., the DNA alphabet), which do not necessitate the use of the wavelet tree. Here, the alphabet comprises all possible fragment sizes. Thus, we construct the wavelet tree in addition to the GCSA. Then, when aligning fragment f in the query Rmap, we determine the set of candidate fragment sizes that are within some error tolerance of f by enumerating the distinct symbols in the currently active backward search range of the \({\textsf {BWT}}\) using the wavelet tree algorithm of Gagie et al. [17]. As previously mentioned, this use of the wavelet tree also exists in Twin [14], but there it is constructed and used in conjunction with an FM-index. We used the SDSL-Lite library by Gog et al. [22] to construct and store the GCSA.
Finally, since there may be multiple alternative size-compatible candidates in the \({\textsf {BWT}}\) interval of \({\textsf {R}}^*\) for a compound fragment, and multiple alternative compound fragments generated at a given position in the query Rmap, we add backtracking to backward search so that each candidate alignment is evaluated. We note that this is akin to the use of backtracking algorithms in short read alignment [15, 16]. Thus, for a given compound fragment size f generated from \({\textsf {R}}_q\), every possible candidate fragment size, \(f'\), that can be found in \({\textsf {R}}^*\) in the range \(f - t \ldots f + t\) and in the interval \(s \ldots e\) (of the \({\textsf {BWT}}\) of \({\textsf {R}}^*\)) for some tolerance t is considered as a possible substitute in the backward search.
Thus, to recap, when attempting to align each query Rmap, we consider every possible combination of compound fragments and use the wavelet tree to determine possible candidate matches during the backward search. There are potentially a large number of possible candidate alignments—for efficiency, these candidates are pruned by evaluating the alignment during each step of the search relative to statistical models of the expected error in the data. We discuss this pruning in the next subsection.
Pruning the search
Alignments are found by incrementally extending candidate partial alignments (paths in the automaton) to longer partial alignments by choosing one of several compatible extension matches (vertices adjacent to the end of a path in the automaton). To perform this search efficiently, we prune the search by computing Chi-squared CDF and binomial CDF statistics of the partial matches and use thresholds to ensure reasonable size agreement of the matched compound fragments and a reasonable frequency of putative missing cut sites. We conclude this section by giving an example of the backward search.
Size agreement
We use the Chi-squared CDF statistic to assess size agreement. This assumes the fragment size errors are independent, normally distributed events. For each pair of matched compound fragments in a partial alignment, we take the mean of the two sizes as the assumed true length and compute the expected standard deviation using this mean. Each compound fragment deviates from the assumed true value by half the distance between them. These two values contribute two degrees of freedom to the Chi-squared calculation. Thus, each deviation is normalized by dividing by the expected standard deviation, these are squared, and summed across all compound fragments to generate the Chi-squared statistic. We use the standard Chi-squared CDF function to compute the area under the probability density function up to this Chi-squared statistic, which gives the probability that two Rmap segments of common genomic origin would have a Chi-squared statistic no more extreme than observed. This probability is compared to Kohdista's chi-squared-cdf-thresh parameter; if smaller, the candidate compound fragment is assumed to be a reasonable match and the search continues.
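A minimal sketch of this calculation follows. Since every matched pair contributes two degrees of freedom, the total degrees of freedom are even, and the Chi-squared CDF has a closed form that needs no external library. The error model mapping a mean length to an expected standard deviation (here sd proportional to the square root of the length, with an illustrative coefficient) is an assumption, not necessarily the one used in the C++ implementation.

```python
import math

def chi2_cdf_even_df(x, df):
    """Chi-squared CDF for even degrees of freedom (closed form):
    P(X <= x) = 1 - exp(-x/2) * sum_{i<k} (x/2)^i / i!  with df = 2k."""
    assert df % 2 == 0 and df > 0
    k = df // 2
    term = total = 1.0
    for i in range(1, k):
        term *= (x / 2.0) / i
        total += term
    return 1.0 - math.exp(-x / 2.0) * total

def size_agreement_prob(pairs, sd_coeff=0.58):
    """pairs: matched compound-fragment sizes (a, b) in kb.
    Returns the probability compared against chi-squared-cdf-thresh."""
    stat = 0.0
    for a, b in pairs:
        mean = (a + b) / 2.0                     # assumed true length
        sd = sd_coeff * math.sqrt(mean)          # assumed error model
        dev = abs(a - b) / 2.0                   # each side's deviation
        stat += 2.0 * (dev / sd) ** 2            # two degrees of freedom
    return chi2_cdf_even_df(stat, 2 * len(pairs))
```

Perfectly agreeing pairs give probability 0, and larger size disagreements push the probability toward 1, so the threshold test prunes the worst-agreeing extensions first.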
Cut site error frequency
We use the binomial CDF statistic to assess the probability of the number of cut site errors in a partial alignment. This assumes missing cut site errors are independent Bernoulli events. We count all the putatively conserved cut sites on the boundaries and those delimiting compound fragments in both partially aligned Rmaps, plus twice the number of missed sites, as the number of Bernoulli trials. We use the standard binomial CDF function to compute the sum of the probability mass function up to the number of non-conserved cut sites in a candidate match. Like the size agreement calculation above, this gives the probability that two Rmaps of common genomic origin would have the number of non-conserved sites seen, or fewer, in the candidate partial alignment under consideration. This is compared to the binom-cdf-thresh parameter to decide whether to consider extensions to the given candidate partial alignment. Thus, given a set of Rmaps and input parameters binom-cdf-thresh and chi-squared-cdf-thresh, we produce the set of all Rmap alignments that have a Chi-squared CDF statistic less than chi-squared-cdf-thresh and a binomial CDF statistic less than binom-cdf-thresh. Both of these are subject to the additional constraints of a maximum run of two consecutive missed restriction sites between aligned sites and a minimum aligned site set cardinality of 16.
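The cut site test can be sketched in the same way. The function names, the per-site miss probability, and the reading of "twice the number of missed sites" as the non-conserved count (one per Rmap) are illustrative interpretations of the text, not the literal implementation.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summing the probability mass."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k + 1))

def cut_site_error_prob(conserved, missed, p_miss=0.2):
    """Probability of seeing this many or fewer non-conserved sites.
    Trials = conserved sites + twice the missed sites (one per Rmap),
    following the text; p_miss is an assumed per-site miss rate."""
    n = conserved + 2 * missed
    return binom_cdf(2 * missed, n, p_miss)
```

The resulting probability is then compared against binom-cdf-thresh before extending the partial alignment further.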
Example traversal
A partial search for a query Rmap [3 kb, 7 kb, 6 kb] in Fig. 1a and Table (b), given an error model with a constant 1 kb sizing error, would proceed with the following steps:
 1.
Start with the semi-open interval matching the empty string, [0…12).
 2.
A wavelet tree query on \({\textsf {BWT}}\) would indicate the set of symbols {5, 6, 7} is the intersection of two sets: (a) the set of symbols that would all be valid left extensions of the (currently empty) match string and (b) the set of size-appropriate symbols that match our next query symbol (i.e. 6 kb, working from the right end of our query) in light of the expected sizing error (i.e. 6 kb ± 1 kb).
 3.
We would then do a GCSA backward search step on the first value in the set (5) which would yield the new interval [4…7). This new interval denotes only nodes where each node’s common prefix is compatible with the spelling of our current backward traversal path through the automaton (i.e. our short path of just [5] does not contradict any path spellable from any of the three nodes denoted in the match interval).
 4.
A wavelet tree query on the \({\textsf {BWT}}\) for this interval for values 7 kb ± 1 kb would return the set of symbols {7}.
 5.
Another backward search step would yield the new interval [8…9). At this point our traversal path would be [7, 5] (denoted as a left extension of a forward path that we are building by traversing the graph backward). The common prefix of each node (only one node here) in our match interval (i.e. [7 kb]) is compatible with the path [7, 5]. This process would continue until backward search returns no match interval or our scoring model indicates our repeatedly left-extended path has grown too divergent from our query. At this point backtracking would occur to find other matches (e.g. at some point we would backward search using the value 6 kb instead of the 5 kb obtained in step 2).
Practical considerations
In this section we describe some of the practical considerations that were made in the implementation.
Memoization
One side effect of summing consecutive fragments in both the search algorithm and the target data structure is that several successive search steps with agreeing fragment sizes will also have agreeing sums of those successive fragments. In this scenario, proceeding deeper in the search space will result in wasted effort. To reduce this risk, we maintain a table of scores obtained when reaching a particular lexicographic range and query cursor pair. We only proceed with the search past this point when either the point has never been reached before, or has only been reached before with inferior scores.
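The memoization rule above can be sketched as a single helper. The function name and the choice to key the table on the (lexicographic range, query cursor) pair with a "proceed only on a strictly better score" policy follow the text; the actual table layout in the C++ code may differ.

```python
def better_than_seen(memo, interval, cursor, score):
    """Proceed with the search from this state only if the
    (lexicographic range, query cursor) pair is new, or is now being
    reached with a strictly better score than any previous visit."""
    key = (interval, cursor)
    best = memo.get(key)
    if best is not None and best >= score:
        return False                 # already explored at least this well
    memo[key] = score                # record the new best score
    return True
```

A search loop would call this before each extension, abandoning branches that re-enter a state with an inferior score.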
Wavelet tree threshold
The wavelet tree allows efficiently finding the set of vertex labels that are predecessors of the vertices in the current match interval intersected with the set of vertex labels that would be compatible with the next compound fragment to be matched in the query. However, when the match interval is sufficiently small (\(< 750\)) it is faster to scan the vertices in \({\textsf {BWT}}\) directly.
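This size-based dispatch can be sketched as follows; the 750-element cutoff comes from the text, while the function signature (taking any range-distinct query as a callable) is illustrative.

```python
def candidate_symbols(bwt, l, r, lo, hi, range_distinct, scan_threshold=750):
    """Return the distinct symbols in bwt[l:r) with values in [lo, hi].
    Small intervals are scanned directly; large ones are delegated to a
    wavelet tree range-distinct query passed in as a callable."""
    if r - l < scan_threshold:
        return {c for c in bwt[l:r] if lo <= c <= hi}
    return range_distinct(l, r, lo, hi)
```

The direct scan wins on small intervals because it avoids the wavelet tree's per-level rank overhead.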
Quantization
The alphabet of fragment sizes can be large considering all the measured fragments from multiple copies of the genome. This can cause an extremely large branching factor for the initial symbol and first few extensions in the search. To improve the efficiency of the search, the fragment sizes are initially quantized, thus reducing the size of the effective alphabet and the number of substitution candidates under consideration at each point in the search. Quantization also increases the number of identical path segments across the indexed graph which allows a greater amount of candidate matches to be evaluated in parallel because they all fall into the same \({\textsf {BWT}}\) interval during the search. This does, however, introduce some quantization error into the fragment sizes, but the bin size is chosen to keep this small in comparison to the sizing error.
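A minimal quantization sketch: each fragment size is snapped to the nearest multiple of the bin width, shrinking the effective alphabet. The bin width here is an illustrative default, chosen (as the text notes) to be small relative to the sizing error.

```python
def quantize(fragments, bin_kb=1.0):
    """Snap each fragment size (kb) to the nearest multiple of bin_kb,
    reducing the number of distinct symbols in the alphabet."""
    return [round(f / bin_kb) * bin_kb for f in fragments]
```

After quantization, nearby sizes such as 5.1, 5.2, and 4.9 kb collapse to one symbol, so paths through the automaton that differ only by sub-bin noise share BWT intervals during the search.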
Results
We evaluated Kohdista against the other available optical map alignment software. Our experiments measured runtime, peak memory, and alignment quality on simulated E. coli Rmaps and experimentally generated plum Rmaps. All experiments were performed on Intel Xeon computers with \(\ge\) 16 GB RAM running 64-bit Linux.
Parameters used We tried OPTIMA with both “p-value” and “score” scoring and the allMaps option, and report the higher sensitivity “score” setting. We followed the OPTIMA-Overlap protocol of splitting Rmaps into k-mers, each containing 12 fragments, as suggested in [8]. For OMBlast, we adjusted parameters maxclusteritem, match, fpp, fnp, meas, minclusterscore, and minconf. For MalignerDP, we adjusted parameters max-misses, miss-penalty, sd-rate, min-sd, and max-miss-rate, and additionally filtered the results by alignment score. Though unpublished, for comparison we also include the proprietary RefAligner software from BioNano. For RefAligner we adjusted parameters FP, FN, sd, sf, A, and S. For Kohdista, we adjusted parameters chi-squared-cdf-thresh and binom-cdf-thresh. For the method of Valouev et al. [12], we adjusted the score_thresh and t_score_thresh variables in the source. In Table 1 we report statistical and computational performance for each method.
OMBlast was configured with parameters meas = 3000, minconf = 0.09, minmatch = 15 and the rest left at defaults. RefAligner was run with parameters FP = 0.15, sd = 0.6, sf = 0.2, sr = 0.0, se = 0.0, A = 15, S = 22 and the rest left at defaults. MalignerDP was configured with parameters ref-max-misses = 2, query-miss-penalty = 3, query-max-miss-rate = 0.5, min-sd = 1500, and the rest left at defaults.
The method of Valouev et al. [12] was run with default parameters except we reduced the maximum compound fragment length (their \(\delta\) parameter) from 6 fragments to 3. We observed this method rarely included alignments containing more than two missed restriction sites in a compound fragment.
Performance on simulated E. coli Rmap data
To verify the correctness of our method, we simulated a read set from a 4.6 Mbp E. coli reference genome as follows: we started with 1,400 copies of the genome, and then generated 40 random loci within each. These loci form the ends of molecules that would undergo digestion. Molecules smaller than 250 kbp were discarded, leaving 272 Rmaps with a combined length equating to 35× coverage depth. The cleavage sites for the XhoI enzyme were then identified within each of these simulated molecules. We removed 20% of these at random from each simulated molecule to model partial digestion. Finally, normally distributed noise was added to each fragment with a standard deviation of 0.58 kb per 1 kb of the fragment. Simulated molecule pairs having 16 common conserved digestion sites become the set of “ground truth” alignments, which our method and the other methods should successfully identify. Our simulation resulted in 4,305 ground truth alignments meeting this criterion. Although a molecule would align to itself, such self alignments are not included in the ground truth set. This method of simulation was based on the E. coli statistics given by Valouev et al. [12] and resulted in a molecule length distribution consistent with that observed in publicly available Rmap data from OpGen, Inc.
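The per-molecule digestion step above can be sketched as follows. The function name is illustrative, and the noise scaling sd = 0.58 · sqrt(fragment length) is one reading of "0.58 kb per 1 kb of the fragment"; the paper's exact noise model may differ.

```python
import random

def digest_molecule(length_kb, cut_sites_kb, p_miss=0.2, sd_coeff=0.58):
    """Simulate one Rmap: drop each cut site with probability p_miss
    (partial digestion), then add normally distributed sizing noise to
    each resulting fragment. cut_sites_kb must be sorted positions."""
    kept = [c for c in cut_sites_kb if random.random() > p_miss]
    bounds = [0.0] + kept + [length_kb]
    frags = [b - a for a, b in zip(bounds, bounds[1:])]
    return [max(0.0, random.gauss(f, sd_coeff * f ** 0.5)) for f in frags]
```

With p_miss = 0 and sd_coeff = 0 the function returns the exact restriction fragments, which makes the two error sources easy to toggle independently when validating an aligner.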
Most methods were designed for less noisy data but in theory could address all the error types present in the data. For methods with tunable parameters, we tried aligning the E. coli Rmaps with combinations of parameters for each method related to its alignment score thresholds and error model parameters. We used parameterizations giving results similar to those for the default parameters of the method of Valouev et al. [12], to the extent that such parameters did not significantly increase the running time of each method. These same parameterizations were used in the next section on plum data.
Even with tuning, we were unable to obtain pairwise alignments on E. coli for two methods. We found OPTIMA only produced self alignments with its recommended overlap protocol and report its resource use in Table 1. For MalignerIX, even when we relaxed the parameters to account for the greater sizing error and mismatched cut site frequency, it was also only able to find self alignments. This is expected, as by design it only allows missing sites in one sequence in order to run faster. Thus no further testing was performed with MalignerIX or OPTIMA. We did not test SOMA [13] as an earlier investigation indicated it would not scale to larger genomes [14]. We omitted TWIN [14] as it requires all cut sites to match. With tuning, Kohdista, Maligner, the method of Valouev et al. [12], RefAligner, and OMBlast produced reasonable results on the E. coli data. Results for the best combinations of parameters encountered during tuning can be seen in Figs. 2 and 3. We observed that most methods could find more ground truth alignments as their parameters were relaxed, at the expense of additional false positives, as illustrated in these figures. However, only the method of Valouev et al. and Kohdista approached recall of all ground truth alignments.
Table 1 shows the results for Kohdista and the competing methods with parameters optimized to try to match those of Valouev et al. [12], as well as results using Kohdista with a more permissive parameter setting. All results include both indexing and search time. Kohdista took two seconds to build its data structures. Again, Kohdista uses Chi-squared and binomial CDF thresholds to prune the backtracking search when deciding whether to extend alignments to progressively longer alignments. More permissive match criteria, using higher thresholds, allow more Rmaps to be reached in the search and thus to be considered aligned, but they also result in less aggressive pruning of the search, thus lengthening runtime. As an example, note that when Kohdista was configured with a much relaxed Chi-squared CDF threshold of 0.5 and a binomial CDF threshold of 0.7, it found 3,925 of the 4,305 (91%) ground truth alignments, but slowed down considerably. This illustrates the index and algorithm's capability of handling all error types while achieving high recall.
Performance on plum Rmap data
The Beijing Forestry University and other institutes assembled the first plum (Prunus mume) genome using short reads and optical mapping data from OpGen Inc. We tested the various available alignment methods on the 139,281 plum Rmaps from June 2011 available in the GigaScience repository. These Rmaps were created with the BamHI enzyme and have a coverage depth of 135× of the 280 Mbp genome. For the plum dataset, we ran all the methods that approach the statistical performance of the method of Valouev et al. [12] when measured on E. coli. Thus, we omitted MalignerIX and OPTIMA because they had 0% recall and precision on E. coli. Our results on this plum dataset are summarized in Table 2.
Kohdista was the fastest and obtained more alignments than the competing methods. When configured with a Chi-squared CDF threshold of 0.02 and a binomial CDF threshold of 0.5, it took 31 h of CPU time to test all Rmaps for pairwise alignments in the plum Rmap data. This represents a 21× speedup over the 678 h taken by the dynamic programming method of Valouev et al. [12]. MalignerDP and OMBlast took 214 h and 151 h, respectively. Hence, Kohdista has a 6.9× and 4.8× speedup over MalignerDP and OMBlast, respectively. All methods used less than 13 GB of RAM and thus were considered practical from a memory perspective. Kohdista took 11 min to build its data structures for the plum data.
To measure the quality of the alignments, we scored each pairwise alignment using the scoring function of Valouev et al. [12] and present histograms of these alignment scores in Fig. 4. For comparison, we also scored and present the histogram for random pairs of Rmaps. The method of Valouev et al. [12] produces very few but high-scoring alignments, and although it could theoretically be altered to produce a larger number of alignments, its running time makes this prospect impractical (678 h). Although Kohdista and RefAligner both produce high-quality alignments, RefAligner produced very few alignments (10,039) and required almost 5× more time to do so. OMBlast and Maligner required significantly more time and produced significantly lower quality alignments.
Conclusions
In this paper, we demonstrate how finding pairwise alignments in Rmap data can be modelled as approximate-path matching in a directed acyclic graph, and how combining the GCSA with the wavelet tree yields an index-based data structure for solving this problem. We implemented this method and present results comparing Kohdista with competing methods. By demonstrating results on both simulated E. coli Rmap data and real plum Rmaps, we show that Kohdista is capable of detecting high scoring alignments efficiently. In particular, Kohdista detected the largest number of alignments in 31 h. RefAligner, a proprietary method, produced very few high scoring alignments (10,039) and required almost 5× more time to do so. OMBlast and Maligner required significantly more time and produced significantly lower quality alignments. The method of Valouev et al. [12] produced high scoring alignments but required more than 21× the time to do so.
Availability and requirements
Project name: Kohdista.
Project home page: https://github.com/mmuggli/KOHDISTA.
Operating system(s): Developed for 32-bit and 64-bit Linux/Unix environments.
Programming language: C/C++.
Other requirements: GCC 4.2 or newer.
License: MIT license.
Any restrictions to use by non-academics: None.
Availability of data and materials
Kohdista is available at https://github.com/mmuggli/KOHDISTA/. No original data were acquired for this research. The simulated E. coli data generated and analysed during the current study are available at https://github.com/mmuggli/KOHDISTA. The plum (Prunus mume) dataset used in this research was acquired from the GigaScience repository http://gigadb.org/dataset/view/id/100084/File_sort/size.
Abbreviations
 DAG:
directed acyclic graph
 SA:
suffix array
 GCSA:
generalized compressed suffix array
 BWT:
Burrows–Wheeler Transform
References
Dimalanta ET, Lim A, Runnheim R, Lamers C, Churas C, Forrest DK, de Pablo JJ, Graham MD, Coppersmith SN, Goldstein S, Schwartz DC. A microfluidic system for large DNA molecule arrays. Anal Chem. 2004;76(18):5293–301.
Bionano Genomics Inc. Bionano Genomics launches Irys, a novel platform for complex human genome analysis; 2012. https://bionanogenomics.com/pressreleases/bionanogenomicslaunchesirysanovelplatformforcomplexhumangenomeanalysis/.
Reslewic S, et al. Wholegenome shotgun optical mapping of Rhodospirillum rubrum. Appl Environ Microbiol. 2005;71(9):5511–22.
Zhou S, et al. A wholegenome shotgun optical map of Yersinia pestis strain KIM. Appl Environ Microbiol. 2002;68(12):6321–31.
Zhou S, et al. Shotgun optical mapping of the entire Leishmania major Friedlin genome. Mol Biochem Parasitol. 2004;138(1):97–106.
Chamala S, et al. Assembly and validation of the genome of the nonmodel basal angiosperm amborella. Science. 2013;342(6165):1516–7.
Dong Y, et al. Sequencing and automated wholegenome optical mapping of the genome of a domestic goat ( capra hircus). Nat Biotechnol. 2013;31(2):136–41.
Verzotto D, et al. Optima: sensitive and accurate wholegenome alignment of errorprone genomic maps by combinatorial indexing and technologyagnostic statistical analysis. GigaScience. 2016;5(1):2.
Leung AK, Kwok TP, Wan R, Xiao M, Kwok PY, Yip KY, Chan TF. OMBlast: alignment tool for optical mapping using a seedandextend approach. Bioinformatics. 2017;33(3):311–9.
Mendelowitz LM, Schwartz DC, Pop M. Maligner: a fast ordered restriction map aligner. Bioinformatics. 2016;32(7):1016–22.
Valouev A, Li L, Liu YC, Schwartz DC, Yang Y, Zhang Y, Waterman MS. Alignment of optical maps. J Comput Biol. 2006;13(2):442–62.
Valouev A, et al. An algorithm for assembly of ordered restriction maps from single DNA molecules. Proc Natl Acad Sci. 2006;103(43):15770–5.
Nagarajan N, Read TD, Pop M. Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics. 2008;24(10):1229–35.
Muggli MD, Puglisi SJ, Boucher C. Efficient indexed alignment of contigs to optical maps. In: Proceedings of the 14th workshop on algorithms in bioinformatics (WABI). Berlin: Springer; 2014. p. 68–81.
Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–60.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.
Gagie T, Navarro G, Puglisi SJ. New algorithms on wavelet trees and applications to information retrieval. Theor Comput Sci. 2012;426/427:25–41.
Sirén J, Välimäki N, Mäkinen V. Indexing graphs for path queries with applications in genome research. IEEE/ACM Trans Comput Biol Bioinformatics. 2014;11(2):375–88.
Manber U, Myers GW. Suffix arrays: a new method for on-line string searches. SIAM J Comput. 1993;22(5):935–48.
Burrows M, Wheeler DJ. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California. 1994.
Ferragina P, Manzini G. Indexing compressed text. J ACM. 2005;52(4):552–81.
Gog S, Beller T, Moffat A, Petri M. From theory to practice: plug and play with succinct data structures. In: Proceedings of the 13th international symposium on experimental algorithms, (SEA). 2014. p. 326–37.
Acknowledgements
The authors would like to thank Jouni Sirén for many insightful conversations concerning the GCSA.
Funding
MDM, SJP, and CB were funded by the National Science Foundation (1618814). SJP was also supported in part by the Academy of Finland via Grant Number 294143.
Author information
Authors and Affiliations
Contributions
SJP, MDM, and CB conceived of the concept and designed the algorithm and data structures for the methods described in this paper. MDM and CB designed the experiments. MDM implemented the method, and performed all experiments and software testing. MDM and CB drafted the manuscript. All authors read and edited the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
A preliminary version appeared in the proceedings of WABI 2018
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Muggli, M.D., Puglisi, S.J. & Boucher, C. Kohdista: an efficient method to index and query possible Rmap alignments. Algorithms Mol Biol 14, 25 (2019). https://doi.org/10.1186/s1301501901609
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1301501901609