
Orientation of Ordered Scaffolds

  • Sergey Aganezov
  • Max A. Alekseyev
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10562)

Abstract

Despite the recent progress in genome sequencing and assembly, many of the currently available assembled genomes come in a draft form. Such draft genomes consist of a large number of genomic fragments (scaffolds), whose order and/or orientation (i.e., strand) in the genome are unknown. There exist various scaffold assembly methods, which attempt to determine the order and orientation of scaffolds along the genome chromosomes. Some of these methods (e.g., based on FISH physical mapping, chromatin conformation capture, etc.) can infer the order of scaffolds, but not necessarily their orientation. This leads to a special case of the scaffold orientation problem (i.e., deducing the orientation of each scaffold) with a known order of the scaffolds.

We address the problem of orientation of ordered scaffolds as an optimization problem based on given weighted orientations of scaffolds and their pairs (e.g., coming from paired-end sequencing reads, long reads, or homologous relations). We formalize this problem within the earlier introduced framework for comparative analysis and merging of scaffold assemblies (CAMSA). We prove that this problem is \(\mathsf {NP}\)-hard, and further present a polynomial-time algorithm for solving its special case, where the orientation of each scaffold is imposed relative to at most two other scaffolds. This lays the foundation for a follow-up FPT algorithm for the general case. The proposed algorithms are implemented in the CAMSA software version 2.

Keywords

Genome assembly, Genome scaffolding, Scaffold orientation, Computational complexity, Algorithms

1 Introduction

While genome sequencing technologies are constantly evolving, they are still unable to read at once complete genomic sequences from organisms of interest. Instead, they produce a large number of rather short genomic fragments, called reads, originating from unknown locations and strands of the genome. The problem then becomes to assemble the reads into the complete genome. Existing genome assemblers usually assemble reads based on their overlap patterns and produce longer genomic fragments, called contigs, which are typically interweaved with highly polymorphic and/or repetitive regions in the genome. Contigs are further assembled into scaffolds, i.e., sequences of contigs interspaced with gaps.1 Assembling scaffolds into larger scaffolds (ideally representing complete chromosomes) is called the scaffold assembly problem.

The scaffold assembly problem is known to be \(\mathsf {NP}\)-hard [13, 16, 22, 28, 34], but there still exists a number of methods that use heuristic and/or exact algorithmic approaches to address it. The scaffold assembly problem consists of two subproblems:
  1. determine the order of scaffolds (scaffold order problem); and
  2. determine the orientation (i.e., strand of origin) of scaffolds (scaffold orientation problem).

Some methods attempt to solve these subproblems jointly by using various types of additional data including jumping libraries [10, 14, 19, 20, 24, 26, 31], long error-prone reads [5, 6, 11, 25, 33], homology relationships between genomes [1, 3, 4, 23], etc. Other methods (typically based on wet-lab experiments [12, 21, 27, 29, 30, 32]) can often reliably reconstruct the order of scaffolds, but may fail to impose their orientation.

The scaffold orientation problem is also known to be \(\mathsf {NP}\)-hard [9, 22]. Since the scaffold order problem can often be reliably solved with wet-lab based methods, this inspires us to consider the special case of the scaffold orientation problem with the given order of scaffolds, which we refer to as the orientation of ordered scaffolds (OOS) problem. We formulate the OOS as an optimization problem based on given weighted orientations of scaffolds and their pairs (e.g., coming from paired-end sequencing reads, long reads, or homologous relations) and further address it within the previously introduced CAMSA framework [2] for comparative analysis and merging of scaffold assemblies. We prove that the OOS is \(\mathsf {NP}\)-hard both in the case of linear genomes and in the case of circular genomes. We also present a polynomial-time algorithm for solving the special case of the OOS, where the orientation of each scaffold is imposed relative to at most two other scaffolds.

2 Background

We start with a brief description of the CAMSA framework, which provides a unifying way to represent scaffold assemblies obtained by different methods.

2.1 CAMSA Framework

We represent an assembly of scaffolds from a set \(\mathbb {S}= \{s_i\}_{i=1}^n\) as a set of assembly points. Each assembly point is formed by an adjacency between two scaffolds from \(\mathbb {S}\). Namely, an assembly point \(p = (s_i, s_j)\) tells that the scaffolds \(s_i\) and \(s_j\) are adjacent in the assembly. Additionally, we may know the orientation of either or both of the scaffolds and thus distinguish between three types of assembly points:
  1. p is oriented if the orientation of both scaffolds \(s_i\) and \(s_j\) is known;
  2. p is semi-oriented if the orientation of only one scaffold among \(s_i\) and \(s_j\) is known;
  3. p is unoriented if the orientation of neither of \(s_i\) and \(s_j\) is known.

We denote the known orientation of scaffolds in assembly points by overhead arrows, where the right arrow corresponds to the original genomic sequence representing a scaffold, while the left arrow corresponds to the reverse complement of this sequence. For example, \((\overrightarrow{s_i}, \overleftarrow{s_j})\), \((\overrightarrow{s_i}, s_j)\), and \((s_i, s_j)\) are oriented, semi-oriented, and unoriented assembly points, respectively. We remark that assembly points \((\overrightarrow{s_i}, \overrightarrow{s_j})\) and \((\overleftarrow{s_j}, \overleftarrow{s_i})\) represent the same adjacency between oriented scaffolds; to make this representation unique we will require that in all assembly points \((s_i, s_j)\) we have \(i<j\). Another way to represent the orientation of the scaffolds in an assembly point is by using superscripts h and t denoting the head and tail extremities of the scaffold’s genomic sequence, e.g., \((\overrightarrow{s_i}, \overrightarrow{s_j})\) can also be written as \((s_i^h, s_j^t)\).

We will need an auxiliary function \({{\mathrm{{sn}}}}(p,i)\) defined on an assembly point p and an index \(i\in \{1, 2\}\) that returns the scaffold corresponding to the component i of p (e.g., \({{\mathrm{{sn}}}}((\overrightarrow{s_i}, \overrightarrow{s_j}), 2) = s_j\)).

We define a realization of an assembly point p as any oriented assembly point that can be obtained from p by orienting the unoriented scaffolds. We denote the set of realizations of p by \({{\mathrm{\text {R}}}}(p)\). If p is oriented, then it has a single realization equal to p itself (i.e., \({{\mathrm{\text {R}}}}(p)=\{p\}\)); if p is semi-oriented, then it has two realizations (i.e., \(|{{\mathrm{\text {R}}}}(p)| = 2\)); and if p is unoriented, then it has four realizations (i.e., \(|{{\mathrm{\text {R}}}}(p)| = 4\)). For example,
$$\begin{aligned} {{\mathrm{\text {R}}}}((s_i, s_j)) = \left\{ (\overrightarrow{s_i}, \overrightarrow{s_j}), (\overrightarrow{s_i}, \overleftarrow{s_j}), (\overleftarrow{s_i}, \overrightarrow{s_j}), (\overleftarrow{s_i}, \overleftarrow{s_j})\right\} . \end{aligned} \tag{1}$$
An assembly point p is called a refinement of an assembly point q if \({{\mathrm{\text {R}}}}(p)\subset {{\mathrm{\text {R}}}}(q)\). From now on, we assume that no assembly point in a given assembly is a refinement of another assembly point (otherwise we simply discard the latter assembly point as less informative). We refer to an assembly with no assembly point refinements as a proper assembly.
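
The definitions of realizations and refinements can be illustrated with a minimal Python sketch. The tuple encoding `(s1, o1, s2, o2)` with orientations `'+'`, `'-'`, or `None`, and the function names, are assumptions made for illustration only and are not part of CAMSA. The sketch also assumes that \(\subset \) denotes a proper subset, so that no assembly point refines itself.

```python
from itertools import product

# Hypothetical encoding: an assembly point is (s1, o1, s2, o2), where each
# orientation is '+' (right arrow), '-' (left arrow), or None (unknown).

def realizations(p):
    """Return R(p): all oriented assembly points obtainable from p."""
    s1, o1, s2, o2 = p
    first = ['+', '-'] if o1 is None else [o1]
    second = ['+', '-'] if o2 is None else [o2]
    return {(s1, a, s2, b) for a, b in product(first, second)}

def is_refinement(p, q):
    """True if R(p) is a proper subset of R(q) (assumed reading of the definition)."""
    return realizations(p) < realizations(q)

unoriented = ('s1', None, 's2', None)
semi_oriented = ('s1', '+', 's2', None)
oriented = ('s1', '+', 's2', '-')
```

Under this encoding, `realizations(unoriented)` contains the four oriented points listed in (1), and both `semi_oriented` and `oriented` are refinements of `unoriented`.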

For a given assembly \(\mathbb {A}\) we will use subscripts u/s/o to denote the sets of unoriented/semi-oriented/oriented assembly points in \(\mathbb {A}\) (e.g., \(\mathbb {A}_u\subset \mathbb {A}\) is the set of all unoriented assembly points from \(\mathbb {A}\)). We also denote by \(\mathbb {S}(\mathbb {A})\) the set of scaffolds appearing in the assembly points from \(\mathbb {A}\).

We call two assembly points overlapping if they involve the same scaffold, and further call them conflicting if they involve the same extremity of this scaffold. We generalize this notion for semi-oriented and unoriented assembly points: two assembly points p and q are conflicting if all pairs of their realizations \((p', q')\in {{\mathrm{\text {R}}}}(p)\times {{\mathrm{\text {R}}}}(q)\) are conflicting. If some, but not all, pairs of the realizations are conflicting, p and q are called semi-conflicting. Otherwise, p and q are called non-conflicting.

We extend the notion of non-/semi-conflictness to entire assemblies as follows. A scaffold assembly \(\mathbb {A}\) is non-conflicting if all pairs of assembly points in it are non-conflicting, and \(\mathbb {A}\) is semi-conflicting if all pairs of assembly points are non-conflicting or semi-conflicting with at least one pair being semi-conflicting.
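
The classification of a pair of assembly points can be sketched as follows. The encoding `(s1, o1, s2, o2)` and all function names are hypothetical. The extremity rule follows the convention that \((\overrightarrow{s_i}, \overrightarrow{s_j})\) is the same as \((s_i^h, s_j^t)\): a forward scaffold in the first position contributes its head, and a forward scaffold in the second position contributes its tail.

```python
from itertools import product

def realizations(p):
    """R(p) for a point (s1, o1, s2, o2) with orientations '+', '-', or None."""
    s1, o1, s2, o2 = p
    first = ['+', '-'] if o1 is None else [o1]
    second = ['+', '-'] if o2 is None else [o2]
    return {(s1, a, s2, b) for a, b in product(first, second)}

def extremities(p):
    """Extremities used by an oriented point, e.g. ('s1', 'h') and ('s2', 't')."""
    s1, o1, s2, o2 = p
    return {(s1, 'h' if o1 == '+' else 't'), (s2, 't' if o2 == '+' else 'h')}

def classify(p, q):
    """Classify two distinct assembly points by how many realization pairs clash."""
    pairs = [(a, b) for a in realizations(p) for b in realizations(q)]
    clashes = sum(1 for a, b in pairs if extremities(a) & extremities(b))
    if clashes == len(pairs):
        return 'conflicting'
    return 'semi-conflicting' if clashes else 'non-conflicting'
```

For instance, two unoriented points sharing one scaffold are semi-conflicting under this sketch, because only some choices of orientation make them use the same extremity of the shared scaffold.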

2.2 Assembly Realizations

For an assembly \(\mathbb {A}= \{p_i\}_{i=1}^k\), an assembly \(\mathbb {A}' = \{q_i\}_{i=1}^k\) is called a realization 2 of \(\mathbb {A}\) if there exists a permutation \(\pi \) of order k such that \(q_{\pi _i}\in {{\mathrm{\text {R}}}}(p_i)\) for all \(i=1,2,\dots ,k\). We denote by \({{\mathrm{\text {R}}}}(\mathbb {A})\) the set of realizations of assembly \(\mathbb {A}\), and by \({{\mathrm{{NR}}}}(\mathbb {A})\) the set of non-conflicting realizations among them.

We define the scaffold assembly graph \(\mathsf {SAG} (\mathbb {A})\) on the set of vertices \(\{s^h, s^t:\ s\in \mathbb {S}(\mathbb {A})\}\) and edges of two types: directed edges \((s^t, s^h)\) that encode scaffolds from \(\mathbb {S}(\mathbb {A})\), and undirected edges that encode all possible realizations of all assembly points in \(\mathbb {A}\) (Fig. 1a). We further define the order (multi)graph \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) formed by the set of vertices \(\mathbb {S}(\mathbb {A})\) and the set of undirected edges \(\{\{{{\mathrm{{sn}}}}(p,1), {{\mathrm{{sn}}}}(p,2)\}:\ p\in \mathbb {A}\}\) (Fig. 1b). The order graph can also be obtained from \(\mathsf {SAG} (\mathbb {A})\) by first contracting the directed edges, and then by substituting all edges that encode realizations of the same assembly point with a single edge (Fig. 1b).
Fig. 1.

For an assembly \(A = \{(s_1, \overrightarrow{s_2}), (\overrightarrow{s_1}, \overrightarrow{s_2}), (\overrightarrow{s_2}, \overrightarrow{s_3}), (\overrightarrow{s_3}, s_4), (\overleftarrow{s_1}, \overleftarrow{s_4}), (\overrightarrow{s_5}, s_6)\), \((\overleftarrow{s_6}, \overrightarrow{s_7}), (\overrightarrow{s_6}, s_7)\}\), (a) the scaffold assembly graph \(\mathsf {SAG} (A)\), where semi-oriented assembly points, oriented assembly points, and scaffolds are represented by dashed red edges, solid red edges, and directed black edges, respectively. (b) The order graph \({{\mathrm{\mathsf {OG}}}}(A)\). (c) The contracted order graph \({{\mathrm{\mathsf {COG}}}}(A)\). (Color figure online)

Lemma 1

For a non-conflicting realization \(\mathbb {A}'\) of an assembly \(\mathbb {A}\), \({{\mathrm{\mathsf {OG}}}}(\mathbb {A}')\) is non-branching, i.e., \(\deg (v)\le 2\) for all vertices v of \({{\mathrm{\mathsf {OG}}}}(\mathbb {A}')\).3

Proof

Each vertex v in \({{\mathrm{\mathsf {OG}}}}(\mathbb {A}')\) represents a scaffold, which has two extremities and thus can participate in at most two non-conflicting assembly points in \(\mathbb {A}'\). Hence, \(\deg (v)\le 2\).    \(\square \)
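
The order graph and the degree condition of Lemma 1 can be sketched in a few lines of Python. The tuple encoding `(s1, o1, s2, o2)` and the function names are assumptions for illustration; parallel edges are kept by storing neighbours in lists, since \({{\mathrm{\mathsf {OG}}}}\) is a multigraph.

```python
from collections import defaultdict

def order_graph(assembly):
    """Adjacency lists of the order multigraph OG(A); orientations are ignored."""
    adj = defaultdict(list)
    for s1, _o1, s2, _o2 in assembly:
        adj[s1].append(s2)
        adj[s2].append(s1)
    return dict(adj)

def is_non_branching(adj):
    """True if every vertex has degree at most 2."""
    return all(len(neighbours) <= 2 for neighbours in adj.values())

# A chain of three scaffolds and a star centred at s1 with three leaves.
path = [('s1', '+', 's2', '+'), ('s2', '+', 's3', '+')]
star = [('s1', None, f's{i + 1}', None) for i in range(1, 4)]
```

Here `path` yields a non-branching order graph, whereas the centre of `star` has degree 3.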

We notice that any non-conflicting realization \(\mathbb {A}'\) of an assembly \(\mathbb {A}\) provides orientation for all scaffolds involved in each connected component of \(\mathsf {SAG} (\mathbb {A}')\) (as well as of \({{\mathrm{\mathsf {OG}}}}(\mathbb {A}')\)) relatively to each other.

Theorem 2

An assembly \(\mathbb {A}\) has at least one non-conflicting realization (i.e., \(|{{\mathrm{{NR}}}}(\mathbb {A})|\ge 1\)) if and only if \(\mathbb {A}\) is non-conflicting or semi-conflicting and \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) is non-branching.

Proof

Suppose that \(|{{\mathrm{{NR}}}}(\mathbb {A})|\ge 1\) and pick any \(\mathbb {A}'\in {{\mathrm{{NR}}}}(\mathbb {A})\). Then for every pair of assembly points \(p,q\in \mathbb {A}\), their realizations in \(\mathbb {A}'\) are non-conflicting, implying that p and q are either non-conflicting or semi-conflicting. Hence, \(\mathbb {A}\) is non-conflicting or semi-conflicting. Since \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})={{\mathrm{\mathsf {OG}}}}(\mathbb {A}')\) and \(\mathbb {A}'\) is non-conflicting, Lemma 1 implies that \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) is non-branching.

Vice versa, suppose that \(\mathbb {A}\) is non-conflicting or semi-conflicting and \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) is non-branching. To prove that \(|{{\mathrm{{NR}}}}(\mathbb {A})|\ge 1\), we will orient unoriented scaffolds in all assembly points in \(\mathbb {A}\) without creating conflicts. Every scaffold s corresponds to a vertex v in \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) of degree at most 2. If \(\deg (v)=1\), then s participates in one assembly point p, and s is either already oriented in p or we pick an arbitrary orientation for it. If \(\deg (v)=2\), then s participates in two overlapping assembly points p and q. If s is not oriented in either of p, q, we pick an arbitrary orientation for it consistently across p and q (i.e., keeping them non-conflicting). If s is oriented in exactly one assembly point, we orient the unoriented instance of s consistently with its orientation in the other assembly point. Since conflicts may appear only between assembly points that share a vertex in \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\), the constructed orientations produce no new conflicts. On the other hand, the scaffolds that are already oriented in \(\mathbb {A}\) impose no conflicts since \(\mathbb {A}\) is non-conflicting or semi-conflicting. Hence, the resulting oriented assembly points form a non-conflicting assembly from \({{\mathrm{{NR}}}}(\mathbb {A})\), i.e., \(|{{\mathrm{{NR}}}}(\mathbb {A})|\ge 1\).    \(\square \)

We remark that if \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) is branching, the assembly \(\mathbb {A}\) may be semi-conflicting but have \(|{{\mathrm{{NR}}}}(\mathbb {A})|=0\). An example is given by \(\mathbb {A}=\{(s_1,s_{i+1})\}_{i=1}^k\) with \(k>2\), which contains no conflicting pairs of assembly points (in fact, all pairs of assembly points in \(\mathbb {A}\) are semi-conflicting), but \(|{{\mathrm{{NR}}}}(\mathbb {A})|=0\).

For an assembly \(\mathbb {A}\) with \(|{{\mathrm{{NR}}}}(\mathbb {A})|\ge 1\), the orientation of some scaffolds from \(\mathbb {S}(\mathbb {A})\) does not depend on the choice of a realization from \({{\mathrm{{NR}}}}(\mathbb {A})\) (we denote the set of such scaffolds by \(\mathbb {S}_o(\mathbb {A})\)), while the orientation of other scaffolds within some assembly points varies across realizations from \({{\mathrm{{NR}}}}(\mathbb {A})\) (we denote the set of such scaffolds by \(\mathbb {S}_u(\mathbb {A})\)). Trivially, we have \(\mathbb {S}_u(\mathbb {A})\cup \mathbb {S}_o(\mathbb {A})=\mathbb {S}(\mathbb {A})\). It can be easily seen that the set \(\mathbb {S}_u(\mathbb {A})\) is formed by the scaffolds for which the orientation in the proof of Theorem 2 was chosen arbitrarily, implying the following statement.

Corollary 3

For a given assembly \(\mathbb {A}\) with \(|{{\mathrm{{NR}}}}(\mathbb {A})|\ge 1\), we have \(|{{\mathrm{{NR}}}}(\mathbb {A})|=2^{|\mathbb {S}_u(\mathbb {A})|}\).

Lemma 4

Testing whether a given assembly \(\mathbb {A}\) has a non-conflicting realization can be done in \(\mathcal {O}\left( k\cdot \log (k)\right) \) time, where \(k=|\mathbb {S}(\mathbb {A})|\).

Proof

To test whether \(\mathbb {A}\) has a non-conflicting realization, we first create a hash table indexed by \(\mathbb {S}(\mathbb {A})\) that for every scaffold \(s\in \mathbb {S}(\mathbb {A})\) will contain a list of assembly points that involve s. We iterate over all assembly points \(p\in \mathbb {A}\) and add p to two lists in the hash table indexed by the scaffolds participating in p. If the length of some list becomes greater than 2, then \(\mathbb {A}\) is conflicting and we stop. If we successfully complete the iterations, then every scaffold from \(\mathbb {S}(\mathbb {A})\) participates in at most two assembly points in \(\mathbb {A}\), and thus we made \(\mathcal {O}\left( k\right) \) steps of \(\mathcal {O}\left( \log (k)\right) \) time each.

Next, for every scaffold whose list in the hash table has length 2, we check whether the corresponding assembly points are either non-conflicting or semi-conflicting. If not, then \(\mathbb {A}\) is conflicting and we stop. If the check completes successfully, then \(\mathbb {A}\) has a non-conflicting realization by Theorem 2. The check takes \(\mathcal {O}\left( k\right) \) steps of \(\mathcal {O}\left( \log (k)\right) \) time each, and thus the total running time comes to \(\mathcal {O}\left( k\cdot \log (k)\right) \).    \(\square \)

A pseudocode for the test described in the proof of Lemma 4 is given in Algorithm 2 in the Appendix.
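
For illustration, the test from the proof of Lemma 4 can also be sketched in Python. This is not the Appendix pseudocode: the tuple encoding `(s1, o1, s2, o2)` and all function names are assumptions, and a Python `dict` plays the role of the hash table, giving expected constant-time lookups rather than the \(\mathcal {O}\left( \log (k)\right) \) bound used in the proof.

```python
from collections import defaultdict
from itertools import product

def realizations(p):
    """R(p) for a point (s1, o1, s2, o2) with orientations '+', '-', or None."""
    s1, o1, s2, o2 = p
    first = ['+', '-'] if o1 is None else [o1]
    second = ['+', '-'] if o2 is None else [o2]
    return {(s1, a, s2, b) for a, b in product(first, second)}

def extremities(p):
    """Extremities used by an oriented point: (s_i ->, s_j ->) is (s_i^h, s_j^t)."""
    s1, o1, s2, o2 = p
    return {(s1, 'h' if o1 == '+' else 't'), (s2, 't' if o2 == '+' else 'h')}

def conflicting(p, q):
    """True if every pair of realizations of p and q shares an extremity."""
    return all(extremities(a) & extremities(b)
               for a in realizations(p) for b in realizations(q))

def has_nonconflicting_realization(assembly):
    """Check the two conditions of Theorem 2 for a proper assembly."""
    by_scaffold = defaultdict(list)
    for p in assembly:
        for s in (p[0], p[2]):
            by_scaffold[s].append(p)
            if len(by_scaffold[s]) > 2:      # OG(A) is branching at s
                return False
    for points in by_scaffold.values():
        if len(points) == 2 and conflicting(points[0], points[1]):
            return False
    return True
```

The first loop rejects branching order graphs as soon as a scaffold appears in a third assembly point; the second loop inspects only pairs of points that share a scaffold, since no other pairs can conflict.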

Lemma 5

For a given assembly \(\mathbb {A}\) with \(|{{\mathrm{{NR}}}}(\mathbb {A})|\ge 1\), the set \(\mathbb {S}_u(\mathbb {A})\) can be computed in \(\mathcal {O}\left( k\cdot \log (k)\right) \) time, where \(k=|\mathbb {S}(\mathbb {A})|\).

Proof

We will construct the set \(S = \mathbb {S}_u(\mathbb {A})\) iteratively. Initially we let \(S=\emptyset \). Following the algorithm described in the proof for Lemma 4, we construct a hash table that for every scaffold \(s\in \mathbb {S}(\mathbb {A})\) contains a list of assembly points that involve s (which takes \(\mathcal {O}\left( k\cdot \log (k)\right) \) time). Then for every \(s\in \mathbb {S}(\mathbb {A})\), we check if either of the corresponding assembly points provides an orientation for s; if not, we add s to S. This check for each scaffold takes \(\mathcal {O}\left( 1\right) \) time, bringing the total running time to \(\mathcal {O}\left( k\cdot \log (k)\right) .\)    \(\square \)

A pseudocode for the computation of \(\mathbb {S}_u(\mathbb {A})\) described in the proof of Lemma 5 is given in Algorithm 3 in the Appendix.
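
A compact Python sketch of the same idea is given below, together with the count from Corollary 3. It is not the Appendix pseudocode; the tuple encoding `(s1, o1, s2, o2)` and the function names are hypothetical, and the input is assumed to satisfy \(|{{\mathrm{{NR}}}}(\mathbb {A})|\ge 1\), which is not re-checked here.

```python
def unoriented_scaffolds(assembly):
    """Return S_u(A): scaffolds that no assembly point of A orients."""
    seen, oriented = set(), set()
    for s1, o1, s2, o2 in assembly:
        for scaffold, orientation in ((s1, o1), (s2, o2)):
            seen.add(scaffold)
            if orientation is not None:
                oriented.add(scaffold)
    return seen - oriented

def count_nonconflicting_realizations(assembly):
    """|NR(A)| = 2^{|S_u(A)|} by Corollary 3, assuming |NR(A)| >= 1."""
    return 2 ** len(unoriented_scaffolds(assembly))

# s2 is oriented in the first point only; s1 and s3 are never oriented.
example = [('s1', None, 's2', '+'), ('s2', None, 's3', None)]
```

In `example`, the orientation of `s2` is forced by the first assembly point, so only `s1` and `s3` can be oriented freely.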

3 Orientation of Ordered Scaffolds

For a non-conflicting assembly \(\mathbb {A}\) composed only of oriented assembly points, an assembly point p on scaffolds \(s_i, s_j\in \mathbb {S}(\mathbb {A})\) has a consistent orientation with \(\mathbb {A}\) if for some \(p'\in {{\mathrm{\text {R}}}}(p)\) there exists a path connecting edges \(s_i\) and \(s_j\) in \(\mathsf {SAG} (\mathbb {A})\) such that direction of edges \(s_i\) and \(s_j\) at the path ends is consistent with \(p'\) (e.g., in Fig. 1a, the assembly point \((\overrightarrow{s_1}, \overrightarrow{s_3})\) has a consistent orientation with the assembly A).

We formulate the orientation of ordered scaffolds problem as follows.

Problem 1 (Orientation of Ordered Scaffolds, OOS)

Let \(\mathbb {A}\) be an assembly and \(\mathbb {O}\) be a set4 of assembly points such that \(|{{\mathrm{{NR}}}}(\mathbb {A})|\ge 1\) and \(\mathbb {S}(\mathbb {O})\subset \mathbb {S}(\mathbb {A})\). Find a non-conflicting realization \(\mathbb {A}'\in {{\mathrm{{NR}}}}(\mathbb {A})\) that maximizes the number (total weight) of assembly points from \(\mathbb {O}\) having consistent orientations with \(\mathbb {A}'\).

From the biological perspective, the OOS can be viewed as a formalization of the case where (sub)orders of scaffolds have been determined (which defines \(\mathbb {A}\)), while there exists some information (possibly coming from different sources and conflicting) about their relative orientation (which defines \(\mathbb {O}\)). The OOS asks to orient unoriented scaffolds in the given scaffold orders in a way that is most consistent with the given orientation information.
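
To make the objective concrete, the following brute-force sketch solves small OOS instances in which \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) is a single path. It rests on simplifying assumptions that are ours: scaffolds are numbered `0..n-1` along the path, `fixed` holds orientations already determined by \(\mathbb {A}\), each point of \(\mathbb {O}\) is a tuple `(i, oi, j, oj, weight)` with `i < j`, and such a point is counted as consistent when every orientation it specifies matches the chosen one. The running time is exponential in the number of free scaffolds, so this is a reference check rather than a practical algorithm.

```python
from itertools import product

def solve_oos_path_bruteforce(n, fixed, points):
    """Return (best_weight, orientations) over all orientations of free scaffolds."""
    free = [i for i in range(n) if i not in fixed]
    best_weight, best_orient = -1, None
    for choice in product('+-', repeat=len(free)):
        orient = dict(fixed)
        orient.update(zip(free, choice))
        # A point is satisfied when its specified orientations match the choice;
        # None means the point leaves that scaffold unoriented.
        weight = sum(w for i, oi, j, oj, w in points
                     if oi in (None, orient[i]) and oj in (None, orient[j]))
        if weight > best_weight:
            best_weight, best_orient = weight, orient
    return best_weight, best_orient

points = [(0, '+', 1, '-', 1), (1, '-', 2, '+', 1), (0, '-', 2, '+', 1)]
best_weight, best_orient = solve_oos_path_bruteforce(3, {}, points)
```

In this toy instance the first and third points disagree on the orientation of scaffold 0, so at most two of the three points can be satisfied simultaneously.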

We also remark that the OOS can be viewed as a fine-grained variant of the scaffold orientation problem studied in [9]. In our terminology, the latter problem concerns an artificial circular genome \(\mathbb {A}\) formed by the given scaffolds in an arbitrary order (so that there is a path connecting any scaffold or its reverse complement to any other scaffold in \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\)), and \(\mathbb {O}\) formed by unordered pairs of scaffolds supplemented with the binary information on whether each such pair comes from the same or different strands of the genome. In contrast, in the OOS, the assembly \(\mathbb {A}\) is given and \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) does not have to be connected or non-branching, while \(\mathbb {O}\) may provide a pair of scaffolds with up to four options (as in (1)) about their relative orientation.

3.1 \(\mathsf {NP}\)-hardness of the OOS

We consider two important partial cases of the OOS, where the assembly \(\mathbb {A}\) represents a linear or circular genome up to unknown orientations of the scaffolds. In these cases, the graph \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) forms a collection of paths or cycles, respectively. Below we prove that the OOS in both these cases is \(\mathsf {NP}\)-hard.

Theorem 6

The OOS for linear genomes is \(\mathsf {NP}\)-hard.

Proof

We will construct a polynomial-time reduction from the \(\mathrm {MAX}\ 2\)-\(\mathrm {DNF}\) problem, which is known to be \(\mathsf {NP}\)-hard [7, 15]. Given an instance I of \(\mathrm {MAX}\ 2\)-\(\mathrm {DNF}\) consisting of conjunctions \(C = \{c_i\}_{i=1}^k\) on variables \(X = \{x_i\}_{i=1}^n\), we define an assembly
$$\mathbb {A}= \{ (0,x_1) \} \cup \{ (x_i, x_{i+1})\ :\ i=1,2,\dots ,n-1\}.$$
We then construct a set of assembly points \(\mathbb {O}\) from the clauses in C as follows. For each clause \(c\in C\) with two variables \(x_i\) and \(x_j\) (\(i<j\)), we add an oriented assembly point on scaffolds \(x_i, x_j\) to \(\mathbb {O}\) with the orientation depending on the negation of these variables in c (i.e., a clause \(x_i\wedge \overline{x_j}\) is translated into an assembly point \((\overrightarrow{x_i}, \overleftarrow{x_j})\)). For each clause from C with a single variable x, we add an assembly point \((\overrightarrow{0}, \overrightarrow{x})\) or \((\overrightarrow{0}, \overleftarrow{x})\) depending on whether x is negated in the clause.

It is easy to see that the constructed assembly \(\mathbb {A}\) is semi-conflicting and \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) is a path, and thus by Theorem 2 \(\mathbb {A}\) has a non-conflicting realization. Hence, \(\mathbb {A}\) and \(\mathbb {O}\) form an instance of the OOS for linear genomes. A solution \(\mathbb {A}'\) to this OOS provides an orientation for each \(x\in \mathbb {S}\) that maximizes the number of assembly points from \(\mathbb {O}\) having consistent orientations with \(\mathbb {A}'\). A solution to I is obtained from \(\mathbb {A}'\) as the assignment of 1 or 0 to each variable x depending on whether the orientation of scaffold x in \(\mathbb {A}'\) is forward or reverse, respectively. Indeed, since each assembly point in \(\mathbb {O}\) having a consistent orientation with \(\mathbb {A}'\) corresponds to a satisfied clause in I, the number of such clauses is maximized.

It is easy to see that the OOS instance and the solution to I can be computed in polynomial time, thus we constructed a polynomial-time reduction from the \(\mathrm {MAX}\ 2\)-\(\mathrm {DNF}\) to the OOS for linear genomes.    \(\square \)
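
The construction in the proof of Theorem 6 can be sketched as a short Python function. The input encoding is an assumption: variables are numbered `1..n`, and each clause is a tuple of one or two literals `(index, negated)`. Assembly points use the hypothetical tuple form `(s1, o1, s2, o2)`, with `'+'` for a non-negated variable and `'-'` for a negated one, matching the translation of \(x_i\wedge \overline{x_j}\) into \((\overrightarrow{x_i}, \overleftarrow{x_j})\).

```python
def reduce_max2dnf_to_oos(n, clauses):
    """Build (A, O) from a MAX 2-DNF instance on variables x1..xn."""
    names = [f'x{i}' for i in range(1, n + 1)]
    # A is the unoriented path 0 - x1 - x2 - ... - xn.
    assembly = [('0', None, names[0], None)]
    assembly += [(names[i], None, names[i + 1], None) for i in range(n - 1)]

    points = []
    for clause in clauses:
        literals = sorted(clause)              # order literals by variable index
        if len(literals) == 2:
            (i, neg_i), (j, neg_j) = literals
            points.append((f'x{i}', '-' if neg_i else '+',
                           f'x{j}', '-' if neg_j else '+'))
        else:
            i, neg_i = literals[0]
            # Single-variable clauses are anchored at the artificial scaffold 0.
            points.append(('0', '+', f'x{i}', '-' if neg_i else '+'))
    return assembly, points

# (x1 AND NOT x2) OR (NOT x3)
assembly, points = reduce_max2dnf_to_oos(3, [((1, False), (2, True)), ((3, True),)])
```

Both the assembly and the set of points have size linear in the input, in line with the polynomial-time claim of the reduction.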

Theorem 7

The OOS for circular genomes is \(\mathsf {NP}\)-hard.

Proof

We construct a polynomial-time reduction from the \(\mathrm {MAX}\)-\(\mathrm {CUT}\) problem, which is known to be \(\mathsf {NP}\)-hard [17, 18]. An instance I of \(\mathrm {MAX}\)-\(\mathrm {CUT}\) for a given graph (V, E) asks to partition the set of vertices \(V = \{v_i\}_{i=1}^n\) into two disjoint subsets \(V_1\) and \(V_2\) such that the number of edges \(\{u, v\}\in E\) with \(u\in V_1\) and \(v\in V_2\) is maximized. For a given instance I of \(\mathrm {MAX}\)-\(\mathrm {CUT}\) problem, we define the assembly
$$\mathbb {A}= \left\{ (v_i,v_{i+1})\ :\ i=1,2,\dots ,n-1\right\} \cup \left\{ (v_1,v_n) \right\} $$
and the set of assembly points
$$\mathbb {O}= \left\{ (\overrightarrow{v_i}, \overleftarrow{v_j})\ :\ \{v_i, v_j\}\in E \right\} .$$
It is easy to see that \(\mathbb {A}\) has a non-conflicting realization and \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) is a cycle, i.e., \(\mathbb {A}\) and \(\mathbb {O}\) form an instance of the OOS for circular genomes. A solution \(\mathbb {A}'\) to this OOS instance provides orientations for all elements of \(\mathbb {S}(\mathbb {A})=V\) that maximize the number of assembly points from \(\mathbb {O}\) having consistent orientations with \(\mathbb {A}'\). A solution to I is obtained as the partition of V into two disjoint subsets, depending on the orientation of scaffolds in \(\mathbb {A}'\) (forward vs reverse). Indeed, since each assembly point in \(\mathbb {O}\) having a consistent orientation with \(\mathbb {A}'\) corresponds to an edge from E whose endpoints belong to distinct subsets in the partition, the number of such edges is maximized.

It is easy to see that the OOS instance and the solution to I can be computed in polynomial time, thus we constructed a polynomial-time reduction from the \(\mathrm {MAX}\)-\(\mathrm {CUT}\) to the OOS for circular genomes.    \(\square \)
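
The construction in the proof of Theorem 7 admits a similar sketch. The encoding is again hypothetical: vertices are numbered `1..n` with `n >= 3`, edges are pairs of vertex indices, and assembly points are tuples `(s1, o1, s2, o2)`. The helper that reads a partition off an orientation map is our own addition and simply separates forward from reverse scaffolds.

```python
def reduce_maxcut_to_oos(n, edges):
    """Build (A, O) from a MAX-CUT instance on vertices v1..vn (n >= 3 assumed)."""
    names = [f'v{i}' for i in range(1, n + 1)]
    # A is the unoriented cycle v1 - v2 - ... - vn - v1.
    assembly = [(names[i], None, names[i + 1], None) for i in range(n - 1)]
    assembly.append((names[0], None, names[-1], None))
    # Each edge {vi, vj} asks for opposite orientations of vi and vj.
    points = [(f'v{min(i, j)}', '+', f'v{max(i, j)}', '-') for i, j in edges]
    return assembly, points

def partition_from_orientations(orient):
    """Split scaffolds into (forward, reverse) sets given a map name -> '+'/'-'."""
    forward = {s for s, o in orient.items() if o == '+'}
    return forward, set(orient) - forward

assembly, points = reduce_maxcut_to_oos(3, [(1, 2), (3, 2), (1, 3)])
```

For the triangle above, the assembly is the 3-cycle on `v1`, `v2`, `v3`, and each of the three edges contributes one oriented point to \(\mathbb {O}\).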

As a trivial consequence of Theorems 6 and 7, we obtain that the general OOS problem is \(\mathsf {NP}\)-hard.

Corollary 8

The OOS is \(\mathsf {NP}\)-hard.

3.2 Properties of the OOS

In this subsection, we formulate and prove some important properties of the OOS. We start with the following lemma that trivially follows from the definition of consistent orientation.

Lemma 9

Let \(\mathbb {A}\) be an assembly. An assembly point on scaffolds \(s_i, s_j\in \mathbb {S}(\mathbb {A})\) may have a consistent orientation with \(\mathbb {A}\) only if both \(s_i\) and \(s_j\) belong to the same connected component in \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\).

Theorem 10

Let \((\mathbb {A},\mathbb {O})\) be an OOS instance, and \(\mathbb {A}= \mathbb {A}_1 \cup \dots \cup \mathbb {A}_k\) be the partition such that \({{\mathrm{\mathsf {OG}}}}(\mathbb {A}_1),\dots ,{{\mathrm{\mathsf {OG}}}}(\mathbb {A}_k)\) represent the connected components of \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\). For each \(i=1,2,\dots ,k\), define \(\mathbb {O}_i = \{ p\in \mathbb {O}\ :\ {{\mathrm{{sn}}}}(p,1),{{\mathrm{{sn}}}}(p,2)\in \mathbb {S}(\mathbb {A}_i)\}\) and let \(\mathbb {A}'_i\) be a solution to the OOS instance \((\mathbb {A}_i,\mathbb {O}_i)\). Then \(\mathbb {A}'_1\cup \dots \cup \mathbb {A}'_k\) is a solution to the OOS instance \((\mathbb {A},\mathbb {O})\).

Proof

Lemma 9 implies that we can discard from \(\mathbb {O}\) all assembly points that are formed by scaffolds from different connected components in \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\). Hence, we may assume that \(\mathbb {O}= \mathbb {O}_1\cup \dots \cup \mathbb {O}_k\).

Lemma 9 further implies that an assembly point from \(\mathbb {O}_i\) may have a consistent orientation with \(\mathbb {A}_j\) only if \(i=j\). Therefore, any solution to the OOS instance \((\mathbb {A},\mathbb {O})\) is formed by the union of solutions to the OOS instances \((\mathbb {A}_i,\mathbb {O}_i)\).    \(\square \)

Theorem 10 allows us to focus on instances of the OOS, where \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) is connected and thus forms a path or a cycle (by Theorem 2).
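
The decomposition of Theorem 10 can be sketched with a small union-find structure over the scaffolds. The tuple encoding `(s1, o1, s2, o2)` and the function name are assumptions for illustration. Points of \(\mathbb {O}\) whose scaffolds fall into different components of \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) are dropped, as justified by Lemma 9.

```python
def split_by_components(assembly, points):
    """Return a list of sub-instances (A_i, O_i), one per component of OG(A)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    # Merge the two scaffolds of every assembly point of A.
    for s1, _o1, s2, _o2 in assembly:
        parent[find(s1)] = find(s2)

    groups = {}
    for p in assembly:
        groups.setdefault(find(p[0]), ([], []))[0].append(p)
    for p in points:
        root1, root2 = find(p[0]), find(p[2])
        if root1 == root2 and root1 in groups:
            groups[root1][1].append(p)      # both scaffolds in the same component
    return list(groups.values())

assembly = [('a', None, 'b', None), ('c', None, 'd', None)]
points = [('a', '+', 'b', '-'), ('a', '+', 'c', '+')]
instances = split_by_components(assembly, points)
```

In this example the point on `a` and `c` spans two components and is discarded, so the instance on `{a, b}` keeps one point and the instance on `{c, d}` keeps none.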

The following lemma is almost trivial.

Lemma 11

Let \(\mathbb {A}\) be an assembly, and \(s_i,s_j\) be scaffolds from the same connected component C in \(\mathsf {SAG} (\mathbb {A})\). Then an unoriented assembly point \((s_i,s_j)\) has a consistent orientation with \(\mathbb {A}\). Furthermore, if C is a cycle, then any semi-oriented assembly point on \(s_i,s_j\) has a consistent orientation with \(\mathbb {A}\).

By Lemma 11, we can assume that \(\mathbb {O}\) does not contain any unoriented assembly points (i.e., \(\mathbb {O}= \mathbb {O}_o \cup \mathbb {O}_s\)). Furthermore, if \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) is a cycle, we can assume that \(\mathbb {O}=\mathbb {O}_o\) (i.e., \(\mathbb {O}\) consists of oriented assembly points only).

Below we show that an OOS instance can also be solved independently for each connected component of \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_o)\). We consider two cases depending on whether \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) forms a path or a cycle.

Case 1: \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) is a path. Let \((\mathbb {A},\mathbb {O})\) be an OOS instance such that \({{\mathrm{\mathsf {OG}}}}(\mathbb {A}) = (s_1,s_2,\dots ,s_n)\) is a path and \(\mathbb {O}= \mathbb {O}_o \cup \mathbb {O}_s\). First we notice that since \(|{{\mathrm{{NR}}}}(\mathbb {A})|\ge 1\), a solution \(\mathbb {A}'\) to this OOS can be viewed as a sequence of the same scaffolds \((s_1,s_2,\dots ,s_n)\) where every element is oriented.

Let \(C_1, C_2, \dots , C_k\) be the connected components of \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_o)\) and, for \(i=1,2,\dots ,k\), \((s_{j_{i,1}},\dots ,s_{j_{i,m_i}})\) be a sequence of vertices of \(C_i\) such that \(j_{i,1}<j_{i,2}<\dots <j_{i,m_i}\).

We define an assembly \(\mathbb {A}_i\) (\(i=1,2,\dots ,k\)) such that \({{\mathrm{\mathsf {OG}}}}(\mathbb {A}_i)\) is the path \((x_i,s_{j_{i,1}},\dots ,s_{j_{i,m_i}},y_i)\), where \(x_i\) and \(y_i\) are artificial vertices, and the assembly points in \(\mathbb {A}_i\) (corresponding to the edges in \({{\mathrm{\mathsf {OG}}}}(\mathbb {A}_i)\)) are oriented or semi-oriented. Namely, the edges \(\{x_i,s_{j_{i,1}}\}\) and \(\{s_{j_{i,m_i}},y_i\}\) correspond to semi-oriented assembly points \((\overrightarrow{x_i},s_{j_{i,1}})\) and \((s_{j_{i,m_i}},\overrightarrow{y_i})\), respectively. The orientation of the scaffolds in the assembly point corresponding to the edge \(\{s_{j_{i,l}},s_{j_{i,l+1}}\}\) is inherited from the assembly points in \(\mathbb {A}\) formed by \((s_{j_{i,l}},s_{j_{i,l}+1})\) and \((s_{j_{i,l+1}-1},s_{j_{i,l+1}})\), respectively.

We further define \(\mathbb {O}_i\) (\(i=1,2,\dots ,k\)) formed by the oriented assembly points from \(C_i\) and the following oriented assembly points. For each semi-oriented assembly point \(p\in \mathbb {O}\) formed by scaffolds \(s_m\) and \(s_l\) (\(m<l\)), \(\mathbb {O}_i\) contains (i) an oriented point \(p'\) formed by \(s_m\) and \(\overrightarrow{y_i}\) whenever \(s_m\) is oriented in p and belongs to \(C_i\) (and its orientation in \(p'\) is inherited from p); and (ii) an oriented point \(p''\) formed by \(\overrightarrow{x_i}\) and \(s_l\) whenever \(s_l\) is oriented in p and belongs to \(C_i\) (and its orientation in \(p''\) is inherited from p) (Fig. 2).
Fig. 2.

Decomposition of an OOS problem instance \((\mathbb {A}, \mathbb {O})\) based on the connected components of \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_o)\). (a) The superposition of \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) (red edges) and \({{\mathrm{\mathsf {OG}}}}(\mathbb {O})\) (green edges), where arrows (if present) at the ends of green edges encode the orientation of the scaffolds in the corresponding assembly points. (b) The superposition of five graphs \({{\mathrm{\mathsf {OG}}}}(\mathbb {A}_i)\) (red edges) and three graphs \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_j)\) (green edges) constructed based on the connected components of \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_o)\). Unless \({{\mathrm{\mathsf {OG}}}}(\mathbb {A}_i)\) is formed by an isolated vertex, it contains artificial vertices \(x_i\) and \(y_i\), which coincide if \({{\mathrm{\mathsf {OG}}}}(\mathbb {A}_i)\) is a cycle. (Color figure online)

Theorem 12

Let \((\mathbb {A},\mathbb {O})\) be an OOS instance, and \(\mathbb {A}_i\) and \(\mathbb {O}_i\) (\(i=1,2,\dots ,k\)) be defined as above. For each \(i=1,2,\dots ,k\), let \(\mathbb {A}'_i\) be a solution to the OOS instance \((\mathbb {A}_i,\mathbb {O}_i)\). Then a solution \(\mathbb {A}'\) to the OOS instance \((\mathbb {A},\mathbb {O})\) can be constructed as follows. For a scaffold \(s\in \mathbb {S}(\mathbb {A})\) present in some \(\mathbb {A}'_i\), \(\mathbb {A}'\) inherits the orientation of s from \(\mathbb {A}'_i\). For a scaffold \(s\in \mathbb {S}(\mathbb {A})\) not present in any \(\mathbb {A}'_i\), \(\mathbb {A}'\) either inherits the orientation of s from \(\mathbb {A}\) (if s is oriented in any assembly point), or orients s arbitrarily.

Proof

The graph \(\mathsf {SAG} (\mathbb {A}')\) can be viewed as an ordered sequence of directed scaffold edges (interweaved with undirected edges encoding assembly points). Then each \(\mathsf {SAG} (\mathbb {A}'_i)\), with the exception of scaffold edges \(x_i\) and \(y_i\), corresponds to a subsequence of this sequence.

Each oriented assembly point \(p\in \mathbb {O}\) is formed by scaffolds u and v from \(C_i\) for some \(i\in \{1,\dots ,k\}\). Then \(p\in \mathbb {O}\cap \mathbb {O}_i\) and there exist a unique path in \(\mathsf {SAG} (\mathbb {A}'_i)\) and a unique path in \(\mathsf {SAG} (\mathbb {A}')\) having the same directed edges u and v at the ends. Hence, if p has a consistent orientation with one of the assemblies \(\mathbb {A}'\) or \(\mathbb {A}'_i\), then it has a consistent orientation with the other.

Each semi-oriented assembly point \(p\in \mathbb {O}\) formed by scaffolds u and v corresponds to an oriented assembly point \(q\in \mathbb {O}_i\) (for some i) formed by u and \(y_i\) (in which case \(u\in C_i\) and u is oriented in p), or by \(x_i\) and v (in which case \(v\in C_i\) and v is oriented in p). Without loss of generality, we assume the former case. Then there exists a unique path Q in \(\mathsf {SAG} (\mathbb {A}'_i)\) connecting directed edges u and \(y_i\), and there exists a unique path P in \(\mathsf {SAG} (\mathbb {A}')\) connecting directed edges u and v, where the orientation of u is the same in the two paths. By construction, the orientation of \(y_i\) in q matches that in Q. Hence, q has a consistent orientation with \(\mathbb {A}'_i\) if and only if the orientation of u in q matches that in Q, which happens if and only if the orientation of u in p matches its orientation in P, i.e., p has a consistent orientation with \(\mathbb {A}'\).

We have shown that the number of assembly points from \(\mathbb {O}\) having consistent orientation with \(\mathbb {A}'\) equals the total number of assembly points from the sets \(\mathbb {O}_i\) having consistent orientation with the corresponding \(\mathbb {A}'_i\), summed over \(i=1,2,\dots ,k\). It remains to notice that this number is the maximum possible, i.e., \(\mathbb {A}'\) is indeed a solution to the OOS instance \((\mathbb {A},\mathbb {O})\): if it were not, then the assemblies \(\mathbb {A}_i\) constructed from an actual solution to \((\mathbb {A},\mathbb {O})\) would give a better solution to at least one of the subproblems.    \(\square \)
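The component step underlying this decomposition can be sketched in Python as follows. This is an illustrative sketch only, not the CAMSA implementation: the scaffold names, the representation of oriented assembly points as plain pairs, and the function name `components` are assumptions, and orientations are omitted because only connectivity matters at this step.

```python
from collections import defaultdict

def components(scaffolds, oriented_points):
    """Group scaffolds into connected components of the graph whose edges
    are the oriented assembly points (a stand-in for OG(O_o))."""
    parent = {s: s for s in scaffolds}

    def find(s):
        # union-find lookup with path halving
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for u, v in oriented_points:
        parent[find(u)] = find(v)

    groups = defaultdict(list)
    for s in scaffolds:  # scaffolds are listed in their known order
        groups[find(s)].append(s)
    # scaffolds untouched by any oriented point form no component
    return [g for g in groups.values() if len(g) > 1]

order = ["s1", "s2", "s3", "s4", "s5", "s6"]
points = [("s1", "s4"), ("s4", "s6"), ("s2", "s3")]
print(components(order, points))
```

Each returned list keeps the scaffolds of one component in their known order, so it plays the role of the sequence \(s_{j_{i,1}},\dots ,s_{j_{i,m_i}}\) to which the artificial end vertices \(x_i\) and \(y_i\) are attached.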

Case 2: \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) is a cycle. In this case, we can construct subproblems based on the connected components of \({{\mathrm{\mathsf {OG}}}}(\mathbb {O})\) similarly to Case 1, with the following differences. First, we assume that \(\mathbb {O}=\mathbb {O}_o\) (discarding all unoriented and semi-oriented assembly points from \(\mathbb {O}\)). Second, we assume that \(x_i=y_i\) and thus \({{\mathrm{\mathsf {OG}}}}(\mathbb {A}_i)\) forms a cycle. Theorem 12 still holds in this case.

Articulation Vertices in \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\). While Theorem 12 allows us to divide an OOS problem into subproblems based on the connected components of \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_o)\), we show below that a similar division is possible when \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_o)\) is connected but contains an articulation vertex.5

Any articulation vertex v in \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_o)\) defines a partition of \(\mathbb {S}(\mathbb {O}_o)\) into subsets:
$$\begin{aligned} \mathbb {S}(\mathbb {O}_o) = \{v\} \cup V_1 \cup V_2 \cup \dots \cup V_k, \end{aligned} \tag{2}$$
where \(k>1\) and the \(V_i\) are the vertex sets of the connected components resulting from the removal of v from \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_o)\).
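The partition (2) can be computed by deleting v and collecting the remaining connected components. A minimal sketch, assuming the graph is given as a list of undirected edges between scaffold names; the function name `partition_at` is hypothetical:

```python
from collections import defaultdict, deque

def partition_at(edges, v):
    """Return the vertex sets V_1..V_k of the components left after
    deleting v. In a connected graph, v is an articulation vertex
    exactly when k > 1."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, parts = {v}, []  # marking v as seen removes it from the search
    for start in sorted(adj):
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search that never passes through v
            a = queue.popleft()
            comp.add(a)
            for b in adj[a]:
                if b not in seen:
                    seen.add(b)
                    queue.append(b)
        parts.append(comp)
    return parts

edges = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "e")]
print(partition_at(edges, "b"))
```

Testing every vertex this way takes quadratic time overall; a single depth-first search can find all articulation vertices in linear time, which is preferable for large instances.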

The following theorem shows that an articulation vertex in \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_o)\) enables application of Theorem 12.

Theorem 13

Let \((\mathbb {A}, \mathbb {O})\) be an instance of the OOS problem such that \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) and \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_o)\) are connected. Let v be an articulation vertex in \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_o)\), defining a partition (2).

If \(v\in \mathbb {S}_o(\mathbb {A})\), we introduce copies \(v_1, \dots , v_k\) of v, and construct \(\hat{\mathbb {A}}\) from \(\mathbb {A}\) by replacing a path \((u,v,w)\) in \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) with a path \((u,v_1,v_2,\dots ,v_k,w)\), where all \(v_i\) inherit the orientation from v. Then we construct \(\hat{\mathbb {O}}\) from \(\mathbb {O}\) by replacing each assembly point p formed by v and \(u\in V_i\) (for some \(i\in \{1,2,\dots ,k\}\)) with an assembly point formed by \(v_i\) and u (keeping their orientations intact). Then a solution to the OOS instance \((\mathbb {A}, \mathbb {O})\) can be obtained from a solution to the OOS instance \((\hat{\mathbb {A}}, \hat{\mathbb {O}})\) by replacing every scaffold \(v_i\) with v.

If \(v\notin \mathbb {S}_o(\mathbb {A})\), we fix each of the two possible orientations of v in turn, and proceed with the construction and solution of the OOS instance \((\hat{\mathbb {A}}, \hat{\mathbb {O}})\) as above. Then a solution to the OOS instance \((\mathbb {A}, \mathbb {O})\) can be obtained from the better of the two solutions.

Proof

Let \(\hat{\mathbb {A}}'\) be a solution to the OOS instance \((\hat{\mathbb {A}}, \hat{\mathbb {O}})\), and \(\mathbb {A}'\) be obtained from \(\hat{\mathbb {A}}'\) by replacing every \(v_i\) with v. We remark that \(\mathbb {O}\) can be obtained from \(\hat{\mathbb {O}}\) by similar replacement.

This establishes a one-to-one correspondence between the assembly points in \(\hat{\mathbb {A}}'\) and \(\mathbb {A}'\), as well as between the assembly points in \(\hat{\mathbb {O}}'\) and \(\mathbb {O}'\). It remains to show that consistent orientations are invariant under this correspondence.

We remark that \(\mathsf {SAG} (\mathbb {A}')\) can be obtained from \(\mathsf {SAG} (\hat{\mathbb {A}}')\) by replacing a sequence of edges \((r_1,v_1,r_2,v_2,\dots ,r_k,v_k,r_{k+1})\), where \(r_i\) are assembly edges, with a sequence of edges \((r_1,v,r_2)\). Therefore, if there exists a path in one graph proving existence of consistent orientation for some assembly point, then there exists a corresponding path in the other graph (having the same orientations of the end edges).    \(\square \)
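The construction of \((\hat{\mathbb {A}}, \hat{\mathbb {O}})\) in Theorem 13 can be illustrated with a short sketch. It assumes a simplified representation (the scaffold order as a list of names, assembly points as pairs with orientations left out) and a hypothetical function name `split_vertex`; it is not taken from the CAMSA code.

```python
def split_vertex(order, points, v, parts):
    """order: scaffolds in their known order; points: pairs of scaffolds
    joined by oriented assembly points; parts: the vertex sets V_1..V_k
    obtained by removing the articulation vertex v."""
    copies = [f"{v}_{i}" for i in range(1, len(parts) + 1)]
    idx = order.index(v)
    # v is replaced in the order by the consecutive copies v_1..v_k
    new_order = order[:idx] + copies + order[idx + 1:]
    part_of = {u: i for i, part in enumerate(parts) for u in part}

    def rename(a, b):
        # a point touching v is redirected to the copy assigned to the
        # part that contains its other scaffold
        if a == v:
            return (copies[part_of[b]], b)
        if b == v:
            return (a, copies[part_of[a]])
        return (a, b)

    return new_order, [rename(a, b) for a, b in points]

order = ["a", "b", "c", "d"]
points = [("a", "b"), ("b", "c"), ("b", "d")]
print(split_vertex(order, points, "b", [{"a"}, {"c"}, {"d"}]))
```

After the split, no assembly point connects scaffolds from different parts, so the graph of the rewritten points falls apart into k components and Theorem 12 applies.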

3.3 Non-branching Orientation of Ordered Scaffolds

At the late stages of genome assembly, the constructed scaffolds are usually of significant length. If (sub)orders for these scaffolds are known, it is rather rare to have orientation-imposing information that would involve non-neighboring scaffolds. Or, more generally, it is rather rare to have orientation-imposing information for one scaffold with respect to more than two other scaffolds. This inspires us to consider a special case of the OOS problem.

For a set of assembly points \(\mathbb {A}\), we define the contracted order graph \({{\mathrm{\mathsf {COG}}}}(\mathbb {A})\) obtained from \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) by replacing all multi-edges with single edges (Fig. 1c). We now consider a special type of the OOS problem:

Problem 2 (Non-branching Orientation of Ordered Scaffolds, NOOS)

Given an OOS instance \((\mathbb {A},\mathbb {O})\) such that the graph \({{\mathrm{\mathsf {COG}}}}(\mathbb {O}_o)\) is non-branching. Find \(\mathbb {A}'\in {{\mathrm{{NR}}}}(\mathbb {A})\) that maximizes the number of assembly points from \(\mathbb {O}\) having consistent orientations with \(\mathbb {A}'\).
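Checking whether an instance qualifies for NOOS amounts to contracting multi-edges and inspecting vertex degrees. The sketch below assumes that non-branching means every vertex has degree at most 2, and represents assembly points as pairs of scaffold names; both function names are hypothetical.

```python
from collections import defaultdict

def contracted_order_graph(points):
    """Collapse parallel edges: each unordered scaffold pair is kept once."""
    return {frozenset(p) for p in points}

def is_non_branching(edges):
    """True when every vertex has degree at most 2, i.e., the graph
    consists of paths and cycles only."""
    degree = defaultdict(int)
    for e in edges:
        for s in e:
            degree[s] += 1
    return all(d <= 2 for d in degree.values())

points = [("a", "b"), ("b", "a"), ("b", "c")]
cog = contracted_order_graph(points)
print(len(cog), is_non_branching(cog))
```

Note that the degree test must run on the contracted graph: in the example, b has degree 3 when edges are counted with multiplicity, but degree 2 after contraction.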

Theorem 14

The NOOS is in \(\mathsf {P}\).

Proof

By Theorems 10 and 12, we can assume that both \({{\mathrm{\mathsf {OG}}}}(\mathbb {A})\) and \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_o)\) are connected. Since \({{\mathrm{\mathsf {COG}}}}(\mathbb {O}_o)\) is non-branching, we consider two cases depending on whether it is a path or a cycle.

If \({{\mathrm{\mathsf {COG}}}}(\mathbb {O}_o)\) is a path, then every internal vertex in it is an articulation vertex in both \({{\mathrm{\mathsf {COG}}}}(\mathbb {O}_o)\) and \({{\mathrm{\mathsf {OG}}}}(\mathbb {O}_o)\). Our algorithm processes this path in a divide-and-conquer manner. Namely, for a path of length greater than 2, we pick a vertex v closest to the middle of the path and proceed as in Theorem 13. A path of length at most 2 can be solved in \(\mathcal {O}\left( |\mathbb {O}|\right) \) time by brute-forcing all possible orientations of the scaffolds in the path and counting how many assembly points in \(\mathbb {O}\) get consistent orientations.
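The brute-force base case can be sketched as follows. This is a simplified illustration rather than the authors' implementation: each oriented assembly point is assumed to be a tuple (u, ou, v, ov) with u preceding v in the known order, and a point is counted as consistent when both scaffolds receive exactly the listed orientations. The function name and the `fixed` parameter (for scaffolds whose orientation is already known) are hypothetical.

```python
from itertools import product

def best_orientation(scaffolds, points, fixed=None):
    """Try every +/- assignment of the free scaffolds and return the
    assignment that satisfies the most points, with its score."""
    fixed = fixed or {}
    free = [s for s in scaffolds if s not in fixed]
    best, best_score = None, -1
    for signs in product("+-", repeat=len(free)):
        orient = {**fixed, **dict(zip(free, signs))}
        # a point is satisfied when both of its scaffolds match
        score = sum(orient[u] == ou and orient[v] == ov
                    for u, ou, v, ov in points)
        if score > best_score:
            best, best_score = orient, score
    return best, best_score

points = [("a", "+", "b", "-"), ("a", "+", "b", "-"),
          ("a", "-", "b", "+"), ("b", "-", "c", "+")]
print(best_orientation(["a", "b", "c"], points))
```

A path of length at most 2 has at most three scaffolds, hence at most eight assignments, and each assignment is scored in one pass over the points, which is in line with the linear bound stated above.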

The number of operations for the preprocessing stage of the algorithm (dominated by construction of hash tables) is \(\mathcal {O}\left( k\cdot \log (k)\right) \), where \(k=\max \{|\mathbb {O}|, |\mathbb {S}(\mathbb {A})|\}\). The running time for the recursive part of the algorithm can be described as
$$T(l) = {\left\{ \begin{array}{ll} 4\cdot T\left( \frac{l}{2}\right) + \mathcal {O}\left( 1\right) , &{} \text {if}\ l>2;\\ \mathcal {O}\left( |\mathbb {O}|\right) , &{} \text {if}\ l \le 2. \end{array}\right. }$$
From the Master theorem [8], we conclude that the total running time for the proposed recursive algorithm is \(\mathcal {O}\left( l^2 + k\cdot \log (k)\right) \), where \(l=|\mathbb {O}|\).
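The quadratic growth of the recurrence can be checked numerically. The sketch below counts the base-case subproblems generated by \(T(l) = 4\cdot T(l/2)\), taking l to be a power of two and charging one unit per base case; the function name is hypothetical and the unit cost is a simplifying assumption.

```python
def base_case_count(l):
    """Number of base-case subproblems produced by T(l) = 4*T(l/2),
    with T(l) = 1 for l <= 2."""
    return 1 if l <= 2 else 4 * base_case_count(l // 2)

for l in (2, 4, 8, 16, 32):
    print(l, base_case_count(l), l * l // 4)
```

For powers of two the count equals \(l^2/4\), matching the \(l^{\log _2 4} = l^2\) term given by the Master theorem.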

If \({{\mathrm{\mathsf {COG}}}}(\mathbb {O}_o)\) is a cycle, we can reduce the corresponding NOOS instance to the case of a path as follows. First, we pick an arbitrary vertex w in \({{\mathrm{\mathsf {COG}}}}(\mathbb {O}_o)\) and replace it with new vertices \(w_1\) and \(w_2\) such that the edges \(\{u, w\}\), \(\{w, v\}\) in \({{\mathrm{\mathsf {COG}}}}(\mathbb {O}_o)\) are replaced with \(\{u, w_1\}\), \(\{w_2, v\}\). Then we solve the NOOS for the resulting path one or two times (depending on whether \(w\in \mathbb {S}_o(\mathbb {A})\)): once for each of the possible orientations of scaffold w (inherited by \(w_1\) and \(w_2\)), and then select the orientation for w that produces a better result.    \(\square \)
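The cycle-to-path reduction can be sketched as follows, assuming the cycle is given as a list of edges between scaffold names; the function name `cut_cycle` and the naming of the two copies are hypothetical.

```python
def cut_cycle(cycle_edges, w):
    """Replace vertex w of a cycle by two copies, one per incident edge,
    which turns the cycle into a path between the copies."""
    w1, w2 = f"{w}_1", f"{w}_2"
    out, used_first = [], False
    for a, b in cycle_edges:
        if w in (a, b):
            # the first edge at w gets w_1, the second gets w_2
            copy = w2 if used_first else w1
            used_first = True
            a, b = (copy if a == w else a), (copy if b == w else b)
        out.append((a, b))
    return out

cycle = [("u", "w"), ("w", "v"), ("v", "u")]
print(cut_cycle(cycle, "w"))
```

The resulting path is then solved once per admissible orientation of w, with both copies fixed to that orientation, and the better of the outcomes is kept.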

Pseudocode for the algorithm described in the proof of Theorem 14 is given in Algorithm 1 in the Appendix.

4 Conclusion

In the present study, we posed the orientation of ordered scaffolds (OOS) problem as an optimization problem based on given weighted orientations of scaffolds and their pairs. We further addressed it within the earlier introduced CAMSA framework, taking advantage of the simple yet powerful concept of assembly points describing (semi-/un-) oriented adjacencies between scaffolds. This approach allows one to uniformly represent both orders of (oriented and/or unoriented) scaffolds and orientation-imposing data.

We proved that the OOS problem is \(\mathsf {NP}\)-hard when the given scaffold order represents a linear or circular genome. We also described a polynomial-time algorithm for the special case of non-branching OOS (NOOS), where the orientation of each scaffold is imposed relatively to at most two other scaffolds. Our algorithm for the NOOS problem and Theorems 10, 12, and 13 further enable us to develop an \(\mathsf {FPT}\) algorithm for the general OOS problem (to be described elsewhere).

Footnotes

  1. We remark that contigs can be viewed as a special type of scaffolds with no gaps.

  2. It can be easily seen that a realization of \(\mathbb {A}\) may exist only if \(\mathbb {A}\) is proper.

  3. \(\deg (v)\) denotes the degree of a vertex v, i.e., the number of edges (counted with multiplicity) incident to v.

  4. More generally, \(\mathbb {O}\) may be a multiset whose elements have real positive multiplicities (weights).

  5. Recall that a vertex is an articulation vertex if its removal from the graph increases the number of connected components.

Notes

Acknowledgements

The authors thank the anonymous reviewers for their suggestions and comments that helped to improve the exposition.

The work is supported by the National Science Foundation under the grant No. IIS-1462107. The work of SA is also partially supported by the National Science Foundation under the grant No. CCF-1053753 and by the National Institutes of Health under the grant No. U24CA211000.

References

  1. Aganezov, S., Alekseyev, M.A.: Multi-genome scaffold co-assembly based on the analysis of gene orders and genomic repeats. In: Bourgeois, A., Skums, P., Wan, X., Zelikovsky, A. (eds.) ISBRA 2016. LNCS, vol. 9683, pp. 237–249. Springer, Cham (2016). doi: 10.1007/978-3-319-38782-6_20
  2. Aganezov, S.S., Alekseyev, M.A.: CAMSA: A Tool for Comparative Analysis and Merging of Scaffold Assemblies. Preprint bioRxiv:10.1101/069153 (2016)
  3. Anselmetti, Y., Berry, V., Chauve, C., Chateau, A., Tannier, E., Bérard, S.: Ancestral gene synteny reconstruction improves extant species scaffolding. BMC Genom. 16(Suppl 10), S11 (2015)
  4. Assour, L.A., Emrich, S.J.: Multi-genome synteny for assembly improvement. In: Proceedings of 7th International Conference on Bioinformatics and Computational Biology, pp. 193–199 (2015)
  5. Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., Pyshkin, A.V., Sirotkin, A.V., Vyahhi, N., Tesler, G., Alekseyev, M.A., Pevzner, P.A.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012)
  6. Bashir, A., Klammer, A.A., Robins, W.P., Chin, C.S., Webster, D., Paxinos, E., Hsu, D., Ashby, M., Wang, S., Peluso, P., Sebra, R., Sorenson, J., Bullard, J., Yen, J., Valdovino, M., Mollova, E., Luong, K., Lin, S., LaMay, B., Joshi, A., Rowe, L., Frace, M., Tarr, C.L., Turnsek, M., Davis, B.M., Kasarskis, A., Mekalanos, J.J., Waldor, M.K., Schadt, E.E.: A hybrid approach for the automated finishing of bacterial genomes. Nat. Biotech. 30(7), 701–707 (2012)
  7. Bazgan, C., Paschos, V.T.: Differential approximation for optimal satisfiability and related problems. Eur. J. Oper. Res. 147(2), 397–404 (2003)
  8. Bentley, J.L., Haken, D., Saxe, J.B.: A general method for solving divide-and-conquer recurrences. ACM SIGACT News 12(3), 36–44 (1980)
  9. Bodily, P.M., Fujimoto, M.S., Snell, Q., Ventura, D., Clement, M.J.: ScaffoldScaffolder: solving contig orientation via bidirected to directed graph reduction. Bioinformatics 32(1), 17–24 (2015)
  10. Boetzer, M., Henkel, C.V., Jansen, H.J., Butler, D., Pirovano, W.: Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27(4), 578–579 (2011)
  11. Boetzer, M., Pirovano, W.: SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information. BMC Bioinf. 15(1), 211 (2014)
  12. Burton, J.N., Adey, A., Patwardhan, R.P., Qiu, R., Kitzman, J.O., Shendure, J.: Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31(12), 1119–1125 (2013)
  13. Chen, Z.Z., Harada, Y., Guo, F., Wang, L.: Approximation algorithms for the scaffolding problem and its generalizations. Theoret. Comput. Sci. (2017). http://www.sciencedirect.com/science/article/pii/S0304397517302815
  14. Dayarian, A., Michael, T.P., Sengupta, A.M.: SOPRA: scaffolding algorithm for paired reads via statistical optimization. BMC Bioinf. 11, 345 (2010)
  15. Escoffier, B., Paschos, V.T.: Differential approximation of min sat, max sat and related problems. Eur. J. Oper. Res. 181(2), 620–633 (2007)
  16. Gao, S., Nagarajan, N., Sung, W.-K.: Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. In: Bafna, V., Sahinalp, S.C. (eds.) RECOMB 2011. LNCS, vol. 6577, pp. 437–451. Springer, Heidelberg (2011). doi: 10.1007/978-3-642-20036-6_40
  17. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness, vol. 58. Freeman, San Francisco (1979)
  18. Garey, M.R., Johnson, D.S., Stockmeyer, L.: Some simplified NP-complete graph problems. Theoret. Comput. Sci. 1(3), 237–267 (1976)
  19. Gritsenko, A.A., Nijkamp, J.F., Reinders, M.J.T., de Ridder, D.: GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies. Bioinformatics 28(11), 1429–1437 (2012)
  20. Hunt, M., Newbold, C., Berriman, M., Otto, T.D.: A comprehensive evaluation of assembly scaffolding tools. Genome Biol. 15(3), R42 (2014)
  21. Jiao, W.B., Garcia Accinelli, G., Hartwig, B., Kiefer, C., Baker, D., Severing, E., Willing, E.M., Piednoel, M., Woetzel, S., Madrid-Herrero, E., Huettel, B., Hümann, U., Reinhard, R., Koch, M.A., Swan, D., Clavijo, B., Coupland, G., Schneeberger, K.: Improving and correcting the contiguity of long-read genome assemblies of three plant species using optical mapping and chromosome conformation capture data. Genome Res. 27(5), 116 (2017)
  22. Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for DNA sequence assembly. Algorithmica 13(1–2), 7–51 (1995)
  23. Kolmogorov, M., Armstrong, J., Raney, B.J., Streeter, I., Dunn, M., Yang, F., Odom, D., Flicek, P., Keane, T., Thybert, D., Paten, B., Pham, S.: Chromosome assembly of large and complex genomes using multiple references. Preprint bioRxiv:10.1101/088435 (2016)
  24. Koren, S., Treangen, T.J., Pop, M.: Bambus 2: scaffolding metagenomes. Bioinformatics 27(21), 2964–2971 (2011)
  25. Lam, K.K., Labutti, K., Khalak, A., Tse, D.: FinisherSC: a repeat-aware tool for upgrading de novo assembly using long reads. Bioinformatics 31(19), 3207–3209 (2015)
  26. Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., Wang, J.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1(1), 18 (2012)
  27. Nagarajan, N., Read, T.D., Pop, M.: Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics 24(10), 1229–1235 (2008)
  28. Pop, M., Kosack, D.S., Salzberg, S.L.: Hierarchical scaffolding with Bambus. Genome Res. 14(1), 149–159 (2004)
  29. Putnam, N.H., O’Connell, B.L., Stites, J.C., Rice, B.J., Blanchette, M., Calef, R., Troll, C.J., Fields, A., Hartley, P.D., Sugnet, C.W., Haussler, D., Rokhsar, D.S., Green, R.E.: Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Res. 26(3), 342–350 (2016)
  30. Reyes-Chin-Wo, S., Wang, Z., Yang, X., Kozik, A., Arikit, S., Song, C., Xia, L., Froenicke, L., Lavelle, D.O., Truco, M.J., Xia, R., Zhu, S., Xu, C., Xu, H., Xu, X., Cox, K., Korf, I., Meyers, B.C., Michelmore, R.W.: Genome assembly with in vitro proximity ligation data and whole-genome triplication in lettuce. Nat. Commun. 8, Article no. 14953 (2017). https://www.nature.com/articles/ncomms14953
  31. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009)
  32. Tang, H., Zhang, X., Miao, C., Zhang, J., Ming, R., Schnable, J.C., Schnable, P.S., Lyons, E., Lu, J.: ALLMAPS: robust scaffold ordering based on multiple maps. Genome Biol. 16(1), 3 (2015)
  33. Warren, R.L., Yang, C., Vandervalk, B.P., Behsaz, B., Lagman, A., Jones, S.J.M., Birol, I.: LINKS: scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience 4(1), 35 (2015)
  34. Zimin, A.V., Smith, D.R., Sutton, G., Yorke, J.A.: Assembly reconciliation. Bioinformatics 24(1), 42–45 (2008)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Princeton University, Princeton, USA
  2. ITMO University, St. Petersburg, Russia
  3. The George Washington University, Washington, DC, USA
