Introduction

Background

The last years have witnessed Next-Generation Sequencing technologies applied to mixed settings in which the input sample consists of different, but highly related, genomic sequences. A major problem in this setting is to assemble the NGS reads produced from these different sequences, problem called multi-assembly [1].

An emblematic example is the multi-assembly of the expressed transcripts of a gene from RNA-Seq reads [2, 3]. The RNA transcripts of a gene are concatenations of exons, which can be shared among them, and whose length is typically much longer than the short read length. The RNA-Seq technology has proved essential in characterizing gene regulation and function, understanding development, disease, and disorders, including cancer [47]. The most popular tool for multi-assembly of RNA-Seq reads is Cufflinks [8], but the great interest in the community has led to a recent proliferation of methods and tools, such as [920]. Another example is the multi-assembly of NGS reads from viral quasi-species [21]. Since many viruses, such as HIV or HCV, encode their genomes in RNA rather than DNA, they lack DNA polymerase and are unable to repair mistakes in their genomes as they reproduce. Over the course of infection, the mistakes made in the replication of the virus are passed down to descendants, producing a family of related variants of the original viral genome, referred to as quasi-species. Among all of the new quasi-species produced, some may be more virulent than others, and it is of great epidemiological interest to identify them. Methods for this problem include [2227].

The vast majority of the multi-assembly tools are genome-guided, in the sense that they have access to one reference genome. Consequently, the analysis proceeds by aligning the reads to this reference, and constructing one of two major graph models. In the first, called an overlap graph, the nodes stand for reads and the edges stand for overlaps between reads. This model is employed both for RNA-Seq reads (by Cufflinks [8]), and for pyrosequencing reads from a viral population (ShoRAH [25, 26]). In the second model, called a splicing graph and used mainly for RNA-Seq reads, the nodes stand for contiguous stretches of DNA present entirely in some transcript (pseudo-exons); its edges stand for reads spanning two pseudo-exons and indicate that they are consecutive in some transcript. This model is employed by most of the other methods for the multi-assembly of RNA-Seq reads [920]. Since both graph models arise from alignments to a reference sequence, they are also directed and acyclic (DAGs). Moreover, the nodes and the edges of the graph are weighted according to the observed coverage, and different strategies exist for integrating them into the formulation of the multi-assembly problem. For example, in Cufflinks [8], the weight of an edge reflects the belief that its two endpoints originate from different transcripts, and is computed using the percent-spliced-in metric proposed in [28].

Motivation

Given an overlap or a splicing DAG, many methods [8, 19, 20, 2527] model the multi-assembly problem as a Minimum Path Cover Problem; these include the well-known tool for RNA-Seq reads Cufflinks [8]. A path cover in a directed graph G is a set of paths which cover all the nodes of G. A minimum path cover (MPC) is a path cover of minimum cardinality. Often, the edges of the DAG are weighted, and one is then interested in a minimum weight MPC. Even though this problem is in general NP-complete (a path cover has cardinality 1 if and only if the directed graph has a Hamiltonian path), it is solvable in polynomial time on DAGs [29]. This fact is one of the main reasons why the MPC Problem has attracted so much interest. Therefore, it makes sense to extend it with other biological information, while maintaining its polynomial-time solvability.

In this paper we consider additional information arising from paired-end reads or long reads. Observe that, currently, both graph models and the associated MPC Problem include constraints only on pairs of nodes which must be consecutive in the (same) genomic sequence. However, on the one hand, most sequencers produce paired-end reads; these two reads correspond to nodes that must be in the same genomic sequence, but they are no longer consecutive in it. On the other hand, Third-Generation Sequencing technologies, like Pacific Biosciences [30], produce long reads whose length is in the range of thousands of base-pairs. If properly error-corrected, they introduce additional constraints on the sequences of nodes which must appear as consecutive in the same assembled genomic sequences. In the case of a splicing graph, such additional constraints can be introduced even from short reads completely overlapping a short middle pseudo-exon (such as in the case of alternative donor/acceptor sites [31]).

Two different problem formulations have been recently proposed to better guide the multi-assembly using paired-end or long reads. In the first [20], a partial assembly of the RNA transcripts is assumed (transfrags), and the following problem, which we call Minimum Path Cover with Subpath Constraints (MPC-SC), is proposed. Given a DAG G and a set of subpaths in G (the transfrags, or the long reads), we are asked to find a MPC such that each given subpath is contained completely in some path of the path cover. In [20], the authors consider in fact the weighted version of the problem, and propose a polynomial-time reduction to the classical weighted MPC Problem. However, their reduction is incomplete as it does not deal with the case when two subpaths P1 and P2 are such that a suffix of P1 is a prefix of P2. In the second formulation [19], given a DAG G and a set of paired-end RNA-Seq read alignments to the nodes of G, we are asked to find a minimum path cover whose paths contain all given paired-end reads. We call this problem Minimum Path Cover with Paired Subpaths Constraints (MPC-PSC). In [19], the authors tackle the MPC-PSC Problem by modeling it as the NP-complete set cover problem.

Results and discussion

In this paper, we solve both the MPC-SC and the MPC-PSC Problem. Namely, we state the MPC-SC Problem more generally than in [20], and give a correct and robust polynomial-time reduction of it to the classical MPC Problem on a DAG. Denote by n the number of nodes of the input DAG, by m its number of edges, by c the total number of subpath constraints, and by N the sum of their lengths. Constructing this reduction to the classical MPC Problem requires a pre-processing step, which, if implemented trivially, takes O(c2n2) time; however, we can reduce that to O(N + c2) by use of a suffix tree construction suitable for large alphabets [32], and of an optimal-time algorithm for computing all pairs longest suffix-prefix overlaps [33, 34]. The complexity of solving Problem MPC-SC thus becomes O N + c 2 + n + c n 2 + c .

We also consider the weighted version of Problem MPC-SC, and show that it can be solved in time O(N + (n + c)2 log(n + c) + (n + c)(m + c)) by a reduction to a min-cost circulation problem on a network with flow lower bounds only [35]. Moreover, we prove that the MPC-PSC Problem itself is NP-complete, but we show that it is fixed-parameter tractable (FPT) in the total number of constraints on the DAG.

As a side result of this paper, we obtain a simple algorithm for the classical minimum weight MPC Problem running in time O(n2log n + nm), based on a recent reduction to a network flow problem [36]. This improves the current best bound O(n2 log n + nt(G)), where t G m , m + 1 , , n 2 is the number of edges in the transitive closure of G, arising from the reduction in [29].

In view of this computational dichotomy between paired-end reads and long reads/transfrags, an alternative title of this paper could have been "Long reads are better than paired-end reads in multi-asssembly problems". In fact, in the experiments we conducted for our own tool for RNA-Seq multi-assembly Traph [37, 38], we fed Cufflinks [8], IsoLasso [10], SLIDE [12] and Traph both with single-end and paired-end reads, but did not notice any significant change in the multi-assembly accuracy. Nevertheless, an immediate solution to the negative result concerning the complexity of the MPC-PSC Problem could be to simply transform paired-end reads into long reads by a local assembly method which fills the gap between them, such as [39, 40].

As a preliminary experiment, in the Supplementary Material we show the solutions of Problem MPC-SC on simulated RNA-Seq data from six cancer-related genes. These results are compared to the ground truth, and to Cufflinks' solutions (given that Cufflinks uses the classical MPC model). These preliminary results indicate that, thanks to the additional long read constraints to the MPC problem, Problem MPC-SC reports more transcripts than Cufflinks, and they are generally more accurate.

Both MPC-SC and MPC-PSC Problems are natural extensions of the classical MPC problem, and can be applied to any graph model for multi-assembly, such as an overlap graph or a splicing graph. The MCP Problem has received great interest in the multi-assembly community, and pair-end reads, long reads, or transfrags are either already, or expected to be easily available in the near future. Our positive result concerning the MPC-SC Problem, and the two proposed solutions for the MPC-PSC Problem, give efficient ways to incorporate additional information that an NGS pipeline can provide. Moreover, all of our solutions are based on easy to implement reductions, and resort to well-known problems in combinatorial optimization, for which there are many existing solvers.

Independently and parallel to this work, [41] gave analogs of our Thm. 4 and Lemma 2 for Problem MPC-PSC.

Methods

A faster algorithm for the weighted Minimum Path Cover (MPC) Problem

Given a directed graph G, we say that a family P= P 1 , , P k of paths in G is a path cover of G if every v ∈ V (G) belongs to some P i . Throughout this paper, we let n stand for the number of vertices of G and m stand for the number of edges of G. A minimum path cover (MPC) of G is a path cover of G of minimum cardinality. If each edge e of G has a non-negative weight w(e), then a minimum weight minimum path cover is a minimum path cover P which minimizes the sum of the weights of the edges of the paths of P, that is, Σ P P Σ e∈P w(e).

A well-known result on path covers in directed acyclic graphs (DAGs) is Dilworth's theorem [42], which equates the minimum number of paths in a path cover to the maximum cardinality of an anti-chain (this cardinality is sometimes called width); an anti-chain is a set of nodes with no directed path between any two of them. A constructive proof of this theorem, due to Fulkerson [29], shows that the MPC problem can be reduced to a maximum matching problem in a bipartite graph, as follows. Given a directed graph G, let T(G) denote the transitive closure of G, that is, the digraph obtained from G by repeatedly adding, until no longer possible, an edge (u, v) whenever (u, v) E(G) but there exist w ∈ V (G) such that (u, w), (w, v) ϵ E(G); we let t(G) denote the number of edges of T(G). Note that if G is a DAG, then T(G) can be computed in time O(t(G)). Fulkerson showed that a MPC can be obtained by computing a maximum matching in a bipartite graph associated to G, having two copies of V (G) as nodes and the edges of T(G) as edges (see Figures 1(a) and 1(b)). Therefore, using the Hopcroft-Karp maximum matching algorithm, a MPC can be computed in time O n t G [43]. To compute a minimum weight MPC, the same bipartite graph can be constructed, having edge weights induced from path weights in G. A minimum weight MPC corresponds to a minimum weight maximum matching on this graph, which can be computed in time O(n2 log n + nt(G)) [44].

Figure 1
figure 1

In Fig. 1(a), an input DAG G. In Fig. 1(b), the reduction to the maximum matching problem: a bipartite graph B(G) having as vertices two copies of V(G) and an edge between the first copy of v1 ∈ V(G) and the second copy of v2 ∈ V(G) iff there is a directed path in G from v1 to v2. The edges of a maximum matching of B(G) are highlighted, and a MPC for G is obtained by putting v1 and v2 in the same path there is an edge between v1 and v2 and is selected by the maximum matching. In Fig. 1(c), a network flow N(G) corresponding to G; the labels '1' on some edges are the lower bounds on that edges; all other edges have lower bound 0. The min-flow on N(G) has value 2; the edges with flow value 1 are highlighted; any decomposition of this flow into paths gives a MPC.

A recent solution for the MPC Problem reduces it instead to a min-flow problem [36], as follows. Each node of G is replaced by an arc with lower bound 1 (all other edges of G have lower bound 0), and a new global source s and sink t are added to G and connected to all sources and sinks of G, respectively (see Figures 1(a) and 1(c)). A min-flow on this digraph is a flow of minimum value satisfying all lower bounds. The value of the min-flow on this network equals the maximum size of an anti-chain of G, and any decomposition of it into paths gives a MPC [36]. A decomposition of a flow on a DAG into paths can be computed in time linear in the number of edges, by traversing the edges used by the flow [45]. A min-flow problem can be solved by two applications of a max-flow algorithm [45]. Therefore, using the recent result on max-flows [46], this approach finds a MPC in time O(nm).

If in the unweighted case, the complexity of the method of [36] is incomparable with the complexity of solving the MPC Problem by a maximum matching problem, in the weighted case, the method of [36] leads to one of improved complexity. This is obtained by an algorithm for the following restricted variant of the min-cost circulation problem [45, 47]: given a directed graph, and a flow lower bound for each edge and a cost per flow unit for each edge, the task is to find a circulation of minimum total cost satisfying all lower bounds. A circulation is a function assigning a flow value to each edge such that the flow conservation property is satisfied for all nodes; consequently, the flow network cannot have sources or sinks.

To solve the minimum weight MPC Problem, we extend the reduction in [36] by associating to the edges either cost 0, if they correspond to the nodes of G or are incident to s or t; or their weight in G, if they correspond to edges of G. Moreover, we add a new edge from t to s with lower bound 0 and having as cost the sum of all edge weights (plus a positive constant if all are 0). This implies that all min-cost circulations induce a min-flow (removing the edge from t to s), and thus, by [36], induce also a MPC that is of minimum weight; obviously, vice versa, a minimum weight MPC induces a min-cost circulation on the constructed flow network.

There are many algorithms and solvers for the min-cost circulation problem, with various time complexity upper bounds [47], for example O(nm log log C log(nK)) [48], where C is the maximum edge bound, and K is the maximum cost. If edges have only lower bounds, as in our case, the min-cost circulation problem can be solved in time O(n log C(m + n log n)) [35]; since we have C = 1, this reduces to O(n2 log n + nm). Therefore, we have the following theorem.

Theorem 1 A minimum weight MPC of a DAG with n nodes and m edges can be computed in time O(n2log n + nm), by a reduction to a min-cost circulation problem.

The new problem formulations

We first consider the problem arising from long reads, or from transfrags. We introduce a slight generalization of a path cover of a DAG G, namely a set of paths which cover only a given subset V of the nodes. We are also given a subset E of the edges of G, and a family of subpaths P i n in G that all have to be entirely covered by some path of the path cover. We could have modeled each edge constraint in E as a path of length 1 in P i n , but for clarity, we keep these separate. Formally, we have:

Minimum Path Cover with Subpath Constraints (MPC-SC) Problem INPUT: A DAG G and

  1. 1

    A subset V of V (G)

  2. 2

    A subset E of E(G)

  3. 3

    A family P i n = P 1 i n , , P t i n of directed paths in G

TASK: Find a minimum number k of directed paths P 1 s o l ,, P k s o l in G such that

  1. 1

    Every node in V occurs in some P i s o l

  2. 2

    Every edge in E occurs in some P i s o l

  3. 3

    Every path Pin ϵ P i n is entirely contained in some P i s o l

We call the elements of the sets V , E , P i n constraints, and we say that the k paths in a solution satisfy these constraints.

Let us briefly argue that the solution in [20, Sec. 2.4.1. and 2.4.2] for MPC-SC Problem (without the generalization at points 1 and 2) is not complete. (Actually, [20] tackles the Minimum Weight MPC with Subpath Constraints Problem--see below--, but the weights are not relevant for this discussion.) The idea of [20] is to reduce this problem to the classical MPC problem. Consequently, each subpath constraint P is modeled by a single edge having the same endpoints as P , which is subdivided by introducing a node v P in the middle (which must be covered by the MPC). The connections between the first or last node of P and the other nodes of the DAG are maintained, but since the internal nodes of P can no longer be required to be covered by the path cover, they are removed. Moreover, for all nodes v1 and v2 such that there is a path between v1 and v2 in the DAG using a proper subpath of P , a new transitive edge (v1, v2) is added. However, this reduction is missing the case in which two subpath constraints P1 and P2 are such that a suffix of P1 is a prefix of P2. As a matter of fact, our proof will show that the most problematic case is when also a suffix of different length of P1 is a prefix of some other subpath constraint P3 (see Figure 2 and the proof of Lemma 1).

Figure 2
figure 2

In Fig. 2(a), subpath constraint P , in Fig. 2(b) the reduction of [20] which replaces path P by node v P connected to the end points of P , removes all internal nodes of P , and adds all transitive edges from and to v1 and v2. In Fig. 2(c), a case not covered by the reduction in [20].

In the second problem, we consider the weighted case, with one further generalization, as follows. As also noted by [20], in practice the paths in the path cover should start only in source nodes or in a specific subset of other nodes of G; similarly for their ending nodes. For example, in our method for the multi-assembly of RNA-transcripts [37, 38], these nodes are identified when there is a sharp increase/decrease in read coverage in the middle of an exon, indicating the start/end of a transcript.

Minimum Weight Minimum Path Cover with Subpath Constraints (MW-MPC-SC) Problem

INPUT: A DAG G and

  1. 1

    A subset V of V (G)

  2. 2

    A subset E of E(G)

  3. 3

    A family P i n = P 1 i n , , P t i n of directed paths in G

  4. 4

    A superset S of the sources of G, and a superset T of the sinks of G

  5. 5

    A weight w(e) for each e ∈ E(G)

TASK: Find a minimum number k of directed paths P 1 s o l ,, P k s o l in G such that

  1. 1

    Every node in V occurs in some P i s o l

  2. 2

    Every edge in E occurs in some P i s o l

  3. 3

    Every path P in P i n is entirely contained in some P i s o l

  4. 4

    Every path P i s o l starts in a node of S and ends in a node of T

  5. 5

    i 1 , , k edge e P i s o l w e is minimum among all tuples of k paths satisfying properties 1-4

When only paired-end reads are available, each such pair of reads corresponds to a pair of subpaths that must both be covered by the same path in the path cover. Formally, we have:

Minimum Path Cover with Paired Subpath Constraints (MPC-PSC) Problem

INPUT: A DAG G and

  1. 1

    A subset V of V (G)

  2. 2

    A subset E of E(G)

  3. 3

    A family P i n = P 1 , 1 i n , P 1 , 2 i n , , P t , 1 i n , P t , 2 i n of pairs of directed paths in G

TASK: Find a minimum number k of directed paths P 1 s o l , , P k s o l in G such that

  1. 1

    Every node in V occurs in some P i s o l

  2. 2

    Every edge in E occurs in some P i s o l

  3. 3

    For every pair P j , 1 i n , P j , 2 i n P i n , there exists such P i s o l that both P j , 1 i n and P j , 2 i n are entirely contained in P i s o l

The MPC with Subpath Constraints (MPC-SC) Problem

The unweighted case

In this section, we reduce the MPC-SC Problem to the classical MPC Problem. We describe our reduction as a sequence of commented algorithmic steps.

Step 1. for every u , v E do: V := V \ u , v ;

If the MPC has a path P covering the arc (u, v), then P also covers both u and v. Therefore, the constraints u, v can be dropped from V (if present).

Step 2. for every path P j i n P i n and for every edge u , v P j i n do:

V : = V \ u , v ; E : = E \ u , v ;

Similarly to Step 1, if the MPC has a path P covering a subpath P j i n P i n , then P also covers every node and edge of P j i n , thus these constraints can be dropped from V and E (if present).

Step 3. while there exist two paths P i i n and P j i n in P i n such that P i i n is contained in P j i n do:

P i n : = P i n \ P i i n ;

After this step, no subpath constraint is completely included into another; this is key for the correctness of Step 4 below.

Step 4. while there exist two paths P i i n , P j i n P i n such that a suffix of P i i n is a prefix of P j i n do:

let P i i n , P j i n P i n be as above and with the common part (i.e., the suffix of P i i n which is a prefix of P j i n ) the longest possible;

let P n e w i n := the path P i i n P j i n which starts as P i i n and ends as P j i n ;

P i n : = P i n \ P i i n , P j i n P n e w i n ;

In this step, we merge paths sharing a suffix/prefix. We do this iteratively, at each step merging that pair of paths for which the shared suffix/prefix is longest possible. The correctness of this step is guaranteed by Lemma 1 below.

Lemma 1 If the MPC-SC Problem on an instance G , V , E , P i n admits a solution with k paths, then also the problem instance transformed by applying Steps 1-4 admits a solution with k paths, and this solution also satisfies the original constraints V , E , P i n .

Proof The correctness of Steps 1-3 was argued next to their introduction. Assume that G, V , E , P i n have been transformed by these first three steps, and let P i i n , P j i n P i n be such that their common part (i.e., the suffix of P i i n which is a prefix of P j i n ) is longest possible. Suppose that the original problem admits a solution P s o l = P 1 s o l , , P k s o l such that P i i n , P j i n are covered by different solution paths say P a s o l and P b s o l , respectively. We show that the transformed problem admits a solution P * = P 1 s o l , , P k s o l \ P a s o l , P b s o l P a * , P b * , having the same cardinality asP, in which P i i n , P j i n are covered by the same path P a * , and P * also satisfies the original constraints V , E , P i n .

Suppose that P i i n starts with node u i and ends with node v i , and that P j i n starts with node u j and ends with node v j . Let (cf. Figures 3(a) and 3(b)):

Figure 3
figure 3

A visual proof of Lemma 1.

  • P a * be the path obtained as the concatenation of the path P a s o l taken from its starting node until v i with the path P b s o l taken from v i until its end node (so that P a * covers both P i i n and P j i n ).

  • P b * be the path obtained as the concatenation of the path P b s o l taken from its starting node until v i with the path P a s o l taken from v i until its end node.

We have to show that the path cover P * = P 1 s o l , , P k s o l \ P a s o l , P b s o l P a * , P b * satisfies the original constraints V , E , P i n . Since P a * and P b * use exactly the same edges as P a s o l and P b s o l , then V and E are satisfied. Moreover, the only two problematic cases are when there is a subpath constraint P k i n which has v i as internal node and is satisfied only by P a s o l , or it is satisfied only by P b s o l . Denote, analogously, by u k and v k the endpoints of P k i n . From the fact that the input was transformed at Step 3, P i i n and P j i n are not completely included in P k i n .

Case 1. P k i n is satisfied only by P a s o l (Figures 3(a) and 3(b)). Since P i i n is not completely included in P k i n , u k is an internal node of P i i n ; thus, a suffix of P i i n is prefix also of P k i n . From the fact that the common part between P i i n and P j i n is longest possible, we have that vertices u j , u k , v i appear in this order in P i i n . Thus, P k i n is also satisfied by P b * , since u k appears after u j on Pi.

Case 2. P k i n is satisfied only by P b s o l , and it is not satisfied by P a * (Figure 3(c)). This means that P k i n starts on P b s o l before u j and, since it contains v i , it ends on P b s o l after v i . From the fact that P j i n in not completely included in P k i n , v k is an internal node of P j i n , and thus a suffix of P k i n equals a prefix of P j i n . This common part is now longer than the common suffix/prefix between P i i n and P j i n , which contradicts maximality of the suffix/prefix between P i i n and P j i n . This proves the lemma.

The remaining steps can be seen as analogous to the reduction in [20].

Step 5. for every path P i i n P i n do:

say P i i n starts in node s and ends in node t;

P i n : = P i n \ P i i n ;
E G : = E G s , t ;
E = E s , t ;

In this step, we represent each subpath constraint by an edge constraint. Its correctness is guaranteed by the fact that by now, no two subpath constraints are such that a suffix of the first is a prefix of the second. We should stress out that if there are more paths with the same endpoints, we may add parallel edges to the DAG. However, in Step 6 below these parallel edges will be transformed into parallel paths of length 2, rendering the DAG simple again.

Step 6. for every edge e E do:

E : = E e ;

subdivide the edge e by introducing a node v e in the middle of it;

V : = V e ;

At this point, we have transformed all subpath constraints into edge constraints. The edge constraints can be modeled as node constraints by simply subdividing each edge and introducing a new node in the middle of it; this node is then added to V .

Step 7. G:= T(G)

We replace G by its transitive closure, since in Step 8 below we are going to remove from G all vertices not in V .

Step 8. Remove from G all nodes not in V ;

Since only the nodes in V have to be covered by the paths in the path cover, we remove all other nodes. This is correct, since, at Step 7 above, we introduced all edges between nodes v and v such that v was reachable from v through some nodes not in V .

Step 9. Compute a MPC for the resulting graph G;

This can be done by any method discussed previously.

Step 10. Postprocess the paths obtained at Step 9 above by reverting the transformations executed at Steps 1-8, in reverse order.

Theorem 2 Problem MPC-SC on a graph with n nodes, m edges, c subpath or edge constraints, and with N being the sum of subpath constraint lengths, can be solved by solving the classical MPC Problem in a graph with O(n + c) nodes and O(n2 + c) edges. This graph can be computed in time O(N + c2 + n2), thus the complexity of Problem MPC-SC is O N + c 2 + n + c n 2 + c .

Proof The complexity of the pre-processing phase is dominated by Steps 3 and 4. Step 3 can be solved by first building a (generalized) suffix tree on the concatenation of subpath constraints with a distinct symbol # i added after each constraint sequence P i i n . This can be done in O(N) time even on our alphabet of size O(n) [32].

Then one can do as follows during depth-first traversal of the tree: If a leaf corresponding to the suffix starting at the beginning of subpath constraint P i i n has an incoming edge labeled by only # i and its parent has still other children, then the constraint is a substring of another constraint and must be removed (together with the leaf).

For Step 4, we compute all pairs longest suffix-prefix overlaps between the subpath constraints using an O(N + c2) time algorithm in [33, Theorem 7.10.1, page 137], [45] with [32] as a subroutine for the sake of large alphabet. The output can be casted to a double-linked list L containing elements of the form (i, j, len, prev i , next i , prev j , next j ) in decreasing order of the overlap length, len, between constraints P i i n and P j i n . Pointers prev i , next i , prev j , and next j tell the previous/next occurrences of the tuple having i as the first element and j as the second element, respectively. Then popping the first tuple from L tells us the first constraints to merge, and following prev and next pointers we can remove all overlaps no longer relevant for next mergings; when removing, we need to make sure the nested double-linked lists formed by the prev and next pointers are also updated. Continuing like this until the list L is empty gives all the overlaps in total O(c2) time. Notice that the new merged constraints do not need to be separately taken into account in overlap computation; no completely new overlaps can be created due to Step 3.

Merging itself requires a similar linked list structure being a special case of unionfind: All the constraints are represented as double-linked lists with node numbers as elements. Merging can be done by linking the double-linked lists together, removing the extra overlapping part from the latter list and redirecting its start pointer to point inside the newly formed merged list. When finished with merging, the new constraints are exactly those old constraints whose start pointers still point to the beginning of a node list. The complexity of merging is thus O(N).

The weighted case

To solve the MW-MPC-SC Problem, we build on the reduction in [36] to a network flow problem. This reduction will allow the addition of edge weights and of constraints on the starting/ending nodes of the solution paths. Note that these constraints S and T cannot be included in the reduction of the MPC Problem to a bipartite matching problem. Moreover, the heuristic in [20, Sec. 2.4.2] of arbitrarily extending the paths in a minimum weight MPC towards sources/sinks cannot be proved to be correct.

Given an input G , V , E , S , T , w for the MW-MPC-SC Problem, we pre-process the graph G by Steps 1-6 of the unweighted case (shown in the previous section). After this pre-processing, we have correctly modeled all subpath constraints by node constraints. On the transformed graph G, we then do a similar reduction as for Thm. 1 (see Figure 4):

Figure 4
figure 4

In Fig. 4(a), an input DAG G with two subpath constraints P 1 and P 2 ; we take. V =V G , E = , S = {a, d, e} and T = {f, c}; weights are not drawn. In Fig. 4(b), the graph transformed by Steps 1-6; the vertices still in V are drawn as circles, other vertices as squares. In Fig. 4(c), the reduction to a min-cost circulation problem; the edges with flow lower bound 1 are labeled as '1'; other edges have flow lower bound 0. In a min-cost circulation of value 3, all highlighted edges have flow value 1, except for (f, t) with flow value 2, and (t, s) with value 3. Any decomposition of the min-cost circulation into 3 paths gives the solution for Problem MW-MPC-SC.

  1. 1

    We replace each node v V by an edge (v1, v2) such that all in-neighbors of v are now in-neighbors of v1, and all out-neighbors of v are now out-neighbors of v2. If node v was introduced at Step 6 to model an edge coming from a subpath constraint P , then the cost per unit of flow of (v1, v2) is the sum of the weights of the edges of P ; otherwise, it is 0.

  2. 2.

    For each edge e of G, if e is an original edge of G, we set its flow lower bound to 0 and its cost per unit of flow to w(e); otherwise we set both to 0.

  3. 3.

    The global source s has out-going edges precisely to the nodes in the set S, and the global sink t has in-coming edges precisely from the nodes in T ; we also add the edge (t, s). All edges incident to s or t have flow flower bound 0 and cost 0, except for the edge (t, s) having as cost the sum of all edge weights (plus a positive constant if all are 0). This guarantees, like before, that any min-cost circulation is also a min-flow.

Note that, by reducing to a flow problem, we do not have to perform Steps 7 and 8 anymore, since the coverage constraints are now modeled as flow lower bound constraints. As in the case of Thm. 1, we compute a min-cost circulation on this transformed input G, that is, a function f : E(G) ℕ which satisfies all the flow conservation property for all nodes, satisfies all edge lower bounds, and minimizes e E G f e . We then decompose the circulation (from which we remove the edge (t, s)) into paths, and covert these paths into paths of the original input graph. This is done by reverting the transformations executed at Steps 1-6, in reverse order (as done for the MPC-SC Problem). As before, these paths form a MPC satisfying all constraints, and they also start and end in vertices of S and T , respectively (because of the way s and t were connected to the other nodes of the graph). Since these paths arise from a min-cost circulation, then they also form a minimum weight MPC satisfying the input constraints. The flow network has only flow lower bounds, thus we can again apply the algorithm of [49], to get the following:

Theorem 3 Problem MW-MPC-SC on a graph with n nodes, m edges, c subpath or edge constraints, and with N being the sum of subpath constraint lengths, can be solved by reducing it to a min-cost circulation problem on a network with O(n + c) nodes and O(m + c) edges, and with flow lower bounds only. This network can be computed in time O(N + c2 + m), and the complexity of Problem MW-MPC-SC becomes O(N + (n + c)2 log(n + c) + (n + c)(m + c)).

The MPC with Paired Subpaths Constraints (MPC-PSC) Problem

The NP-completeness proof

In this section we show that the MPC-PSC Problem is NP-complete. Our reduction is from the NP-complete problem of deciding whether the chromatic number of a graph G, χ(G), is 3 [50]. We will show that it is actually NP-complete to determine if the MPC-PSC Problem admits a solution with just 3 paths, even on planar DAGs, of width 2, series-parallel, when only paired subpath constraints are imposed, and all subpaths are just edges.

Let G = (V, E) with V = {v1,..., v n } and E = {e1,..., e m } be any non-bipartite graph; our question is whether χ(G) = 3. We reformulate this question by building up the DAG P(G) drawn in Figure 5. P(G) consists of a first stage of n blocks corresponding to the n vertices of G, and a second stage of m blocks corresponding to each edge e k = v i k v j k of G, k ∈ {1, ..., m}. Only some of the nodes and edges of P (G) have been labeled; when an edge is labeled [L], we mean that in the family of paired subpath constraints we have the constraint (L, [L]).

Figure 5
figure 5

A reduction from chromatic number 3 to the MPC-PSC Problem.

Theorem 4 Problem MPC-PSC is NP-complete.

Proof We show that the graph G = (V, E) has χ(G) = 3 if and only if the DAG P(G) drawn in Figure 5 admits a solution to Problem MPC-PSC with 3 paths.

(⇒) Suppose that χ(G) = 3. Definitely, we need at least three paths to solve P(G), since the three edges v1, X1, Y1 exiting from node 0 cannot be covered by the same path, and each of them is mentioned in some constraint. By definition, G is 3 colorable if and only if V can be partitioned into three sets V A , V B , V C such that no edge of G is contained in any of them. We use these three sets to build up the three solution paths for Problem MPC-PSC as follows: for all X ∈ {A, B, C}, in the first stage (until node n) path P X picks up all edges labeled with a node in V X and no edge labeled with a node in V\V X ; next, in the second stage (from node n until node n + m), P X picks up those edges v i k such that v i k belongs to P X . This is possible, since no edge e k = v i k v j k is contained in the same color class, and consequently the two of edges of P (G) labeled v i k and v j k do not belong to the same path among {P A , P B , P C }. Thus, v i k and v j k do not have to be both covered by the same solution path. Therefore, the three paths P A , P B , P C satisfy all paired subpath constraints, and are a solution to Problem MPC-PSC.

(⇐) Suppose the DAG P(G) drawn in Figure 5 admits a solution to Problem MPC-PSC with 3 paths P A , P B , P C . Then, we partition V into three color classes A, B, C by setting v i ∈ × if and only if the edge of P(G) labeled by v i (in the first stage from node 0 to node n) belongs to P X , for all X ∈ {A, B, C}. To see that {A, B, C} is indeed a partition of V , observe that in each block k of the first stage of P(G), no two paths in {P A , P B , P C } can share an edge, since all three edges v k , X k , Y k appear in some constraint. Therefore, each edge v k appears in exactly one of {P A , P B , P C }. The proof that the partition {A, B, C} is also a proper coloring of G encounters no difficulty, as the rationale behind the reduction was illustrated in the forward implication.

Corollary 1 For no ε > 0 there exists a 4 3 - -approximation algorithm for Problem MPC-PSC unless P=NP. Moreover, the problem is not FPT when parameterized on OPT (the minimum number of paths in a solution).

The FPT algorithm

In the previous section, we obtained the NP-completeness for the decision problem OPT = 3; this rules out a Dynamic Programming approach for Problem MPC-PSC. In this section, we show that if OPT = 2, then the problem can be solved in polynomial time. This also leads to an FPT algorithm on the total number of constraints.

For any constraint of the input DAG G that is made up of a pair (P1, P2) of subpaths of G, we may assume that there exists a directed path of G completely containing both P1 and P2, otherwise, the input instance is infeasible. Given any two constraints X and Y (X and Y can be nodes, edges, or pairs of subpaths), we say that X and Y are compatible if there is a directed path of G completely containing both X and Y . We exploit the following structural property:

Lemma 2 Let C be a set of constraints on a DAG G. There exists a directed path P in G which satisfies all constraints in C if and only if any two constraints in C are compatible.

Proof The forward implication is clear from the definition. For the backward implication, recall that the width of a DAG denotes the maximum size of an anti-chain of it. We claim that the union of the constraints in C is a DAG of width 1. Indeed, if it were of width 2 it would contain two nodes v1 and v2 which are pairwise not reachable by a directed path, thus forming an anti-chain of size 2. Since we assumed that for all pairs (P1, P2) of subpaths constraints of G, there exists a directed path of G completely containing both P1 and P2, this implies that v1 and v2 belong to two different constraints X and Y in C. Thus, X and Y are not compatible, a contradiction.

Theorem 5 Given an instance for Problem MPC-PSC, we can decide in polynomial time if OPT = 2, and if so, find the two solution paths. Moreover, Problem MPC-PSC is fixed-parameter tractable (FPT) in the total number C of input constraints.

Proof We build an incompatibility graph from the input constraints: every constraint is represented by a node, and we add an edge between two constraints iff they are incompatible. Then, OPT = 2 iff this incompatibility graph is bipartite, and the two classes of the bipartition give the two solution paths; this can be done in time O(C2). If OPT > 2, then we try all possible ways of partitioning the set of all input constraints (the number of these possibilities is a function only on C), and check that each class of the partition consists of pairwise compatible constraints.