Kernelizations for the Hybridization Number Problem on Multiple Nonbinary Trees

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8747)

Abstract

A well-studied problem in phylogenetics is to determine the minimum number of hybridization events necessary to explain conflicts among several evolutionary trees, e.g. from different genes. An evolutionary history with hybridization events (or, more generally, reticulations) can be described by a rooted leaf-labelled directed acyclic graph, which is called a phylogenetic network. The reticulation number of such a phylogenetic network can be defined as the sum of all indegrees minus the number of vertices plus one. The considered problem can now formally be stated as follows. Given a finite set \(X\), a collection \(\mathcal {T}\) of rooted phylogenetic trees on \(X\) and \(k\in \mathbb {N}^{+}\), the Hybridization Number problem asks if there exists a rooted phylogenetic network on \(X\) that displays all trees from \(\mathcal {T}\) and has reticulation number at most \(k\). We show that Hybridization Number admits a kernel of size \(4k(5k)^t\) if \(\mathcal {T}\) contains \(t\) (not necessarily binary) rooted phylogenetic trees. In addition, we show a slightly different kernel of size \(20k^2(\varDelta ^+-1)\) with \(\varDelta ^+\) the maximum outdegree of the input trees.

1 Introduction

In phylogenetics, the central challenge is to construct a plausible evolutionary history for a set of contemporary species \(X\) given incomplete data. This usually concerns biological evolution, but the paradigm is equally applicable to more abstract forms of evolution, e.g. natural languages [16]. Classically an evolutionary history is modelled by a rooted phylogenetic tree, essentially a rooted tree in which the leaves are bijectively labelled by \(X\) [18]. In recent years, however, there has been growing interest in generalizing this model to directed acyclic graphs, i.e., to rooted phylogenetic networks [1, 9, 15]. In the latter model, reticulations are of central importance, which are vertices of indegree 2 (or higher); these are used to represent non-treelike evolutionary phenomena such as hybridization and lateral gene transfer. This has naturally given rise to the Hybridization Number problem: given a set of phylogenetic trees \(\mathcal {T}\) on the same set of taxa \(X\), construct a phylogenetic network on \(X\) with as few indegree-2 vertices as possible, such that an image of every tree in \(\mathcal {T}\) is embedded in the network [2]. We defer formal definitions to the preliminaries.

Hybridization Number has attracted considerable interest in a short space of time. Even in the case when \(\mathcal {T}\) consists of two binary (i.e., bifurcating) trees the problem is NP-hard, APX-hard [5] and in terms of approximability is a surprisingly close relative of the problem Directed Feedback Vertex Set [13, 19]. On the positive side, this variant of the problem is fixed-parameter tractable (FPT) in parameter \(k\), the minimum number of indegree-2 vertices required. Initially this was established via kernelization [4], but more recently efficient bounded-search algorithms have emerged with \(O( 3.18^k \cdot poly(n))\) being the current state of the art, where \(n=|X|\) [21].

In this article we focus on the general case when \(|\mathcal {T}| \ge 2\) and the trees in \(\mathcal {T}\) are allowed to be nonbinary (i.e., not necessarily binary). This causes complications for two reasons. First, when \(|\mathcal {T}| > 2\) the popular “maximum acyclic agreement forest” abstraction breaks down, a central pillar of algorithms for the \(|\mathcal {T}|=2\) case. Second, in the nonbinary case the images of the trees in the network are allowed to be more “resolved” than the original trees. (More formally, an input tree \(T\) is seen as being embedded in a network \(N\) if \(T\) can be obtained from a subgraph of \(N\) by contracting edges.) The reason for this is that vertices with outdegree greater than two are used by biologists to model uncertainty in the order that species diverged. Both factors complicate matters considerably. Consequently, progress has been more gradual.

For the case of multiple binary trees, there exists a polynomial kernel [20], various heuristics [6, 7, 22] and an exact approach without running-time bound [23].

For the case of two nonbinary trees, there is also a polynomial kernel [14], based on a highly technical kernelization argument, and a simpler FPT algorithm based on bounded search [17].

This leaves the case of an unbounded number of nonbinary trees as the main variant for which it is unclear whether the problem is FPT. There has, however, been some partial progress: for fixed \(k\) the problem is polynomial-time solvable and the problem is FPT if the number of trees is bounded or the maximum outdegree of the trees is bounded [11]. The main problem with the result from [11] is its theoretical character: it is indirect (based on [12]) and yields a bounded-search algorithm with astronomical running time.

Here we mirror the bounded-search result from [11] by showing that Hybridization Number admits a kernel of size \(4k(5k)^t\) if \(\mathcal {T}\) contains \(t\) nonbinary rooted phylogenetic trees. In addition, we show a slightly different kernel of size \(20k^2(\varDelta ^+-1)\) with \(\varDelta ^+\) the maximum outdegree. We believe this result is important for several reasons.

First, it is the first polynomial kernel for any fixed number of nonbinary trees, and the first polynomial kernel for an unbounded number of trees with outdegrees bounded by any constant.

Second, it represents a significant step forward in our understanding of the complexities associated with nonbinary trees. In particular, the result of [14] is so technical due to the difficulties of dealing with so-called common chains, which in the case of binary trees are much easier to deal with [4, 20]. The sister result of [11] avoids this technical analysis by exhaustive guessing which is mathematically unsatisfying and is one of the reasons for its purely theoretical running time. Here, for the first time, we present a simple and unified kernelization strategy for dealing with common chains which avoids technical case analysis (and exhaustive guessing) and can cope with the chains as they unfold across many trees.
Fig. 1. A (rooted phylogenetic) network \(N\) and a (rooted phylogenetic) tree \(T\). Network \(N\) is binary, has two reticulations (unfilled) and reticulation number 2. Tree \(T\) is displayed by \(N\) because it can be obtained from \(N\) by deleting the dotted edges and contracting the dashed edges.

Third, the \(4k(5k)^t\) kernel introduces an interesting way to deal with multiple parameters simultaneously. It is based on searching, for decreasing \(q\), for certain substructures called “\(q\)-star chains”, which are chains that are common to all \(t\) input trees and form stars in \(q\) of the input trees. When we encounter such substructures we truncate them to a size that is a function of \(q\) and \(k\). Since we loop through all possible values of \(q\) (\(0\le q\le t\)), we eventually truncate all common substructures. The correctness of each step heavily relies on the fact that substructures for larger values of \(q\) have already been truncated. However, when \(q\) decreases, the size to which substructures can be reduced increases (as will become clear later). This has the effect that the size of kernelized instances is a function of \(k\) and \(t\) and not of \(k\) only. For the \(20k^2(\varDelta ^+-1)\) kernel, we use a similar but simpler technique.

2 Preliminaries

Let \(X\) be a finite set. A rooted phylogenetic \(X\)-tree is a rooted tree with no vertices with indegree 1 and outdegree 1, a root with indegree 0 and outdegree at least 2, and leaves bijectively labelled by the elements of \(X\). We identify each leaf with its label. We henceforth call a rooted phylogenetic \(X\)-tree a tree for short. A tree \(T\) is a refinement of a tree \(T'\) if \(T'\) can be obtained from \(T\) by contracting edges.

Throughout the paper, we refer to directed edges simply as edges. If \(e=(u,v)\) is an edge, then we say that \(v\) is a child of \(u\), that \(u\) is a parent of \(v\) and that \(v\) is the head of \(e\).

A rooted phylogenetic network is a directed acyclic graph with no vertices with indegree 1 and outdegree 1 and leaves bijectively labelled by the elements of \(X\). Rooted phylogenetic networks will henceforth be called networks for short in this paper. A tree \(T\) is displayed by a network \(N\) if \(T\) can be obtained from a subgraph of \(N\) by contracting edges. Note that, without loss of generality, we may assume that edges incident to leaves are not contracted. See Fig. 1 for an example. Using \(d^-(v)\) to denote the indegree of a vertex \(v\), a reticulation is a vertex \(v\) with \(d^-(v)\ge 2\). The reticulation number of a network \(N\) with vertex set \(V\) is given by
$$ r(N)=\sum _{v\in V : d^-(v)\ge 2}(d^-(v)-1). $$
Given a set of trees \(\mathcal {T}\) on \(X\), we use \(r(\mathcal {T})\) to denote the minimum value of \(r(N)\) over all phylogenetic networks \(N\) on \(X\) that display \(\mathcal {T}\). We are now ready to formally define the problem we consider.
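As a minimal sketch of the formula for \(r(N)\), the following Python snippet computes the reticulation number from a list of directed edges. The edge-list representation, the function name `reticulation_number` and the toy network are assumptions made for illustration only; they do not come from the paper. For a network with a single root, the value agrees with the alternative expression from the abstract (number of edges minus number of vertices plus one).

```python
from collections import Counter


def reticulation_number(edges):
    """Sum of (indegree - 1) over all vertices with indegree >= 2."""
    indegree = Counter(head for _tail, head in edges)
    return sum(d - 1 for d in indegree.values() if d >= 2)


# Hypothetical toy network on X = {a, b, c}: vertex "h" is the only
# reticulation, with parents "u" and "v".
edges = [
    ("root", "u"), ("root", "v"),
    ("u", "a"), ("u", "h"),
    ("v", "h"), ("v", "c"),
    ("h", "b"),
]

vertices = {x for edge in edges for x in edge}
r = reticulation_number(edges)
# Alternative expression from the abstract, valid for a single-rooted network.
r_alt = len(edges) - len(vertices) + 1
print(r, r_alt)
```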

Problem: Hybridization Number.

Instance: A finite set \(X\), a collection \(\mathcal {T}\) of rooted phylogenetic trees on \(X\) and \(k\in \mathbb {N}^{+}\).

Question: Is \(r(\mathcal {T})\le k\), i.e., does there exist a phylogenetic network \(N\) on \(X\) that displays \(\mathcal {T}\) and has \(r(N)\le k\)?

A network is called binary if each vertex has indegree and outdegree at most 2 and if each vertex with indegree 2 has outdegree 1. By the following observation we may restrict to binary networks.

Observation 1 [11]. If there exists a network \(N\) on \(X\) that displays \(\mathcal {T}\) then there exists a binary network \(N'\) on \(X\) that displays \(\mathcal {T}\) such that \(r(N)=r(N')\).

The observation follows directly from noting that, for each network \(N\), there exists a binary network \(N'\) with \(r(N')=r(N)\) such that \(N\) can be obtained from \(N'\) by contracting edges. Hence, any tree displayed by \(N\) is also displayed by \(N'\).

A subtree \(T'\) of a network \(N\) (or of a tree \(T\)) is said to be pendant if no vertex of \(T'\) other than possibly its root has a child that is not in \(T'\). A pendant subtree is called trivial if it has only one leaf.
Fig. 2. A network \(N\) and the 4-reticulation generator \(G\) underlying \(N\). Generator \(G\) has two vertex sides \(s_8\) and \(s_{15}\) and 13 edge sides. For example, leaves \(d,e\) and \(f\) are on edge side \(s_6\) and leaf \(g\) is on vertex side \(s_8\).

The notion of “generators” is used to describe the underlying structure of a network without nontrivial pendant subtrees [12]. Let \(k\in \mathbb {N}^{+}\). A binary \(k\)-reticulation generator is defined as an acyclic directed multigraph with a single root with indegree 0 and outdegree 1, exactly \(k\) vertices with indegree 2 and outdegree at most 1, and all other vertices have indegree 1 and outdegree 2. See Fig. 2 for an example. Let \(N\) be a binary network with no nontrivial pendant subtrees and with \(r(N)=k\). Then, a binary \(k\)-reticulation generator is said to be the generator underlying \(N\) if it can be obtained from \(N\) by adding a new root with an edge to the old root, deleting all leaves and suppressing all resulting indegree-1 outdegree-1 vertices. In the other direction, \(N\) can be reconstructed from its underlying generator by subdividing edges, adjoining a leaf to each vertex that subdivides an edge, or has indegree 2 and outdegree 0, via a new edge, and deleting the outdegree-1 root. The sides of a generator are its edges (the edge sides) and its vertices with indegree 2 and outdegree 0 (the vertex sides). Thus, each leaf of \(N\) is on a certain side of its underlying generator. To formalize this, consider a leaf \(x\) of a binary network \(N\) without nontrivial pendant subtrees and with underlying generator \(G\). If the parent \(p\) of \(x\) has indegree 2, then \(p\) is a vertex side of \(G\) and we say that \(x\) is on side \(p\). If, on the other hand, the parent \(p\) of \(x\) has indegree 1 and outdegree 2, then \(p\) is used to subdivide an edge side \(e\) of \(G\) and we say that \(x\) is on side \(e\). We say that two leaves \(x\) and \(y\) (with \(x\ne y\)) are on the same side of \(N\) if the underlying generator of \(N\) has an edge side \(e\) such that \(x\) and \(y\) are both on side \(e\). The following lemma from [20] will be useful.

Lemma 1 [20]. If \(N\) is a binary phylogenetic network with no nontrivial pendant subtrees and \(r(N)=k>0\) and if \(G\) is its underlying generator, then \(G\) has at most \(4k-1\) edge sides, at most \(k\) vertex sides and at most \(5k-1\) sides in total.

A kernelization of a parameterized problem is a polynomial-time algorithm that maps an instance \(x\) with parameter \(k\) to an instance \(x'\) with parameter \(k'\) such that (1) \((x',k')\) is a yes-instance if and only if \((x,k)\) is a yes-instance, (2) the size of \(x'\) is bounded by a function \(f\) of \(k\), and (3) the size of \(k'\) is bounded by a function of \(k\) [8]. A kernelization is usually referred to as a kernel and the function \(f\) as the size of the kernel. Thus, a parameterized problem admits a polynomial kernel if there exists a kernelization with \(f\) being a polynomial. A parameterized problem is fixed-parameter tractable (FPT) if there exists an algorithm that solves the problem in time \(O(g(k)|x|^{O(1)})\), with \(g\) being some function of \(k\) and \(|x|\) the size of \(x\). It is well known that a parameterized problem is fixed-parameter tractable if and only if it admits a kernelization and is decidable. However, there exist fixed-parameter tractable problems that do not admit a kernel of polynomial size unless the polynomial hierarchy collapses [3]. Kernels are of practical interest because they can be used as polynomial-time preprocessing which can be combined with any algorithm (usually an exponential-time exact algorithm) solving the problem.
Fig. 3. Example instance of Hybridization Number consisting of four trees that have a common pendant subtree on \(\{f,g,h\}\) and a common 1-star chain \((d,c,b,a)\). Chain \((d,c,b,a)\) is pendant in \(T_1\) and \(T_2\) but not in \(T_3\) and \(T_4\). It is a 1-star chain because all its leaves have a common parent in only \(T_2\).

3 A Polynomial Kernel for a Bounded Number of Trees

We first introduce the following key definitions. Let \(\mathcal {T}\) be a set of trees. A tree \(T'\) is said to be a common pendant subtree of \(\mathcal {T}\) if it is a refinement of a pendant subtree of each \(T\in \mathcal {T}\) and \(T'\) is said to be nontrivial if it has at least two leaves.

Definition 1

If \(T\) is a tree on \(X\), \(p\ge 2\) and \(x_1,\ldots ,x_p\in X\), then \((x_1,\ldots ,x_p)\) is a chain of \(T\) if:
  1. there exists a directed path \((v_1,\ldots ,v_t)\) in \(T\), for some \(t\ge 1\);
  2. each \(x_i\) is a child of some \(v_j\);
  3. if \(x_i\) is a child of \(v_j\) and \(i < p\), then \(x_{i+1}\) is either a child of \(v_j\) or of \(v_{j+1}\);
  4. for \(i\in \{2,\ldots ,t-1\}\), the children of \(v_i\) are all in \(\{v_{i+1},x_1,x_2,\ldots ,x_p\}\).
If, in addition, \(t=1\) or the children of \(v_t\) are all in \(\{x_1,\ldots ,x_p\}\), then \((x_1,\ldots ,x_p)\) is said to be a pendant chain of \(T\). The length of the chain is \(p\).
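The four conditions of Definition 1 can be checked mechanically. The sketch below is one possible way to do so, under assumptions that are not part of the paper: a tree is stored as a dictionary mapping each non-root vertex to its parent, the chain is passed as a list of leaf labels, and the path \((v_1,\ldots ,v_t)\) is taken to be the shortest candidate, namely the distinct parents of \(x_1,\ldots ,x_p\) in order. The names `chain_path`, `is_chain` and `is_pendant_chain` and the toy tree are hypothetical.

```python
def children_of(parent):
    """Invert a child -> parent map into a parent -> set-of-children map."""
    kids = {}
    for child, par in parent.items():
        kids.setdefault(par, set()).add(child)
    return kids


def chain_path(parent, chain):
    """Return the candidate path (v_1, ..., v_t), or None if conditions 1-3 fail."""
    if len(chain) < 2 or len(set(chain)) != len(chain):
        return None
    if any(x not in parent for x in chain):
        return None
    path = []
    for x in chain:
        p = parent[x]
        if not path or path[-1] != p:
            # Condition 3: the next parent must be the child of the previous one.
            if path and parent.get(p) != path[-1]:
                return None
            path.append(p)
    return path


def is_chain(parent, chain):
    path = chain_path(parent, chain)
    if path is None:
        return False
    kids = children_of(parent)
    allowed = set(chain)
    # Condition 4: interior path vertices have no children outside the chain.
    return all(kids[path[i]] <= allowed | {path[i + 1]}
               for i in range(1, len(path) - 1))


def is_pendant_chain(parent, chain):
    if not is_chain(parent, chain):
        return False
    path = chain_path(parent, chain)
    return len(path) == 1 or children_of(parent)[path[-1]] <= set(chain)


# Hypothetical toy tree with leaves z, a, b, c, d, e.
tree = {
    "v1": "r", "z": "r",
    "a": "v1", "v2": "v1",
    "b": "v2", "c": "v2", "v3": "v2",
    "d": "v3", "e": "v3",
}
print(is_chain(tree, ["a", "b", "c", "d"]))          # chain, but not pendant
print(is_pendant_chain(tree, ["a", "b", "c", "d", "e"]))
```

In this toy tree, \((a,c,d)\) is rejected because the interior vertex `v2` has the child `b`, which lies outside the chain.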

A chain is said to be a common chain of \(\mathcal {T}\) if it is a chain of each tree in \(\mathcal {T}\). The following observations follow easily from the definition of a chain.

Observation 2

If \((x_1,\ldots ,x_p)\) is a common chain of \(\mathcal {T}\) and \(1\le i < j \le p\), then \((x_i,\ldots ,x_j)\) is a common chain of \(\mathcal {T}\).

Observation 3

If \((x_1,\ldots ,x_p)\) is a chain of a tree \(T\), \(1\le i < j \le p\) and \(x_i\) and \(x_j\) have a common parent in \(T\), then \(x_i,\ldots ,x_j\) have a common parent in \(T\).

Definition 2

If \(\mathcal {T}\) is a set of trees on \(X\) and \(x_1,\ldots ,x_p\in X\), then \((x_1,\ldots ,x_p)\) is a common \(q\)-star chain of \(\mathcal {T}\) if:
  (a) \((x_1,\ldots ,x_p)\) is a common chain of \(\mathcal {T}\) and
  (b) in precisely \(q\) trees of \(\mathcal {T}\), all of \(x_1,\ldots ,x_p\) have a common parent.

We say that a common \(q\)-star chain \((x_1,\ldots ,x_p)\) of \(\mathcal {T}\) is maximal if there is no common \(q\)-star chain \((y_1,\ldots ,y_{p'})\) of \(\mathcal {T}\) with \(\{x_1,\ldots ,x_p\} \subsetneq \{y_1,\ldots ,y_{p'}\}\). Notice that a common 0-star chain is a common chain that does not form a star in any tree. An illustration of the above definitions is in Fig. 3.
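Condition (b) of Definition 2 amounts to a simple count. The following sketch computes the value of \(q\) for a given chain, assuming (as an illustration, not as the paper's data structure) that each tree is a dictionary mapping every non-root vertex to its parent and that the chain is already known to be a common chain. The function name `star_count` and both toy trees are hypothetical.

```python
def star_count(trees, chain):
    """Number of trees in which all leaves of the chain share one parent."""
    return sum(1 for parent in trees
               if len({parent[x] for x in chain}) == 1)


# Two hypothetical trees on X = {a, b, c, z}.
tree_1 = {"u": "r", "z": "r", "a": "u", "b": "u", "c": "u"}      # a, b, c form a star
tree_2 = {"v1": "r", "z": "r", "a": "v1", "v2": "v1",
          "b": "v2", "c": "v2"}                                  # a sits above b, c

q = star_count([tree_1, tree_2], ["a", "b", "c"])
print(q)
```

Here \((a,b,c)\) has a common parent only in the first tree, so it would be a common 1-star chain of these two toy trees, whereas the subchain \((b,c)\) has a common parent in both.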

We are now ready to describe the kernelization, which is in Algorithm 1.

It is not too difficult to see that the subtree reduction preserves the reticulation number and can be applied in polynomial time.

To prove correctness of the chain reduction, we use two lemmas which have been omitted due to space constraints and can be found in the full version of this paper. The idea of these lemmas is illustrated in Fig. 4. The two trees \(T_1\) and \(T_2\) in this figure have a common chain \((a,b,c,d,e)\). Both trees are displayed by network \(N\). However, the leaves of the chain are spread out over different sides of the underlying generator \(G\) of \(N\). To prove correctness of the chain reduction, we want to argue that there exists a modified network \(N'\) in which the leaves of the chain \((a,b,c,d,e)\) all lie on the same side. Moreover, network \(N'\) should display all input trees and its reticulation number should not be higher than the reticulation number of \(N\).

In \(T_1\), all leaves of the chain have a common parent. For this case, Lemma 4 (omitted) argues that all leaves of the chain can be moved to any side that contains at least one of its leaves, and the resulting network still displays \(T_1\).

In \(T_2\), there are two leaves \(x_i=d\) and \(x_j=e\) that are on the same side of \(G\) (the blue side \(s_b\)) and that do not have a common parent in \(T_2\). For this case, Lemma 5 (omitted) argues that all the leaves of the chain can be moved to side \(s_b\), and the resulting network will still display \(T_2\). (Note that we cannot move all the leaves of the chain to the red side \(s_r\), even though it contains two leaves \(b,c\) of the chain, because \(b\) and \(c\) have a common parent in \(T_2\)).

Hence, the network \(N'\) obtained by moving all leaves of the chain to the blue side \(s_b\) displays both \(T_1\) and \(T_2\). Furthermore, \(r(N')=r(N)=2\).
Fig. 4. Two trees \(T_1\) and \(T_2\) with a common chain \((a,b,c,d,e)\) highlighted in blue (grey), a network \(N\) that displays these trees, the network \(N'\) as constructed in Lemmas 4 and 5, and the underlying generator \(G\) of both networks. Dashed and dotted edges are used to indicate that \(T_2\) can be obtained from either of \(N\) and \(N'\) by deleting the dotted edges and contracting the dashed edges (Color figure online).

The next lemma shows correctness of the chain reduction, and thereby of Algorithm 1. It is based on the idea that, if a chain is long enough, one of Lemmas 4 and 5 applies for each tree.

Lemma 2

Let \(q\in \{0,\ldots ,t-1\}\) (with \(t=|\mathcal {T}|\)) and let \((X,\mathcal {T},k)\) be an instance of Hybridization Number without nontrivial common pendant subtrees or maximal common \(q'\)-star chains of more than \((5k)^{t-q'}\) leaves, for \(q<q'\le t-1\). Let \((X',\mathcal {T}',k)\) be the instance obtained after applying the chain reduction to a maximal common \(q\)-star chain \(C=(x_1,\ldots ,x_p)\) of \(\mathcal {T}\) with \(p>(5k)^{t-q}\). Then \(r(\mathcal {T})\le k\) if and only if \(r(\mathcal {T}')\le k\).

Proof

It is clear that if \(r(\mathcal {T})\le k\) then \({r(\mathcal {T}')\le k}\) because the chain reduction only deletes leaves (and suppresses and deletes vertices).

It remains to prove the other direction. Assume that \(r(\mathcal {T}')\le k\), i.e., there exists a network \(N'\) that displays \(\mathcal {T}'\) and has \(r(N')\le k\). Define \(m:=(5k)^{t-q-1}\). Hence, there are no common chains of \(\mathcal {T}\) of more than \(m\) leaves that have a common parent in more than \(q\) of the trees.

Let \(C'=(x_1,\ldots ,x_{5km})\). First observe that \(C'\) is a common chain of \(\mathcal {T}'\) and, moreover, that \(C'\) is a common \(q'\)-star chain of \(\mathcal {T}'\) with \(q'\ge q\). Moreover, we claim the following.

Claim (1). Any two leaves in \(\{ x_1,\ldots ,x_{5km-1}\}\) have a common parent in a tree \(T\in \mathcal {T}\) if and only if they have a common parent in the corresponding tree \(T'\in \mathcal {T}'\).

This claim follows directly from the observation that, in the chain reduction, the parents of \(x_1,\ldots ,x_{5km-1}\) cannot become outdegree-1 and are therefore not being suppressed. Correctness of the next claim can be verified in a similar way.

Claim (2). If \(C\) is not pendant in \(T\in \mathcal {T}\), then any two leaves in \(\{ x_1,\ldots ,x_{5km}\}\) have a common parent in \(T\) if and only if they have a common parent in the corresponding tree \(T'\in \mathcal {T}'\).

Now define
$$ C^*:=(x_1,x_{1+m},x_{1+2m},\ldots ,x_{1+(5k-1)m}), $$
i.e., \(C^*\) contains \(5k\) leaves and the indices of any two subsequent leaves are \(m\) apart.
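To make the construction of \(C^*\) concrete, the short sketch below lists the indices it selects for given \(k\) and \(m\). The function name `sparse_subchain_indices` and the sample values are illustrative assumptions. The point of the construction is that \(C^*\) has \(5k\) leaves, one more than the at most \(5k-1\) sides guaranteed by Lemma 1, while any two selected indices differ by at least \(m\).

```python
def sparse_subchain_indices(k, m):
    """Indices 1, 1+m, 1+2m, ..., 1+(5k-1)m of the leaves forming C*."""
    return [1 + s * m for s in range(5 * k)]


k, m = 1, 5                       # sample values, chosen only for illustration
indices = sparse_subchain_indices(k, m)
print(indices)                    # 5k indices, all at most 5km

# Any two distinct selected indices are at least m apart.
gaps = [b - a for a, b in zip(indices, indices[1:])]
print(min(gaps))
```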

Let \(G'\) be the generator underlying \(N'\). Each leaf of \(C^*\) is on a certain side of \(G'\). Since \(G'\) has at most \(5k-1\) sides (by Lemma 1) and \(C^*\) contains \(5k\) leaves, there exist two leaves \(x_i,x_j\) of \(C^*\) that are on the same side of \(G'\) by the pigeonhole principle. Assume without loss of generality that \(j>i\). Then, by the construction of \(C^*\), \(j\ge i + m\).

We modify network \(N'\) to a network \(N''\) by moving the whole chain \(C'\) to the side of the network containing \(x_i\) and \(x_j\). To describe this modification more precisely, let \(v_{5km}\) be the parent of \(x_i\) in \(N'\). Then, \(N''\) is the network obtained from \(N'\) by deleting the leaves \(x_1,\ldots ,x_{5km}\), subdividing the edge entering \(v_{5km}\) by a directed path \(v_1,\ldots ,v_{5km-1}\), adding the leaves \(x_1,\ldots ,x_{5km}\) by edges \((v_1,x_1),\ldots ,(v_{5km},x_{5km})\) and cleaning up the resulting directed graph.

For each tree \(T'\in \mathcal {T}'\) in which all of \(x_1,\ldots ,x_{5km}\) have a common parent, Lemma 4 shows that \(N''\) displays \(T'\). There are at least \(q\) such trees. In fact, it follows from the following claim that there are precisely \(q\) such trees. Moreover, the claim shows that in all other trees \(x_i\) and \(x_j\) do not have a common parent. Therefore, it follows from Lemma 5 that these trees are also displayed by \(N''\).

Claim (3). The number of trees of \(\mathcal {T}'\) in which \(x_i\) and \(x_j\) have a common parent is at most \(q\).

To prove the claim, consider \(C^{**}:=(x_i,\ldots ,x_j)\). Since \(C^{**}\) is a subchain of \(C'\), it is a chain of each tree in \(\mathcal {T}'\) by Observation 2.

First consider the case \(q=t-1\) and assume that \(x_i\) and \(x_j\) have a common parent in more than \(q\) trees in \(\mathcal {T}'\) and hence in all trees in \(\mathcal {T}'\). Then, \(x_i,\ldots ,x_j\) all have a common parent in all trees in \(\mathcal {T}'\), by Observation 3. Since \(C\) is a \(q\)-star chain of \(\mathcal {T}\), there are \(q=t-1\) trees in \(\mathcal {T}\) in which all leaves of \(C\) have a common parent. Let \(T^*\) be the only tree in \(\mathcal {T}\) in which the leaves of \(C\) do not all have a common parent. Then \(C\) is not pendant in \(T^*\) or its leaves would form a nontrivial common pendant subtree of \(\mathcal {T}\). Hence, \(x_i\) and \(x_j\) have a common parent in \(T^*\) by Claim (2). However, this is a contradiction because then \(x_i\) and \(x_j\) form a nontrivial common pendant subtree of \(\mathcal {T}\).

To finish the proof of Claim (3), consider the case \(q<t-1\). In that case, \(m>1\) and hence \(C^{**}\) contains only leaves in \(\{x_1,\ldots ,x_{5km-1}\}\). Because \(C^{**}\) contains more than \(m\) leaves, the number of trees of \(\mathcal {T}\) in which all the leaves of \(C^{**}\) have a common parent is at most \(q\) (here we use the fact that there are no common \(q'\)-star chains for \(q'>q\) that have more than \(m\) leaves). Hence, it follows from Claim (1) that the number of trees of \(\mathcal {T}'\) in which the leaves of \(C^{**}\) have a common parent is at most \(q\). Claim (3) then follows by Observation 3.

Hence, we have shown that \(N''\) displays \(\mathcal {T}'\). We now construct a network \(N\) from \(N''\) by replacing the reduced chain by the unreduced chain. More precisely, let \(e_{5km}\) be the edge of \(N''\) that leaves \(v_{5km}\) but is not the edge \((v_{5km},x_{5km})\). Subdivide \(e_{5km}\) by a directed path \((v_{5km+1},\ldots ,v_p)\) and add leaves \(x_{5km+1},\ldots ,x_p\) by edges \((v_{5km+1},x_{5km+1}), \ldots ,(v_p,x_p)\). This gives \(N\). Then, by a similar argument as in the proof of Lemma 5, \(N\) displays \(\mathcal {T}\). Moreover, since none of the applied operations increase the reticulation number, we have \(r(N)\le r(N')\). \(\square \)

The next lemma, whose proof has been omitted, shows that the chain reduction can be performed in polynomial time.

Lemma 3

There exists a polynomial-time algorithm that, given a set \(\mathcal {T}\) of trees on \(X\) and \(q\in \mathbb {N}\), decides if there exists a common \(q\)-star chain of \(\mathcal {T}\) and constructs such a chain of maximum size if one exists.

To see that Algorithm 1 runs in polynomial time, it remains to observe that at least one leaf is removed in each iteration and hence that the number of iterations is bounded by \(|X|\).
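Algorithm 1 itself is not reproduced in this text, so the following is only a sketch of the length thresholds that Lemma 2 implies: a maximal common \(q\)-star chain with more than \((5k)^{t-q}\) leaves is cut back to its first \((5k)^{t-q}\) leaves, and \(q\) is processed in decreasing order from \(t-1\) down to 0. The function names `chain_limit` and `truncate_chain` are hypothetical, and the sketch only selects which leaves are kept and which are deleted; it does not suppress or delete vertices in the trees.

```python
def chain_limit(k, t, q):
    """Length (5k)^(t-q) to which a maximal common q-star chain is cut."""
    if not 0 <= q <= t - 1:
        raise ValueError("Lemma 2 covers q in {0, ..., t-1}")
    return (5 * k) ** (t - q)


def truncate_chain(chain, k, t, q):
    """Split a chain into the leaves that are kept and those that are deleted."""
    limit = chain_limit(k, t, q)
    return chain[:limit], chain[limit:]


k, t = 1, 2                                   # sample parameters
# Thresholds in the order in which q is processed (decreasing q).
limits = {q: chain_limit(k, t, q) for q in range(t - 1, -1, -1)}
print(limits)

chain = [f"x{i}" for i in range(1, 31)]       # a hypothetical chain of 30 leaves
kept, deleted = truncate_chain(chain, k, t, q=0)
print(len(kept), deleted)
```

The thresholds grow as \(q\) decreases, which is why the final kernel size depends on both \(k\) and \(t\).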

Let \((X',\mathcal {T}',k)\) be a kernelized instance of Hybridization Number. If there exists a network \(N'\) displaying \(\mathcal {T}'\) with \(r(N')\le k\) then \(N'\) has at most one leaf per vertex side of the underlying generator (since common pendant subtrees have been reduced) and at most \((5k)^{|\mathcal {T}|}\) leaves per edge side (since common chains have been reduced). Hence,
$$ |X'| \le k + (4k-1)(5k)^{|\mathcal {T}|} \le 4k(5k)^{|\mathcal {T}|}. $$
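The second inequality only uses \(k\le (5k)^{|\mathcal {T}|}\), which holds because \(k\ge 1\). Written out as a worked step:
$$ k + (4k-1)(5k)^{|\mathcal {T}|} = 4k(5k)^{|\mathcal {T}|} - \left( (5k)^{|\mathcal {T}|} - k\right) \le 4k(5k)^{|\mathcal {T}|}. $$
For instance, with the sample values \(k=2\) and \(|\mathcal {T}|=2\), the left-hand side is \(2+7\cdot 100=702\) and the right-hand side is \(8\cdot 100=800\).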
Correctness of the following theorem now follows from Lemmas 2–5.

Theorem 1

The problem Hybridization Number on \(|\mathcal {T}|=t\) trees admits a kernel with at most \(4k(5k)^t\) leaves.

4 A Polynomial Kernel for Bounded Outdegrees

Algorithm 2 describes a polynomial kernel for Hybridization Number for the case in which the maximum outdegree of the input trees, rather than their number, is bounded. Let \(\varDelta ^+\) be the maximum outdegree over all vertices of all trees in \(\mathcal {T}\).

The proof of the following theorem follows the same ideas as the proofs of Lemmas 2–5 and has been omitted.

Theorem 2

The problem Hybridization Number on trees with maximum outdegree \(\varDelta ^+\) admits a kernel with at most \(20k^2(\varDelta ^+-1)\) leaves.
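To compare the two bounds numerically, the sketch below evaluates the leaf bounds of Theorems 1 and 2 for sample parameters. The function names and the sample values are assumptions made for illustration; the formulas are the ones stated in the theorems.

```python
def kernel_bound_trees(k, t):
    """Leaf bound 4k(5k)^t of Theorem 1 for t input trees."""
    return 4 * k * (5 * k) ** t


def kernel_bound_outdegree(k, max_outdegree):
    """Leaf bound 20k^2(Delta^+ - 1) of Theorem 2."""
    return 20 * k ** 2 * (max_outdegree - 1)


k = 2                                   # sample parameter value
print(kernel_bound_trees(k, t=3))
print(kernel_bound_outdegree(k, max_outdegree=3))
# For binary trees (maximum outdegree 2) the second bound is 20k^2.
print(kernel_bound_outdegree(k, max_outdegree=2))
```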

5 Discussion and Open Problems

The main open question remains whether Hybridization Number has a polynomial kernel for an unbounded number of nonbinary trees with unbounded outdegrees. A related question is whether this problem is fixed-parameter tractable.

Note that when the input trees are not required to have the same label set \(X\), Hybridization Number is not fixed-parameter tractable unless P = NP. The reason for this is that it is NP-hard to decide if \(r(\mathcal {T})=1\) for sets \(\mathcal {T}\) consisting of rooted phylogenetic trees with three leaves each [10, Theorem 7].

Another question is whether the kernel size can be reduced for certain fixed \(|\mathcal {T}|\). For \(|\mathcal {T}|=2\), our results give a cubic kernel, while Linz and Semple [14] showed a linear kernel of a modified, weighted problem, by analyzing carefully how common chains can look in two trees. Can something like this be done for more than two trees? In particular, does there exist a quadratic kernel for three trees?
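As a quick check of the exponents involved, substituting \(t=2\) and \(t=3\) into the bound \(4k(5k)^t\) of Theorem 1 gives
$$ 4k(5k)^2 = 100k^3, \qquad 4k(5k)^3 = 500k^4, $$
so the bound of Theorem 1 is cubic in \(k\) for two trees and quartic in \(k\) for three trees.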

Finally, there is the problem of solving the kernelized instances. For this, a fast exponential-time exact algorithm is needed (or a good heuristic). However, it is not known if there exists an \(O(c^n)\)-algorithm for Hybridization Number for any constant \(c\) and \(n=|X|\), even for three binary trees. A related, but possibly more ambitious goal would be an \(O(c^kn^{O(1)})\)-algorithm for the same problem. Note that such algorithms do exist for the case \(|\mathcal {T}|=2\) [21].

References

  1. Bapteste, E., van Iersel, L., Janke, A., Kelchner, S., Kelk, S., McInerney, J.O., Morrison, D.A., Nakhleh, L., Steel, M., Stougie, L., Whitfield, J.: Networks: expanding evolutionary thinking. Trends Genet. 29(8), 439–441 (2013)
  2. Baroni, M., Grünewald, S., Moulton, V., Semple, C.: Bounding the number of hybridisation events for a consistent evolutionary history. J. Math. Biol. 51, 171–182 (2005)
  3. Bodlaender, H.L., Downey, R.G., Fellows, M.R., Hermelin, D.: On problems without polynomial kernels. J. Comput. Syst. Sci. 75(8), 423–434 (2009)
  4. Bordewich, M., Semple, C.: Computing the hybridization number of two phylogenetic trees is fixed-parameter tractable. IEEE/ACM Trans. Comput. Biol. Bioinf. 4(3), 458–466 (2007)
  5. Bordewich, M., Semple, C.: Computing the minimum number of hybridization events for a consistent evolutionary history. Discrete Appl. Math. 155(8), 914–928 (2007)
  6. Chen, Z.-Z., Wang, L.: Algorithms for reticulate networks of multiple phylogenetic trees. IEEE/ACM Trans. Comput. Biol. Bioinf. 9(2), 372–384 (2012)
  7. Chen, Z.-Z., Wang, L.: An ultrafast tool for minimum reticulate networks. J. Comput. Biol. 20(1), 38–41 (2013)
  8. Downey, R.G., Fellows, M.R.: Parameterized Complexity. Springer, New York (1999)
  9. Huson, D.H., Rupp, R., Scornavacca, C.: Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press, Cambridge (2011)
  10. Jansson, J., Nguyen, N.B., Sung, W.-K.: Algorithms for combining rooted triplets into a galled phylogenetic network. SIAM J. Comput. 35(5), 1098–1121 (2006)
  11. Kelk, S., Scornavacca, C.: Towards the fixed parameter tractability of constructing minimal phylogenetic networks from arbitrary sets of nonbinary trees (2012). arXiv:1207.7034 [q-bio.PE]
  12. Kelk, S., Scornavacca, C.: Constructing minimal phylogenetic networks from softwired clusters is fixed parameter tractable. Algorithmica 68(4), 886–915 (2014)
  13. Kelk, S., van Iersel, L., Lekić, N., Linz, S., Scornavacca, C., Stougie, L.: Cycle killer... qu’est-ce que c’est? On the comparative approximability of hybridization number and directed feedback vertex set. SIAM J. Discrete Math. 26(4), 1635–1656 (2012)
  14. Linz, S., Semple, C.: Hybridization in non-binary trees. IEEE/ACM Trans. Comput. Biol. Bioinf. 6(1), 30–45 (2009)
  15. Morrison, D.: Introduction to Phylogenetic Networks. RJR Productions, Uppsala (2011)
  16. Nakhleh, L., Ringe, D., Warnow, T.: Perfect phylogenetic networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81(2), 382–420 (2005)
  17. Piovesan, T., Kelk, S.: A simple fixed parameter tractable algorithm for computing the hybridization number of two (not necessarily binary) trees. IEEE/ACM Trans. Comput. Biol. Bioinf. 10(1), 18–25 (2013)
  18. Semple, C., Steel, M.: Phylogenetics. Oxford University Press, Oxford (2003)
  19. van Iersel, L., Kelk, S., Lekić, N., Stougie, L.: Approximation algorithms for nonbinary agreement forests. SIAM J. Discrete Math. 28(1), 49–66 (2014)
  20. van Iersel, L., Linz, S.: A quadratic kernel for computing the hybridization number of multiple trees. Inf. Process. Lett. 113(9), 318–323 (2013)
  21. Whidden, C., Beiko, R.G., Zeh, N.: Fixed-parameter algorithms for maximum agreement forests. SIAM J. Comput. 42(4), 1431–1466 (2013)
  22. Wu, Y.: Close lower and upper bounds for the minimum reticulate network of multiple phylogenetic trees. Bioinformatics 26, i140–i148 (2010)
  23. Wu, Y.: An algorithm for constructing parsimonious hybridization networks with multiple phylogenetic trees. J. Comput. Biol. 20(10), 792–804 (2013)

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Centrum Wiskunde and Informatica (CWI), Amsterdam, The Netherlands
  2. Department of Knowledge Engineering (DKE), Maastricht University, Maastricht, The Netherlands
