New Reduction Rules for the Tree Bisection and Reconnection Distance

Kelk, Steven; Linz, Simone

doi:10.1007/s00026-020-00502-7

Open access
Published: 01 July 2020

Volume 24, pages 475–502, (2020)

Annals of Combinatorics

Abstract

Recently it was shown that, if the subtree and chain reduction rules have been applied exhaustively to two unrooted phylogenetic trees, the reduced trees will have at most $15k-9$ taxa where k is the TBR (Tree Bisection and Reconnection) distance between the two trees, and that this bound is tight. Here, we propose five new reduction rules and show that these further reduce the bound to $11k-9$. The new rules combine the “unrooted generator” approach introduced in Kelk and Linz (SIAM J Discrete Math 33(3):1556–1574, 2019) with a careful analysis of agreement forests to identify (i) situations when chains of length 3 can be further shortened without reducing the TBR distance, and (ii) situations when small subtrees can be identified whose deletion is guaranteed to reduce the TBR distance by 1. To the best of our knowledge these are the first reduction rules that strictly enhance the reductive power of the subtree and chain reduction rules.

1 Introduction

A phylogenetic tree is a tree whose leaves are bijectively labelled by a set of species (or, more generically, a set of taxa) X [13]. These trees are ubiquitous in the systematic study of evolution: the leaves represent contemporary species and the internal vertices of the tree represent hypothetical common ancestors. Over the years many techniques have been developed for inferring phylogenetic trees from (incomplete) biological data and under a range of different objective functions [7]. Here we are not concerned with inferring phylogenetic trees, but rather with quantifying the “distance” between two phylogenetic trees. Such a goal is well-motivated, since different methodologies for inferring phylogenetic trees sometimes yield trees with differing topologies, and reticulate evolutionary phenomena such as hybridization can cause different genes in the same genome to have different evolutionary histories [11].

We focus on the Tree Bisection and Reconnection (TBR) distance, which is NP-hard to compute [1, 10]. Informally, the TBR distance between two trees T and $T'$, denoted $d_\mathrm{TBR}(T,T')$, is the minimum number of topological rearrangement moves that need to be applied to transform T into $T'$, where such a move involves detaching a subtree and attaching it elsewhere. It was proven in 2001 [1] that the question “Is $d_\mathrm{TBR}(T,T') \le k?$” can be answered in time $f(k) \cdot \text {poly}(|X|)$, where f is a computable function that depends only on k. In other words: the problem is fixed-parameter tractable [6]. Specifically, the authors proved that the two polynomial-time subtree and chain reduction rules preserve the TBR distance and reduce the number of taxa to at most $28 \cdot d_\mathrm{TBR}(T,T')$ for any two unrooted phylogenetic trees T and $T'$. The reduced instance, known as a kernel, can then be solved with any exact algorithm, yielding the $f(k) \cdot \text {poly}(|X|)$ running time [9]. The analysis in [1] made heavy use of a powerful abstraction known as an agreement forest, which in a nutshell partitions the two trees into smaller, non-overlapping fragments that have the same topology in both trees (see e.g. [14, 16] for overviews of algorithmic results and [2] for extremal results). Via a different technique, based on bounded search trees, running times of $O( 4^k \cdot \text {poly}(|X|) )$ [16] and then $O( 3^k \cdot \text {poly}(|X|) )$ were later obtained [5]. However, the question remained whether a kernel with fewer than $28 \cdot d_\mathrm{TBR}(T,T')$ taxa could be obtained.

Recently, in [12], it was shown that the subtree and chain reduction rules actually reduce the instance to size $15\cdot d_\mathrm{TBR}(T,T')-9$, and that there are instances for which this bound is tight. Interestingly, the sharpened analysis does not leverage agreement forests at all, but instead recasts the computation of TBR distance as a phylogenetic network inference problem, where phylogenetic networks are essentially the generalization of phylogenetic trees to graphs. Namely, the TBR distance of T and $T'$ is equal to the minimum value of $|E|-(|V|-1)$, ranging over all phylogenetic networks $N=(V,E)$ that embed T and $T'$ [15]. The backbone topology of such minimal networks can be represented by unrooted generators, and when viewed from this static perspective it becomes much easier to analyze the role of common chains in the trees.

In the present article, we combine the agreement forest perspective of [1] with the network/generator perspective of [12], to obtain a new suite of five polynomial-time reduction rules. When applied alongside the subtree and chain reduction rules, these reduce the size of the kernel to $11\cdot d_\mathrm{TBR}(T,T')-9$. To leverage agreement forests, we first prove a general theorem which states that, given any disjoint set of common chains, there exists an optimal agreement forest in which all the chains are simultaneously preserved, i.e. none of the chains are divided across two or more components of the forest. Crucially, this also holds for chains containing only 2 taxa, as long as the two taxa in the chain have a common parent in at least one of the trees. Such very short chains have not received a lot of attention in the literature, since the standard chain reduction rule, which truncates long common chains to length 3, does not always preserve the TBR distance if the chains are truncated to length 2. Nevertheless, the weaker “simultaneous preservation” property that we prove in this article (see Theorem 5), and which we believe to be of independent interest, turns out to be quite powerful when combined with networks/generators. The fact that chains are preserved allows us to determine specific situations when it is actually safe to reduce a chain to length 2 (and sometimes to length 1), or even to identify an entire component of an optimal agreement forest (which can then be deleted, reducing the TBR distance by exactly 1). These insights directly inspire the new reduction rules presented in this article. To the best of our knowledge these new reduction rules are the first reduction rules for a phylogenetic distance problem which strictly improve upon the reductive power of the subtree and chain reduction rules. Other reduction rules, such as the cluster reduction [3, 4], tend to be very effective in practice, but do not yield improved (i.e. smaller) bounds on kernel size [12].

After presenting the main results, we show a family of tight examples i.e. tree pairs that, after applying all seven reduction rules, have exactly $11\cdot d_\mathrm{TBR}(T,T')-9$ taxa. We then conclude with a short reflection on potential avenues for further improving the $11\cdot d_\mathrm{TBR}(T,T')-9$ bound, and discuss a number of insights flowing from our analysis which might be useful when considering non-kernelization approaches for computing the TBR distance.

2 Preliminaries

Throughout this paper, X denotes a finite set (of taxa) with $|X|\ge 4$.

Unrooted phylogenetic trees and networks: An unrooted binary phylogenetic network N on X is a simple, connected, and undirected graph whose leaves are bijectively labeled with X and whose other vertices all have degree 3. Let E and V be the edge and vertex set of N, respectively. We define the reticulation number of N as the number of edges in E that need to be deleted from N to obtain a spanning tree. More formally, we have $r(N) = |E|-(|V|-1)$. If $r(N)=0$, then N is called an unrooted binary phylogenetic tree on X. An example of two unrooted binary phylogenetic trees is shown in Fig. 1. Now, let T be an unrooted binary phylogenetic tree on X. Two leaves, say a and b, of T are called a cherry $\{a,b\}$ of T if they are adjacent to a common vertex. We say that a vertex v is the (unique) parent of a leaf a in N if v is adjacent to a. For $X' \subset X$, we write $T[X']$ to denote the unique, minimal subtree of T that connects all elements in $X'$. For brevity we call $T[X']$ the embedding of $X'$ in T. If $X''$ is also a subset of X, we denote by $T[X']\cap T[X'']$ the set of vertices in T that are contained in $T[X']$ and $T[X'']$. Furthermore, we refer to the unrooted phylogenetic tree on $X'$ obtained from $T[X']$ by suppressing degree-2 vertices as the restriction of T to $X'$ and we denote this by $T|X'$.
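As an informal illustration of these definitions, the following Python sketch computes the reticulation number $r(N)=|E|-(|V|-1)$ from an edge list and lists the cherries of a tree. The edge-list representation, the function names, and the two small example graphs (with internal vertices named u, v, w, s, t) are assumptions made purely for illustration; they do not appear in the original development.

```python
from collections import defaultdict

def reticulation_number(edges):
    """r(N) = |E| - (|V| - 1) for a connected graph given as an edge list."""
    vertices = {x for edge in edges for x in edge}
    return len(edges) - (len(vertices) - 1)

def cherries(edges):
    """Pairs of leaves {a, b} that are adjacent to a common vertex."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    leaves = {x for x in adj if len(adj[x]) == 1}
    found = set()
    for x in adj:
        attached = sorted(adj[x] & leaves)
        if x not in leaves and len(attached) == 2:
            found.add(frozenset(attached))
    return found

# Hypothetical caterpillar tree on the taxa a, b, c, d, e.
tree = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"),
        ("v", "w"), ("d", "w"), ("e", "w")]

# Hypothetical network on the taxa a, b, c, d: the quartet tree ab|cd with
# two pendant edges subdivided (by s and t) and joined by the edge {s, t}.
network = [("a", "s"), ("s", "u"), ("b", "u"), ("u", "v"),
           ("c", "t"), ("t", "v"), ("d", "v"), ("s", "t")]

print(reticulation_number(tree), reticulation_number(network))
print(cherries(tree))
```

In this sketch the tree has reticulation number 0, while the network needs exactly one edge deleted to become a spanning tree.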

Let T be an unrooted binary phylogenetic tree on X. A quartet is an unrooted binary phylogenetic tree with exactly four leaves. For example, if $\{a,b,c,d\}\subseteq X$, we say that ab|cd is a quartet of T if the path from a to b does not intersect the path from c to d. Note that, if ab|cd is not a quartet of T, then either ac|bd or ad|bc is a quartet of T.
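The path-based definition of a quartet translates directly into a small test. The sketch below is only an illustration under assumed conventions: trees are edge lists, the helper names `leaf_path` and `quartet` are hypothetical, and the example tree is a made-up caterpillar on five taxa.

```python
from collections import deque

def leaf_path(adj, start, goal):
    """Vertex set of the unique path between two vertices of a tree."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in previous:
                previous[w] = v
                queue.append(w)
    vertices = set()
    v = goal
    while v is not None:
        vertices.add(v)
        v = previous[v]
    return vertices

def quartet(edges, a, b, c, d):
    """Return the quartet of the tree on {a, b, c, d}, e.g. 'ab|cd'."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    pairings = (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c)))
    for (p, q), (r, s) in pairings:
        # pq|rs is a quartet if the p-q path avoids the r-s path.
        if not leaf_path(adj, p, q) & leaf_path(adj, r, s):
            return f"{p}{q}|{r}{s}"
    raise ValueError("no quartet found; is the input a binary tree?")

# Hypothetical caterpillar tree on the taxa a, b, c, d, e.
tree = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"),
        ("v", "w"), ("d", "w"), ("e", "w")]
print(quartet(tree, "a", "b", "c", "d"))
```

Exactly one of the three pairings succeeds for four distinct leaves of a binary tree, which matches the remark that if ab|cd is not a quartet then ac|bd or ad|bc is.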

Finally, let N be an unrooted binary phylogenetic network on X and let T be an unrooted binary phylogenetic tree on X. We say that N displays T, if T can be obtained from a subtree of N by suppressing degree-2 vertices.

Tree bisection and reconnection: Let T be an unrooted binary phylogenetic tree on X. Apply the following three-step operation to T:

1. Delete an edge in T and suppress any resulting degree-2 vertex. Let $T_1$ and $T_2$ be the two resulting unrooted binary phylogenetic trees.
2. If $T_1$ (resp. $T_2$) has at least one edge, subdivide an edge in $T_1$ (resp. $T_2$) with a new vertex $v_1$ (resp. $v_2$) and otherwise set $v_1$ (resp. $v_2$) to be the single isolated vertex of $T_1$ (resp. $T_2$).
3. Add a new edge $\{v_1,v_2\}$ to obtain a new unrooted binary phylogenetic tree $T'$ on X.

We say that $T'$ has been obtained from T by a single tree bisection and reconnection (TBR) operation. An example of a TBR operation is illustrated in Fig. 2. Furthermore, we define the TBR distance between two unrooted binary phylogenetic trees T and $T'$ on X, denoted by $d_\mathrm{TBR}(T,T')$, to be the minimum number of TBR operations that is required to transform T into $T'$. It is well known that $d_\mathrm{TBR}$ is a metric [1]. By building on an earlier result by Hein et al. [10, Theorem 8], Allen and Steel [1] showed that computing the TBR distance is an NP-hard problem.
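The three steps of a TBR operation can be sketched in a few lines of Python. This is a minimal illustration rather than an optimized implementation, and it rests on several assumptions: trees are edge lists, the function name `tbr_move` is hypothetical, the new vertices are literally named "v1" and "v2" (assumed not to clash with existing names), and the caller is trusted to choose attachment points in the two different components produced by the deletion.

```python
def tbr_move(edges, cut, attach1, attach2):
    """Apply one TBR operation.

    cut      -- the edge of the tree to delete
    attach1  -- an edge (2-tuple) of one component to subdivide, or a single
                vertex name if that component is an isolated vertex
    attach2  -- likewise for the other component
    Returns the new tree as a set of frozenset edges.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Step 1: delete the edge and suppress any resulting degree-2 vertex.
    u, v = cut
    adj[u].discard(v)
    adj[v].discard(u)
    for x in (u, v):
        if len(adj[x]) == 2:
            p, q = adj.pop(x)
            adj[p].discard(x)
            adj[q].discard(x)
            adj[p].add(q)
            adj[q].add(p)

    # Step 2: subdivide one edge per component (or reuse an isolated vertex).
    ends = []
    for name, where in (("v1", attach1), ("v2", attach2)):
        if isinstance(where, tuple):
            p, q = where
            assert q in adj[p], "attachment edge must exist after step 1"
            adj[p].discard(q)
            adj[q].discard(p)
            adj[name] = {p, q}
            adj[p].add(name)
            adj[q].add(name)
            ends.append(name)
        else:
            ends.append(where)

    # Step 3: join the two attachment vertices by a new edge.
    adj[ends[0]].add(ends[1])
    adj[ends[1]].add(ends[0])
    return {frozenset((a, b)) for a in adj for b in adj[a]}

# Hypothetical example: the quartet tree ab|cd, with leaf b moved next to c.
quartet_tree = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]
moved = tbr_move(quartet_tree, ("b", "u"), "b", ("c", "v"))
print(sorted(sorted(edge) for edge in moved))
```

In the example, deleting the edge at b isolates b and suppresses u, and the reattachment creates the cherry {b, c}, so the result is the quartet bc|ad.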

Unrooted minimum hybridization: In [15], it was shown that computing the TBR distance for a pair of unrooted binary phylogenetic trees T and $T'$ on X is equivalent to computing the minimum number of extra edges required to simultaneously explain T and $T'$. More precisely, we set

$$\begin{aligned} h^u(T, T') = \min _ N\{r(N)\}, \end{aligned}$$

where the minimum is taken over all unrooted binary phylogenetic networks N on X that display T and $T'$. The value $h^u(T, T')$ is known as the (unrooted) hybridization number of T and $T'$ [15].

The aforementioned equivalence is given in the next theorem that was established in [15, Theorem 3].

Theorem 1

Let T and $T'$ be two unrooted binary phylogenetic trees on X. Then

$$\begin{aligned} d_\mathrm{TBR}(T,T')=h^u(T,T'). \end{aligned}$$

Subtrees and chains: Let N be an unrooted binary phylogenetic network on X. A pendant subtree of N is an unrooted binary phylogenetic tree on a proper subset of X that can be obtained from N by deleting a single edge. For $n\ge 1$, let $C=(\ell _1,\ell _2,\ldots ,\ell _n)$ be a sequence of distinct taxa in X and, for each $i\in \{1,2,\ldots ,n\}$, let $p_i$ denote the unique parent of $\ell _i$ in N. We call C an n-chain of N, where n is referred to as the length of C, if there exists a walk $p_1,p_2,\ldots ,p_n$ in N and the elements in $\{p_2,p_3,\ldots ,p_{n-1}\}$ are all pairwise distinct. Hence $\ell _1$ and $\ell _2$ may have a common parent (i.e. $p_1 = p_2$) and/or $\ell _{n-1}$ and $\ell _{n}$ may have a common parent in N (i.e. $p_{n-1} = p_n$). Note that, if one of $p_1 = p_2$ and $p_{n-1}=p_n$ holds, then C is pendant in N. Since we require that X contains at least four elements, note that a 3-chain of N cannot consist of three leaves that all have the same parent. Furthermore, by definition, each element in X is a chain of length 1 in N. To ease reading, we sometimes write C to denote the set $\{\ell _1,\ell _2,\ldots ,\ell _n\}$. It will always be clear from the context whether C refers to the associated sequence or set of leaves.

If a pendant subtree S (resp. chain C) exists in two unrooted binary phylogenetic trees T and $T'$, we say that S (resp. C) is a common subtree (resp. chain) of T and $T'$. To illustrate, the 3-chain (b, c, d) is a common chain of the two phylogenetic trees shown in Fig. 1. Moreover, if C is a common n-chain of T and $T'$, reducing C to a chain of length k with $1\le k< n$ yields the two new trees $T_r = T|X {\setminus } \{ \ell _{k+1}, \ldots , \ell _{n} \}$ and $T'_r = T'|X {\setminus } \{ \ell _{k+1}, \ldots , \ell _{n} \}$.
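Reducing a chain is an instance of the restriction operator $T|X'$: delete the unwanted taxa, keep the minimal connecting subtree, and suppress degree-2 vertices. The following Python sketch illustrates this under assumed conventions (edge-list trees, hypothetical function names `restrict` and `reduce_chain`, and a made-up caterpillar tree whose internal vertices are called p1 to p4, assumed distinct from all taxon names).

```python
def restrict(edges, keep):
    """Edge set of T|keep: the embedding T[keep] with degree-2 vertices suppressed."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Repeatedly prune leaves outside `keep`; what remains is T[keep].
    changed = True
    while changed:
        changed = False
        for x in list(adj):
            if len(adj[x]) <= 1 and x not in keep:
                for y in adj.pop(x):
                    adj[y].discard(x)
                changed = True
    # Suppress the remaining degree-2 vertices.
    for x in list(adj):
        if len(adj[x]) == 2 and x not in keep:
            p, q = adj.pop(x)
            adj[p].discard(x)
            adj[q].discard(x)
            adj[p].add(q)
            adj[q].add(p)
    return {frozenset((a, b)) for a in adj for b in adj[a]}

def reduce_chain(edges, taxa, chain, k):
    """Reduce the chain (l_1, ..., l_n) to length k by deleting l_{k+1}, ..., l_n."""
    return restrict(edges, set(taxa) - set(chain[k:]))

# Hypothetical caterpillar on a, ..., f in which (b, c, d, e) is a 4-chain.
tree = [("a", "p1"), ("b", "p1"), ("p1", "p2"), ("c", "p2"), ("p2", "p3"),
        ("d", "p3"), ("p3", "p4"), ("e", "p4"), ("f", "p4")]
reduced = reduce_chain(tree, "abcdef", ("b", "c", "d", "e"), 3)
print(sorted(sorted(edge) for edge in reduced))
```

Applying the same call to both trees of a pair yields $T_r$ and $T'_r$ in the notation above.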

We will later make use of the following simple, but fundamental, observation, which holds due to the definition of an n-chain and the fact that we are working with unrooted binary trees.

Observation 2

Let T be an unrooted binary phylogenetic tree on X, and let $\{C_1,C_2,\ldots ,C_m\}$ be a set of mutually taxa-disjoint chains of T. Then the embeddings

$$\begin{aligned} \{T[C_1],T[C_2],\ldots ,T[C_m]\} \end{aligned}$$

are mutually vertex disjoint in T.

For two unrooted binary phylogenetic trees T and $T'$, we next state two reduction rules that were first introduced in [1] and were crucial in establishing the fixed-parameter tractability of computing the TBR distance (see Sect. 3 for details).

Subtree reduction: Replace a maximal pendant subtree with at least two leaves that is common to T and $T'$ by a single leaf with a new label.

Chain reduction: Reduce a maximal n-chain with $n\ge 4$ that is common to T and $T'$ to a chain of length three.

The next lemma shows that the subtree and chain reduction are both TBR-preserving, i.e. applying one of the two reductions to T and $T'$ results in a pair of new trees whose TBR distance is equal to $d_\mathrm{TBR}(T,T')$.

Lemma 1

Let T and $T'$ be two unrooted binary phylogenetic trees on X, and let $T_r$ and $T_r'$ be two trees obtained from T and $T'$, respectively, by applying a single subtree or chain reduction. Then $d_\mathrm{TBR}(T,T')=d_\mathrm{TBR}(T_r,T_r')$.

Finally, if T and $T'$ cannot be reduced any further under the subtree (resp. chain) reduction, we say that T and $T'$ are subtree (resp. chain) reduced.

Agreement forests: Let T and $T'$ be two unrooted binary phylogenetic trees on X. Furthermore, let $F = \{B_0, B_1,B_2,\ldots ,B_k\}$ be a partition of X, where each block $B_i$ with $i\in \{0,1,2,\ldots ,k\}$ is referred to as a component of F. We say that F is an agreement forest for T and $T'$ if the following hold.

1. For each $i\in \{0,1,2,\ldots ,k\}$, we have $T|B_i = T'|B_i$.
2. For each pair $i,j\in \{0,1,2,\ldots ,k\}$ with $i \ne j$, we have that $T[B_i]$ and $T[B_j]$ are vertex-disjoint in T, and $T'[B_i]$ and $T'[B_j]$ are vertex-disjoint in $T'$.

Let $F=\{B_0,B_1,B_2,\ldots ,B_k\}$ be an agreement forest for T and $T'$. The size of F is simply its number of components; i.e. $k+1$. Moreover, an agreement forest with the minimum number of components (over all agreement forests for T and $T'$) is called a maximum agreement forest for T and $T'$. The number of components of a maximum agreement forest for T and $T'$ is denoted by $d_\mathrm{MAF}(T,T')$. The following theorem is well known.

Theorem 3

[1, Theorem 2.13] Let T and $T'$ be two unrooted binary phylogenetic trees on X. Then

$$\begin{aligned} d_\mathrm{TBR}(T,T') = d_\mathrm{MAF}(T,T') - 1. \end{aligned}$$
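The two conditions in the definition of an agreement forest can be checked mechanically. The Python sketch below is an illustration under stated assumptions, not a method taken from the article: trees are edge lists, all function names are hypothetical, the equality $T|B_i = T'|B_i$ is tested by comparing the leaf bipartitions (splits) that the edges induce on $B_i$, which is a standard characterization of equality for phylogenetic trees, and the caller is assumed to supply a genuine partition of X.

```python
from itertools import combinations

def adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def side(adj, start, blocked):
    """Vertices reachable from `start` without using the edge {start, blocked}."""
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen and not (x == start and y == blocked):
                seen.add(y)
                stack.append(y)
    return seen

def splits(edges, block):
    """Bipartitions of `block` induced by the edges of the tree (splits of T|block)."""
    adj = adjacency(edges)
    result = set()
    for u, v in edges:
        left = frozenset(side(adj, u, v) & block)
        right = frozenset(block - left)
        if left and right:
            result.add(frozenset((left, right)))
    return result

def embedding(edges, block):
    """Vertex set of T[block]: the taxa plus every edge separating two of them."""
    adj = adjacency(edges)
    vertices = set(block)
    for u, v in edges:
        if side(adj, u, v) & block and side(adj, v, u) & block:
            vertices |= {u, v}
    return vertices

def is_agreement_forest(tree1, tree2, forest):
    blocks = [set(b) for b in forest]
    same_topology = all(splits(tree1, b) == splits(tree2, b) for b in blocks)
    disjoint = all(not (embedding(t, b1) & embedding(t, b2))
                   for t in (tree1, tree2)
                   for b1, b2 in combinations(blocks, 2))
    return same_topology and disjoint

# Hypothetical example: the quartet trees ab|cd and ac|bd.
t1 = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]
t2 = [("a", "u"), ("c", "u"), ("u", "v"), ("b", "v"), ("d", "v")]
print(is_agreement_forest(t1, t2, [{"a", "b", "c"}, {"d"}]))
```

For this pair, the two-component forest {{a, b, c}, {d}} is an agreement forest, whereas the single block {a, b, c, d} is not because the trees differ. Hence a maximum agreement forest has two components, and Theorem 3 gives a TBR distance of 1.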

Again, let F be an agreement forest for two unrooted binary phylogenetic trees T and $T'$ on X, and let $C=(\ell _1,\ell _2,\ldots ,\ell _n)$ be a common n-chain of T and $T'$. We say that C is split in F if there exist (at least) two components, say $B_j$ and $B_{j'}$, with $j\ne j'$ such that $B_j\cap C\ne \emptyset $ and $B_{j'}\cap C\ne \emptyset $. Furthermore, we say that C is atomized in F if each taxon $\ell _i \in C$ with $i\in \{1,2,\ldots ,n\}$ occurs as a singleton component $\{\ell _i\}$ in F. Finally, we say that C is preserved in F if there exists a component $B_j \in F$ such that $C \subseteq B_j$. Taking the last three definitions together we have that C is split in F if and only if it is not preserved in F.

3 A New Suite of Reduction Rules

In 2001, Allen and Steel [1] showed that computing the TBR distance between two unrooted binary phylogenetic trees T and $T'$ is fixed-parameter tractable, when parameterized by $k=d_\mathrm{TBR}(T,T')$. To this end, they established a linear kernel of size at most 28k. Recently, this result was improved by Kelk and Linz [12] who showed with a new analysis that the following superior bound actually holds.

Theorem 4

Let T and $T'$ be two subtree and chain reduced unrooted binary phylogenetic trees on X. If $d_\mathrm{TBR}(T,T')\ge 2$, then $|X|\le 15 d_\mathrm{TBR}(T,T')-9$.

Noting that the subtree and chain reduction can be applied in time that is polynomial in the size of X, it immediately follows that computing the TBR distance is fixed-parameter tractable when parameterized by k. In what follows, we will develop five new reduction rules that complement the subtree and chain reduction as introduced by Allen and Steel [1] and further improve the $15\cdot d_\mathrm{TBR}(T,T') -9$ bound.

Although not directly addressed by [1], reducing a chain of length 3 to length 2 can strictly lower the TBR distance, which is why their chain reduction only allows reductions to length 3. An explicit example of this phenomenon is given by the two phylogenetic trees shown in Fig. 1. The TBR distance of these two trees is 2. However, if we reduce the common 3-chain (b, c, d) to the 2-chain (b, c), we obtain the two quartets ab|ce and eb|ca whose TBR distance is 1. Nevertheless, as we shall see, there do exist special circumstances when common 3-chains can be further reduced without altering the TBR distance. This is an important building block of our new reduction rules which ultimately will yield a kernel of size at most $11\cdot d_\mathrm{TBR}(T,T')-9$. To obtain this bound, we combine the generator approach introduced in [12] with a careful analysis of agreement forests.

The next theorem, which is formally established in the appendix to avoid disrupting the flow of the paper, will repeatedly be used in establishing our new kernel result.

Theorem 5

Let T and $T'$ be two unrooted binary phylogenetic trees on X. Let K be an (arbitrary) set of mutually taxa-disjoint chains that are common to T and $T'$. Then there exists a maximum agreement forest $F'$ of T and $T'$ such that

1. every n-chain in K with $n\ge 3$ is preserved in $F'$, and
2. every 2-chain in K that is pendant in at least one of T and $T'$ is preserved in $F'$.

Theorem 5 is somewhat more general than we need in this article—for us, $|K|\le 2$ is sufficient—but we include full details because we consider it to be of independent interest and anticipate future applications beyond this article. Furthermore, we remark that a weaker version of Theorem 5 was already presented in [1], where it forms the foundation of the proof of Lemma 1. However, the authors of [1] did not prove any results about chains of length 2, and their proof mainly focuses on the case of a single common chain of length 3 that is pendant in neither of the two trees.

Throughout the next four subsections, we detail five new reduction rules. We will see that, similar to Lemma 1, each of these new reductions either preserves the TBR distance or reduces the parameter by 1. To simplify the exposition, we assume that the new reductions are always applied to two unrooted binary phylogenetic trees that are subtree and chain reduced. Finally, while the reduction names appear to be cryptic, they will be further explained in Sect. 4, where we tie the new reductions and a careful unrooted generator analysis together to establish an improved kernel result. A generic example for each of the five new reductions is shown in Fig. 3.

3.1 $(*,3,*)$-Reduction

Let T and $T'$ be two unrooted binary phylogenetic trees on X, and let $C = (a,b,c)$ be a 3-chain that is common to T and $T'$. For example, C may be the result of a previously applied chain reduction. If T has cherry $\{b,c\}$ and $T'$ has cherry $\{a,b\}$, then a $(*,3,*)$-reduction is the operation of deleting a, b, and c from T and $T'$. Formally, we set $T_r = T|X {\setminus } C$ and $T'_r = T'|X {\setminus } C$.
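Testing whether this reduction applies to a given triple only requires looking at the parents of a, b, and c. The following Python sketch is a hedged illustration: trees are assumed to be edge lists, the function names are hypothetical, and `is_chain` mirrors the chain definition from Sect. 2 literally. The two example trees are made up for this purpose.

```python
def adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def parent(adj, leaf):
    """The unique vertex adjacent to a leaf."""
    (p,) = adj[leaf]
    return p

def is_chain(adj, chain):
    """Parents form a walk p_1, ..., p_n whose interior vertices are pairwise distinct."""
    parents = [parent(adj, x) for x in chain]
    walk = all(p == q or q in adj[p] for p, q in zip(parents, parents[1:]))
    interior = parents[1:-1]
    return walk and len(set(interior)) == len(interior)

def star_3_star_applicable(tree1, tree2, chain):
    """True if (a, b, c) is a common 3-chain, {b, c} is a cherry of the
    first tree and {a, b} is a cherry of the second tree."""
    a, b, c = chain
    adj1, adj2 = adjacency(tree1), adjacency(tree2)
    common = is_chain(adj1, chain) and is_chain(adj2, chain)
    cherry_bc = parent(adj1, b) == parent(adj1, c)
    cherry_ab = parent(adj2, a) == parent(adj2, b)
    return common and cherry_bc and cherry_ab

# Hypothetical trees on a, b, c, d, e: T has cherry {b, c}, T' has cherry {a, b}.
t1 = [("d", "q"), ("e", "q"), ("q", "pa"), ("a", "pa"),
      ("pa", "p"), ("b", "p"), ("c", "p")]
t2 = [("a", "r"), ("b", "r"), ("r", "pc"), ("c", "pc"),
      ("pc", "q"), ("d", "q"), ("e", "q")]
print(star_3_star_applicable(t1, t2, ("a", "b", "c")))
```

Because the definition is stated for one orientation only, a full search would also try the reversed chain and the two trees in swapped roles.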

Lemma 2

Let T and $T'$ be two unrooted binary phylogenetic trees on X, and let $T_r$ and $T_r'$ be two trees obtained from T and $T'$, respectively, by a single application of the $(*,3,*)$-reduction. Then $d_\mathrm{TBR}(T_r, T'_r) = d_\mathrm{TBR}(T,T') - 1$.

Proof

Without loss of generality, we establish the lemma using the same notation as in the definition of a $(*,3,*)$-reduction. Let $F_r$ be a maximum agreement forest for $T_r$ and $T_r'$, and let F be a maximum agreement forest for T and $T'$. Then $F_r\cup \{\{a,b,c\}\}$ is an agreement forest for T and $T'$ which implies that $|F_r| \ge |F|-1$. Hence,

$$\begin{aligned} {d_\mathrm{TBR}(T_r, T'_r) = |F_r|-1 \ge |F|-2 = d_\mathrm{TBR}(T,T') - 1.} \end{aligned}$$

By Theorem 5, we may assume that C is preserved in F. (Formally, we apply the theorem to the set $K=\{ C \}$.) Hence, there exists a component $B_{abc}$ in F such that $C\subseteq B_{abc}$. Towards a contradiction, assume that $C\subset B_{abc}$ and let $x\in B_{abc} {\setminus } C$. Then, as $\{b,c\}$ is a cherry in T and $\{a,b\}$ is a cherry of $T'$, it follows that bc|ax is a quartet of $T|B_{abc}$ and ab|cx is a quartet of $T'|B_{abc}$; thereby contradicting that F is an agreement forest for T and $T'$. Now, since $B_{abc}=C$, we have that $F{\setminus } \{B_{abc}\}$ is an agreement forest for $T_r$ and $T_r'$. Hence, $|F_r| \le |F|-1$, which yields

$$\begin{aligned} d_\mathrm{TBR}(T, T') -1= |F|-2 \ge |F_r|-1 = d_\mathrm{TBR}(T_r,T_r'). \end{aligned}$$

$\square $

3.2 $(3,1,*)$-Reduction

Let T and $T'$ be two unrooted binary phylogenetic trees on X, and let $C = (a,b,c)$ be a 3-chain that is common to T and $T'$. If $T'$ has cherry $\{b,c\}$ and T has cherry $\{c,x\}$ for some element $x\in X{\setminus }{C}$, then a $(3,1,*)$-reduction is the operation of deleting x from T and $T'$, i.e. we set $T_r = T|X {\setminus } \{x\}$ and $T'_r = T'|X {\setminus } \{x\}$. Informally, x prevents C from being a common pendant subtree of T and $T'$.

Lemma 3

Let T and $T'$ be two unrooted binary phylogenetic trees on X, and let $T_r$ and $T_r'$ be two trees obtained from T and $T'$, respectively, by a single application of the $(3,1,*)$-reduction. Then $d_\mathrm{TBR}(T_r, T'_r) = d_\mathrm{TBR}(T,T') - 1.$

Proof

Again without loss of generality, we establish the lemma using the same notation as in the definition of a $(3,1,*)$-reduction. Let $F_r$ be a maximum agreement forest for $T_r$ and $T_r'$, and let F be a maximum agreement forest for T and $T'$. To show that $d_\mathrm{TBR}(T_r, T'_r) \ge d_\mathrm{TBR}(T,T') - 1,$ we apply the same argument as in the first part of the proof of Lemma 2 with the modification of considering $F_r\cup \{\{x\}\}$ (instead of $F_r\cup \{C\}$) as an agreement forest for T and $T'$. Now, by Theorem 5, we may assume that C is preserved in F. Let $B_x$ be the component of F that contains x, and let $B_{abc}$ be the component of F such that $C\subseteq B_{abc}$. By the choice of F, $B_{abc}$ exists. Clearly, $B_x\ne B_{abc}$ since, otherwise, ab|cx is a quartet of $T|B_x$ but not a quartet of $T'|B_x$. Moreover, if $|B_x|\ge 2$, then $T[B_x]$ and $T[B_{abc}]$ are not vertex-disjoint in T. It now follows that $B_x=\{x\}$ and that $F{\setminus } \{B_x\}$ is an agreement forest for $T_r$ and $T_r'$. Hence, $|F_r| \le |F|-1$, so we have

$$\begin{aligned} d_\mathrm{TBR}(T, T') -1= |F|-2 \ge |F_r|-1 = d_\mathrm{TBR}(T_r,T_r'). \end{aligned}$$

$\square $

Following on from the proof of Lemma 3, note that $\{a,b,c\}$ is the leaf set of a pendant subtree that is common to $T_r$ and $T_r'$. The two reduced trees can, therefore, be further reduced by the subtree reduction.

3.3 (2, 1, 2)-Reduction

Let T and $T'$ be two unrooted binary phylogenetic trees on X. Furthermore, let $\{a,b,c,d,x\}\subseteq X$ such that $C_1=(a,b)$ and $C_2=(c,d)$ are two 2-chains that are common to T and $T'$. If T has cherries $\{b,x\}$ and $\{c,d\}$, and if $T'$ has cherries $\{a,b\}$ and $\{d,x\}$, then a (2, 1, 2)-reduction is the operation of obtaining $T_r$ and $T_r'$ from T and $T'$, respectively, by deleting x from T and $T'$, i.e. $T_r = T|X {\setminus } \{x\}$ and $T'_r = T'|X {\setminus } \{x\}$.

Lemma 4

Let T and $T'$ be two unrooted binary phylogenetic trees on X, and let $T_r$ and $T_r'$ be two trees obtained from T and $T'$, respectively, by a single application of the (2, 1, 2)-reduction. Then $d_\mathrm{TBR}(T_r, T'_r) = d_\mathrm{TBR}(T,T') - 1.$

Proof

Without loss of generality, we may assume that the two common 2-chains $C_1$ and $C_2$ and their respective configurations in T and $T'$ are exactly as described in the definition of a (2, 1, 2)-reduction. Then, $d_\mathrm{TBR}(T_r, T'_r) \ge d_\mathrm{TBR}(T,T') - 1$ follows as described in the proof of Lemma 3. We establish the lemma by showing that $d_\mathrm{TBR}(T_r, T'_r) \le d_\mathrm{TBR}(T,T') - 1$. Let F be a maximum agreement forest for T and $T'$, and let $F_r$ be a maximum agreement forest for $T_r$ and $T_r'$. By Theorem 5, we may assume that $C_1$ and $C_2$ are preserved in F. (Formally, we apply the theorem to the set of chains $K=\{ C_1, C_2 \}$, noting that each chain is pendant in one of the two trees.) Let $B_{ab}$ be the element in F that contains $C_1$ and, similarly, let $B_{cd}$ be the element in F that contains $C_2$. Towards showing that $\{x\}\in F$, first assume that there exists an element $B_x\in F$ such that $|B_x|\ge 2$ and $B_x\cap \{a,b,c,d,x\}=\{x\}$. Then, it is straightforward to check that $T[B_x]$ and $T[B_{ab}]$ are not vertex-disjoint in T; a contradiction. Thus, x is either contained in $B_{ab}$ or $B_{cd}$, or $\{x\}$ is an element in F. Now, if $B_{ab}=B_{cd}$ and $x\in B_{ab}$, then ax|cd is a quartet of $T|B_{ab}$ while ac|dx is a quartet of $T'|B_{ab}$; a contradiction. Otherwise, if $B_{ab}\ne B_{cd}$ and $x\in B_{ab}$, then $T'[B_{ab}]$ and $T'[B_{cd}]$ are not vertex-disjoint in $T'$. Symmetrically, if $B_{ab}\ne B_{cd}$ and $x\in B_{cd}$, then $T[B_{ab}]$ and $T[B_{cd}]$ are not vertex-disjoint in T. It now follows that $\{x\}\in F$ and that $F{\setminus } \{\{x\}\}$ is an agreement forest for $T_r$ and $T_r'$. Hence, we have

$$\begin{aligned} d_\mathrm{TBR}(T, T') -1= |F|-2 \ge |F_r|-1= d_\mathrm{TBR}(T_r,T_r'). \end{aligned}$$

$\square $

After performing a (2, 1, 2)-reduction, note that the two reduced trees $T_r$ and $T_r'$ have common pendant subtrees on leaf sets $\{a,b\}$ and $\{c,d\}$, respectively, that can be reduced further under the subtree reduction.

3.4 (3, 3)- and (3, 2)-Reduction

Let T and $T'$ be two unrooted binary phylogenetic trees on X. The next reduction can be applied in two slightly different situations. The first situation considers two 3-chains while the second situation considers one 3-chain and one 2-chain. We start by formally describing the first situation. Let $\{a,b,c,x,y,z\}\subseteq X$ such that $C_1=(a,b,c)$ and $C_2=(x,y,z)$ are two 3-chains that are common to T and $T'$. If T has cherries $\{b,c\}$ and $\{x,y\}$, and if $T'$ has a 6-chain (a, b, c, x, y, z) then a (3, 3)-reduction is the operation of obtaining $T_r$ and $T_r'$ from T and $T'$, respectively, by deleting x and y from T and $T'$, i.e. $T_r = T|X {\setminus } \{x,y\}$ and $T'_r = T'|X {\setminus } \{x,y\}$. We now turn to the second situation. Let $\{a,b,c,y,z\}\subseteq X$ such that $C_1=(a,b,c)$ and $C_2=(y,z)$ are two chains that are common to T and $T'$. If T has cherries $\{b,c\}$ and $\{y,z\}$, and if $T'$ has a 5-chain (a, b, c, y, z), then a (3, 2)-reduction is the operation of obtaining $T_r$ and $T_r'$ from T and $T'$, respectively, by deleting y from T and $T'$, i.e. $T_r = T|X {\setminus } \{y\}$ and $T'_r = T'|X {\setminus } \{y\}$.

Lemma 5

Let T and $T'$ be two unrooted binary phylogenetic trees on X, and let $T_r$ and $T_r'$ be two trees obtained from T and $T'$, respectively, by a single application of the (3, 3)- or the (3, 2)-reduction. Then $d_\mathrm{TBR}(T_r, T'_r) = d_\mathrm{TBR}(T,T').$

Proof

Without loss of generality, we may assume that the two common chains $C_1$ and $C_2$ and their respective configurations in T and $T'$ are exactly as described in the paragraph prior to the statement of this lemma. Let $Y=\{a,b,c,x,y,z\}$ if $T_r$ and $T_r'$ have been obtained from T and $T'$ by a (3, 3)-reduction and, otherwise, let $Y=\{a,b,c,y,z\}$.

Given that $T_r$ and $T'_r$ are induced subtrees of T and $T'$ respectively (i.e. obtained from T and $T'$ using the “|” operator), it follows from Lemma 2.11 of [1] that

$$\begin{aligned} {d_\mathrm{TBR}(T_r, T'_r)\le d_\mathrm{TBR}(T,T').} \end{aligned}$$

To establish the other direction, let $F_r$ be a maximum agreement forest for $T_r$ and $T_r'$, and let F be a maximum agreement forest for T and $T'$. By Theorem 5, we may assume that $C_1$ is preserved in $F_r$. Let $B_{abc}$ be the element in $F_r$ such that $C_1\subseteq B_{abc}$. Similarly, let $B_z$ be the element in $F_r$ such that $z\in B_z$. We have $B_{abc}\ne B_z$ since, otherwise, bc|az is a quartet of $T_r|B_{abc}$ while ab|cz is a quartet of $T_r'|B_{abc}$; a contradiction. Next, observe that if $B_{abc}$ contains some taxon $d \not \in \{a,b,c,z\}$, then d is a leaf in the subtree Q of $T_r'$, as depicted in Fig. 3(iv) and (v). If this was not so, then ad|bc would be a quartet of $T_r|B_{abc}$ while cd|ab would be a quartet of $T_r'|B_{abc}$; a contradiction. Combining these facts yields the insight that, in $T_r'$, the edge between the parent of z and the parent of c (if such an edge exists) is not on the embedding of any component in $F_r$. Since $F_r$ is an agreement forest for $T_r$ and $T_r'$, it now follows that

$$\begin{aligned} (F_r{\setminus }\{B_z\})\cup \{B_z\cup \{x,y\}\} \end{aligned}$$

is an agreement forest for T and $T'$ if a (3, 3)-reduction has been applied and that

$$\begin{aligned} (F_r{\setminus }\{B_z\})\cup \{B_z\cup \{y\}\} \end{aligned}$$

is an agreement forest for T and $T'$ if a (3, 2)-reduction has been applied. Hence,

$$\begin{aligned} d_\mathrm{TBR}(T_r, T'_r) = |F_r|-1 \ge |F|-1 = d_\mathrm{TBR}(T,T'). \end{aligned}$$

$\square $

We end this section by noting that it takes $O( \text {poly}(|X|))$ time to test if any of the new reductions presented in Sects. 3.1–3.4 can be applied. While the (3, 2)- and (3, 3)-reductions preserve the TBR distance, each of the other three new reductions reduces the parameter by exactly one, i.e. the TBR distance for the unreduced trees can be calculated by computing the TBR distance for the reduced trees and adding one to the result.
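The bookkeeping implied by Lemmas 1 to 5 can be summarized in a short Python sketch. The dictionary keys and the function name are hypothetical labels chosen for illustration; the values record, per reduction, by how much the TBR distance drops according to the corresponding lemma.

```python
# Drop in TBR distance caused by one application of each reduction
# (Lemma 1 for subtree/chain, Lemmas 2-5 for the five new reductions).
PARAMETER_DROP = {
    "subtree": 0,
    "chain": 0,
    "(*,3,*)": 1,
    "(3,1,*)": 1,
    "(2,1,2)": 1,
    "(3,3)": 0,
    "(3,2)": 0,
}

def original_distance(reduced_distance, applied_reductions):
    """Recover d_TBR of the unreduced trees from d_TBR of the reduced trees."""
    return reduced_distance + sum(PARAMETER_DROP[r] for r in applied_reductions)

# Hypothetical run: three reductions were applied and an exact solver
# reported distance 4 on the reduced pair.
print(original_distance(4, ["chain", "(3,1,*)", "(3,3)"]))
```

Only the three parameter-reducing rules contribute to the sum, so the original distance is the reduced distance plus the number of times those rules fired.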

4 A New Kernel for Computing the TBR Distance

In this section, we establish the main result of this paper. To formally state it, we require a new definition. Let T and $T'$ be two unrooted binary phylogenetic trees on X. We say that T and $T'$ are exhaustively reduced if they are subtree and chain reduced, and none of the five reductions presented in Sect. 3 can be applied to T and $T'$.

Theorem 6

Let T and $T'$ be two exhaustively reduced unrooted binary phylogenetic trees on X. If $d_\mathrm{TBR}(T,T')\ge 2$, then $|X|\le 11d_\mathrm{TBR}(T,T')-9$.

To establish this theorem, we analyze the maximum size of two exhaustively reduced phylogenetic trees with the help of an unrooted binary phylogenetic network N that displays the two trees and the unrooted generator that underlies N. Next, we define unrooted generators.

Let k be a positive integer. For $k\ge 2$, a k-generator (or simply a generator when k is clear from the context) is a connected cubic multigraph with edge set E and vertex set V such that $k=|E|-(|V|-1)$. The edges of a generator are called its sides. Intuitively, given an unrooted binary phylogenetic network N with $r(N)=k$, we can obtain a k-generator by repeatedly deleting all (labeled and unlabeled) leaves and suppressing any resulting degree-2 vertices. We say that the generator obtained in this way underlies N. Now, let G be a k-generator, let $\{u,v\}$ be a side of G, and let Y be a set of leaves. The operation of subdividing $\{u,v\}$ with |Y| new vertices and, for each such new vertex w, adding a new edge $\{w,\ell \}$, where $\ell \in Y$ and Y bijectively labels the new leaves, is referred to as attaching Y to $\{u,v\}$. Finally, if at least one new leaf is attached to each loop and to each pair of parallel edges in G, then the resulting graph is an unrooted binary phylogenetic network N with $r(N)=k$. Note that N has no pendant subtree with more than a single leaf. Hence, we have the following observation.

Observation 7

Let N be an unrooted binary phylogenetic network that has no pendant subtree with at least two leaves, and let G be a generator. Then G underlies N if and only if N can be obtained from G by attaching a (possibly empty) set of leaves to each side of G.
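To make the definition of a k-generator and of attaching leaves concrete, here is a minimal sketch. It assumes a multigraph is given as a list of vertex pairs (loops and repeated pairs allowed); the function names and the naming scheme for subdivision vertices are illustrative assumptions.

```python
from collections import Counter


def generator_parameter(edges):
    """Return k = |E| - (|V| - 1) for a connected cubic multigraph."""
    degree = Counter()
    adjacency = {}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1  # a loop (u == v) therefore contributes 2
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    if any(d != 3 for d in degree.values()):
        raise ValueError("multigraph is not cubic")
    # Connectivity check by depth-first search.
    start = next(iter(adjacency))
    seen = {start}
    stack = [start]
    while stack:
        for w in adjacency[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    if len(seen) != len(adjacency):
        raise ValueError("multigraph is not connected")
    return len(edges) - (len(degree) - 1)


def attach_leaves(edges, side_index, leaves):
    """Subdivide the chosen side once per leaf and hang each leaf off
    its own new subdivision vertex, in the given order."""
    u, v = edges[side_index]
    new_edges = edges[:side_index] + edges[side_index + 1:]
    previous = u
    for leaf in leaves:
        w = ("sub", side_index, leaf)  # hypothetical name for the new vertex
        new_edges.append((previous, w))
        new_edges.append((w, leaf))
        previous = w
    new_edges.append((previous, v))
    return new_edges


theta = [("u", "v"), ("u", "v"), ("u", "v")]     # three parallel edges
dumbbell = [("u", "u"), ("u", "v"), ("v", "v")]  # two loops joined by an edge
k4 = [(a, b) for a in range(4) for b in range(a + 1, 4)]
network_edges = attach_leaves(theta, 0, ["x1", "x2"])
```

Since a cubic multigraph satisfies $2|E|=3|V|$, a k-generator has $|V|=2(k-1)$ vertices and $|E|=3(k-1)$ sides, which the three examples confirm for $k=2$ and $k=3$.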

Now let T and $T'$ be two subtree and chain reduced unrooted binary phylogenetic trees on X, and let N be an unrooted binary phylogenetic network on X that displays T and $T'$. Let S and $S'$ be spanning trees of N obtained by greedily extending a subdivision of T (respectively, $T'$) to become a spanning tree, if it is not that already. Since N displays T and $T'$, S and $S'$ exist. Furthermore, let G be the generator that underlies N. Since T and $T'$ are subtree and chain reduced, N does not have a pendant subtree of size at least two. Hence, by Observation 7, we can obtain N from G by attaching leaves to G. Let $s=\{u,w\}$ be a side of G. Let $Y=\{\ell _1,\ell _2,\ldots ,\ell _m\}$ be the set of leaves that are attached to s in obtaining N from G. Recall that $m \ge 0$. Then there exists a path

$$\begin{aligned} u=v_0,v_1,v_2,\ldots ,v_m,v_{m+1}=w \end{aligned}$$

of vertices in N such that, for each $i\in \{1,2,\ldots ,m\}$, $v_i$ is the unique parent of $\ell _i$. We refer to this path as the path associated with s and denote it by $P_s$. Importantly, for a path $P_s$ in N that is associated with a side s of G, there is at most one edge in $P_s$ that is not contained in S, and there is at most one (not necessarily distinct) edge in $P_s$ that is not contained in $S'$. We make this precise in the following definition and say that s has b breakpoints relative to S and $S'$, where

1.
$b=0$ if S and $S'$ both contain all edges of $P_s$,
2.
$b=1$ if one element in $\{S,S'\}$ contains all edges of $P_s$ while the other element contains all but one edge of $P_s$, and
3.
$b=2$ if each of S and $S'$ contains all but one edge of $P_s$.

Since S and $S'$ span N, note that s cannot have more than 2 breakpoints relative to S and $S'$.
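The breakpoint count of a side can be computed directly from the definition, as in the following minimal sketch. It assumes that the path $P_s$ is given as a list of vertices and that each spanning tree is given as a set of two-element frozensets; identifying edges by their endpoints is a simplification that ignores parallel edges, and all names are illustrative.

```python
def breakpoints(path_vertices, edges_S, edges_S_prime):
    """Return the number b of breakpoints of a side whose associated
    path is path_vertices, relative to spanning trees S and S'."""
    path_edges = [frozenset(pair) for pair in zip(path_vertices, path_vertices[1:])]
    missing_S = [e for e in path_edges if e not in edges_S]
    missing_S_prime = [e for e in path_edges if e not in edges_S_prime]
    if len(missing_S) > 1 or len(missing_S_prime) > 1:
        raise ValueError("a spanning tree can miss at most one edge of the path")
    # The same edge missing from both trees still counts as two breakpoints.
    return len(missing_S) + len(missing_S_prime)


path = ["v0", "v1", "v2", "v3", "v4"]  # a side with m = 3 attached leaves
all_path_edges = {frozenset(p) for p in zip(path, path[1:])}
without_23 = all_path_edges - {frozenset({"v2", "v3"})}

b_none = breakpoints(path, all_path_edges, all_path_edges)
b_one = breakpoints(path, all_path_edges, without_23)
b_two = breakpoints(path, without_23, without_23)
```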

In the language of this paper, Kelk and Linz [12] have established the following result.

Lemma 6

Let N be an unrooted binary phylogenetic network on X that displays two subtree and chain reduced unrooted binary phylogenetic trees T and $T'$. Let S (resp. $S'$) be a spanning tree of N obtained by extending a subdivision of T (resp. $T'$). Furthermore, let G be the generator that underlies N, and let s be a side of G. Suppose that s has b breakpoints relative to S and $S'$ for some ${b}\in \{0,1,2\}$. Then,

(i)
if $b=0$, then N can be obtained from G by attaching at most 3 leaves to s;
(ii)
if $b=1$, then N can be obtained from G by attaching at most 6 leaves to s; and
(iii)
if $b=2$, then N can be obtained from G by attaching at most 9 leaves to s.

Since Lemma 6 only considers the subtree and chain reduction, a natural question is whether or not the five reductions presented in Sect. 3 improve the bounds on the number of leaves that are attached to a side of a generator. We answer this question positively in the next lemma.

Lemma 7

Let N be an unrooted binary phylogenetic network on X that displays two exhaustively reduced unrooted binary phylogenetic trees T and $T'$. Let S (resp. $S'$) be a spanning tree of N obtained by extending a subdivision of T (resp. $T'$). Furthermore, let G be the generator that underlies N, and let $s=\{u,v\}$ be a side of G. Suppose that s has b breakpoints relative to S and $S'$ for some ${b}\in \{0,1,2\}$. Then,

(i)
if $b=0$, then N can be obtained from G by attaching at most 3 leaves to s;
(ii)
if $b=1$, then N can be obtained from G by attaching at most 4 leaves to s; and
(iii)
if $b=2$, then N can be obtained from G by attaching at most 4 leaves to s.

Proof

By Lemma 6(i), (i) follows immediately.

To establish (ii), we show that neither 5 nor 6 leaves are attached to s and note that, by Lemma 6(ii), no more than 6 leaves are attached to s. Without loss of generality, we may assume that S contains all edges of $P_s$ and that $S'$ contains all but one edge of $P_s$. Let e be the edge of $P_s$ that $S'$ does not contain. First, assume that 6 leaves are attached to s. Let $P_s=v_0,v_1,v_2,\ldots ,v_{6},v_7$. Recall that $u=v_0$ and $v=v_7$. For each $i\in \{1,2,\ldots ,6\}$, let $\ell _i$ be the leaf adjacent to $v_i$ in N. If $e\ne \{v_3,v_4\}$, then T and $T'$ have a common chain of length at least 4; a contradiction since T and $T'$ are chain reduced. On the other hand, if $e=\{v_3,v_4\}$, then T and $T'$ have two common 3-chains $(\ell _1,\ell _2,\ell _3)$ and $(\ell _4,\ell _5,\ell _6)$ such that $(\ell _1,\ell _2,\ldots ,\ell _6)$ is a chain of T, and both of $\{\ell _2,\ell _3\}$ and $\{\ell _4,\ell _5\}$ are cherries of $T'$. Hence, T and $T'$ can be further reduced under a (3, 3)-reduction; again a contradiction. Second, assume that 5 leaves are attached to s. Let $P_s=v_0,v_1,v_2,\ldots ,v_{5},v_6$. Since T and $T'$ are chain reduced, we use an argument analogous to the previous 6-leaf case to show that $e\in \{\{v_2,v_3\},\{v_3,v_4\}\}$. If $e=\{v_2,v_3\}$, then T and $T'$ have common chains $(\ell _1,\ell _2)$ and $(\ell _3,\ell _4,\ell _5)$ and T has a chain $(\ell _1, \ell _2, \ell _3,\ell _4,\ell _5)$, where $\ell _i$ is again the leaf adjacent to $v_i$ in N for each $i\in \{1,2,\ldots ,5\}$. Furthermore, $T'$ has cherries $\{\ell _1,\ell _2\}$ and $\{\ell _3,\ell _4\}$. It now follows that T and $T'$ can be further reduced under a (3, 2)-reduction; a contradiction to the fact that both trees are exhaustively reduced. If $e=\{v_3,v_4\}$, we use a symmetric argument to get the same contradiction; thereby establishing (ii).

We complete the proof by showing that (iii) holds. Throughout this part of the proof, suppose that at least 5 leaves get attached to s in the process of obtaining N from G since, otherwise, (iii) follows without proof. Again, consider the path

$$\begin{aligned} P_s=v_0,v_1,v_2,\ldots ,v_{m+1} \end{aligned}$$

that is associated with s in N. Recall that m is the number of leaves that are attached to s. Hence $m\ge 5$. Let $\ell _i$ be the leaf adjacent to $v_i$ in N for each $i\in \{1,2,\ldots ,m\}$. Furthermore, let $e=\{v_i,v_j\}$ be the edge of $P_s$ that is not contained in S, and let $f=\{v_{i'},v_{j'}\}$ be the edge of $P_s$ that is not contained in $S'$. Without loss of generality, we may assume that $i=j-1$, $i' = j'-1$, and $i\le i'$. Moreover, note that if $i < i'$ then $C=(\ell _{i+1},\ell _{i+2},\ldots ,\ell _{i'})$ is an $(i'-i)$-chain that is common to T and $T'$. Considering four cases and deriving a contradiction for each, we next show that $i'-i=1$.

Case 1. If $i'-i>3$, then C has length at least 4 and T and $T'$ are not chain reduced.
Case 2. If $i'-i=3$, then $C=(\ell _{i+1},\ell _{i+2},\ell _{i+3})$ is a maximal common 3-chain of T and $T'$. Moreover, as $\{\ell _{i+1},\ell _{i+2}\}$ is a cherry of T and $\{\ell _{i+2},\ell _{i+3}\}$ is a cherry of $T'$, it follows that a $(*,3,*)$-reduction can be applied to T and $T'$; thereby contradicting that T and $T'$ are exhaustively reduced.
Case 3. If $i'-i=2$, then $C=(\ell _{i+1},\ell _{i+2})$ is a maximal common 2-chain of T and $T'$. In particular, C is the leaf set of a pendant subtree that is common to T and $T'$ and that can be further reduced under the subtree reduction.
Case 4. If $i'-i=0$, then $\{\ell _1,\ell _2,\ldots ,\ell _i\}$ and $\{\ell _j,\ell _{j+1},\ldots ,\ell _{m}\}$ are the leaf sets of two pendant subtrees that are common to T and $T'$. Since $m\ge 5$, one of these subtrees has size at least two and, so, T and $T'$ can be further reduced under the subtree reduction.

All four cases contradict the fact that T and $T'$ are exhaustively reduced. Thus, we may assume for the remainder of the proof that, if $m\ge 5$, then $i'-i=1$.

We next establish a maximum for i and minimum for $i'$. Clearly, if $i>3$, then $(\ell _1,\ell _2,\ldots ,\ell _i)$ is a chain of length at least 4 that is common to T and $T'$ that can be reduced by applying a chain reduction. Moreover, if $i=3$, first recall that $i'=i+1$. It then follows that $(\ell _1,\ell _2,\ell _3)$ is a chain that is common to T and $T'$, $\{\ell _2,\ell _3\}$ is a cherry of T, and $\{\ell _3,\ell _4\}$ is a cherry of $T'$. Hence, T and $T'$ can be further reduced by applying a $(3,1,*)$-reduction, where $\ell _4$ takes on the role of x in the definition of this reduction. Hence $i\le 2$. By symmetry and applying an analogous argument, we derive that $i'\ge m-2$. In summary, under the assumption that $m\ge 5$, we have established the following three restrictions on the indices i and $i'$:

$$\begin{aligned}&i\le 2;\\&i'= i+1, \text { and}\\&i'\ge m-2. \end{aligned}$$

Taking all three restrictions together, it follows that $m\le 5$. So suppose that $m=5$. Then the three restrictions imply that $e=\{v_2,v_3\}$ and $f=\{v_3,v_4\}$. Furthermore, $(\ell _1,\ell _2)$ and $(\ell _4,\ell _5)$ are two 2-chains that are common to T and $T'$ such that T has cherries $\{\ell _1,\ell _2\}$ and $\{\ell _3,\ell _4\}$, and $T'$ has cherries $\{\ell _2,\ell _3\}$ and $\{\ell _4,\ell _5\}$. With $\ell _3$ taking on the role of x in the definition of a (2, 1, 2)-reduction, it now follows that T and $T'$ can be further reduced under this reduction. This contradicts our initial assumption that $m\ge 5$; thereby establishing (iii). $\square $

We can now clarify the rather cryptic names of the new reduction rules. From the proof of Lemma 7 we can see that a side s that has 2 breakpoints, indexed by i and $i'$, respectively (where $i < i'$), induces three common chains of length i, $i'-i$ and $m-i'$. We can summarize these three lengths in a vector $(i, i'-i, m-i')$. Then the $(*,3,*)$-reduction can be applied when $i'-i = 3$, irrespective of the values of i and $m-i'$, and we denote this indifference using wildcard symbols. The same idea applies to the $(3,1,*)$- and the (2, 1, 2)-reduction. For sides with a single breakpoint at position i, the vector of common chain lengths induced is given by $(i, m-i)$. Then essentially, the (3, 3)- and the (3, 2)-reduction capture the situation when a 6-chain (resp. 5-chain) in $T'$ is split into two shorter chains in T by a breakpoint at $i=3$.
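The case analysis for sides with two breakpoints can be replayed mechanically on the vector $(i, i'-i, m-i')$. The sketch below is an illustration only: it encodes the exclusion arguments as read from the proof of Lemma 7(iii), with hypothetical function names, and applies them to every m even though the proof only needs them for $m\ge 5$. A return value of `None` therefore means "not excluded by these particular arguments", nothing more.

```python
def blocking_rule(i, i_prime, m):
    """Name a reduction that applies to a two-breakpoint side with vector
    (i, i' - i, m - i'), or return None if none of the proof's arguments fires."""
    left, middle, right = i, i_prime - i, m - i_prime
    if middle >= 4:
        return "chain"
    if middle == 3:
        return "(*,3,*)"
    if middle == 2:
        return "subtree"
    if middle == 0:
        # Both trees miss the same edge: two common pendant subtrees.
        return "subtree" if max(left, right) >= 2 else None
    # From here on middle == 1.
    if max(left, right) >= 4:
        return "chain"
    if max(left, right) == 3:
        return "(3,1,*)"
    if left == 2 and right == 2:
        return "(2,1,2)"
    return None


def surviving_vectors(max_m):
    """List all vectors with m <= max_m that no rule above excludes."""
    survivors = []
    for m in range(max_m + 1):
        for i in range(m + 1):
            for i_prime in range(i, m + 1):
                if blocking_rule(i, i_prime, m) is None:
                    survivors.append((i, i_prime - i, m - i_prime))
    return survivors


survivors = surviving_vectors(12)
largest_m = max(sum(vector) for vector in survivors)
```

Under these rules the largest surviving vectors are $(2,1,1)$ and $(1,1,2)$, matching the bound of 4 leaves in Lemma 7(iii).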

We are now in a position to establish Theorem 6.

Proof of Theorem 6

Let N be an unrooted binary phylogenetic network on X that displays T and $T'$ such that

$$\begin{aligned} r(N)=h^u(T,T')=d_\mathrm{TBR}(T,T')=k\ge 2, \end{aligned}$$

where the second equality follows from Theorem 1. Let S and $S'$ be spanning trees of N that are obtained by extending subdivisions of T and $T'$, respectively. Furthermore, let G be the generator that underlies N. To establish the theorem, we use Lemma 7 to bound from above the number of leaves that can collectively be attached to G over all of its sides. The following approach is similar to the one used in [12, Lemma 3]. By [12, Lemma 1], G has $3(k-1)$ sides. Furthermore N contains exactly k edges that are not contained in S, and exactly k edges that are not contained in $S'$. Each of these edges induces a breakpoint on a corresponding side of G, so each side of G can have 0, 1 or 2 breakpoints and there are 2k breakpoints in total. Let q be the number of sides in G that have two breakpoints relative to S and $S'$. Noting that $0\le q\le k$, it follows that there are $2(k-q)$ sides in G whose number of breakpoints is one relative to S and $S'$. Hence, there are $3(k-1)-(2k-q)$ sides in G that have zero breakpoints relative to S and $S'$. Since T and $T'$ are exhaustively reduced, we now apply Lemma 7 to derive the following inequality:

$$\begin{aligned} |X|\le 4q+4(2(k-q))+3(3(k-1)-(2k-q))=-q+11k-9. \end{aligned}$$

Clearly, $-q+11k-9$ is maximum for $q=0$ and, so, we have

$$\begin{aligned} |X|\le -q+11k-9\le 11k-9=11d_\mathrm{TBR}(T,T')-9 \end{aligned}$$

which establishes the theorem. $\square $
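The counting argument is easy to check numerically. The following minimal sketch, with an illustrative function name, sums the per-side caps over the $3(k-1)$ sides of the generator; the caps (3, 4, 4) are those of Lemma 7 and the caps (3, 6, 9) are those of Lemma 6, which recovers the earlier $15k-9$ bound.

```python
def leaf_bound(k, q, cap0, cap1, cap2):
    """Upper bound on |X| when q sides have two breakpoints and sides with
    0, 1, 2 breakpoints carry at most cap0, cap1, cap2 leaves respectively."""
    sides = 3 * (k - 1)
    two_breakpoints = q
    one_breakpoint = 2 * (k - q)
    zero_breakpoints = sides - (2 * k - q)
    return cap2 * two_breakpoints + cap1 * one_breakpoint + cap0 * zero_breakpoints


checked = 0
for k in range(2, 30):
    for q in range(0, k + 1):
        if 3 * (k - 1) - (2 * k - q) < 0:
            continue  # skip combinations with a negative number of sides
        assert leaf_bound(k, q, 3, 4, 4) == 11 * k - 9 - q   # Lemma 7 caps
        assert leaf_bound(k, q, 3, 6, 9) == 15 * k - 9       # Lemma 6 caps
        checked += 1
```

With the Lemma 7 caps the bound decreases as q grows, so the worst case is $q=0$, whereas with the Lemma 6 caps the dependence on q cancels.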

We finish the section with an additional kernel result that establishes an even smaller kernel for particular trees.

Corollary 1

Let T and $T'$ be two unrooted binary phylogenetic trees on X. If $d_\mathrm{TBR}(T,T')\ge 2$, T and $T'$ are subtree reduced and do not have any common n-chain with $n\ge 2$, then $|X|\le 5d_\mathrm{TBR}(T,T') - 3$.

Proof

As previously, let N be an unrooted binary phylogenetic network on X that displays T and $T'$ and has the property that $r(N)=h^u(T,T')$. Let S and $S'$ be spanning trees of N obtained by extending subdivisions of T and $T'$, respectively, and let G be the generator underlying N. For a side s of G, it follows that we can attach at most 3 leaves to s if s has two breakpoints relative to S and $S'$. Similarly, we can attach at most 2 leaves (resp. 1 leaf) to s if s has one breakpoint (resp. zero breakpoints) relative to S and $S'$. Interestingly, and in comparison with the proof of Lemma 7, these upper bounds can be easily established using arguments that only rely on the (ordinary) subtree and chain reduction, but make no use of the five reductions presented in Sect. 3. Now, using the same counting argument as in the proof of Theorem 6, we have

$$\begin{aligned} |X|\le 5k-3=5d_\mathrm{TBR}(T,T')-3. \end{aligned}$$

$\square $
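For reference, the counting step in the proof of Corollary 1 can be written out as follows, where, as in the proof of Theorem 6, q denotes the number of sides of G with two breakpoints, so that there are $2(k-q)$ sides with one breakpoint and $3(k-1)-(2k-q)$ sides with zero breakpoints. With the caps 3, 2 and 1 the dependence on q cancels:

$$\begin{aligned} |X|\le 3q+2\cdot 2(k-q)+1\cdot (3(k-1)-(2k-q))=3q+(4k-4q)+(k-3+q)=5k-3. \end{aligned}$$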

5 Tightness of the Kernel Under the New Reductions

In this section, we show that, for two exhaustively reduced trees, the kernel result presented in Theorem 6 is tight. For each $k\ge 4$, we do this by providing two exhaustively reduced unrooted binary phylogenetic trees $T_k$ and $T_k'$ whose leaf sets have size $11k-9$, such that $d_\mathrm{TBR}(T_k,T_k')=k$. To illustrate, $T_k$ and $T_k'$ are shown in Fig. 4. It is straightforward to check that $T_k$ and $T_k'$ are exhaustively reduced. While we do not go into detail about justifying that $T_k$ and $T_k'$ indeed provide a tight example, i.e. $d_\mathrm{TBR}(T_k,T_k')=k$, we point the interested reader to [12, Section 4], where a very similar family of constructions is given to show that the kernel result presented in [12] is tight for phylogenetic trees that are subtree and chain reduced, and do not contain any common so-called cluster. The approach taken there, which uses unrooted generators to argue that $d_\mathrm{TBR}(T_k,T_k') \le k$ and maximum parsimony distance [8] to prove $d_\mathrm{TBR}(T_k,T_k') \ge k$, can be easily adapted to establish the following proposition from which tightness of the kernel presented in Theorem 6 immediately follows.^{Footnote 1}

Proposition 1

For $k\ge 4$, let $T_k$ and $T_k'$ be the two exhaustively reduced unrooted binary phylogenetic trees on X that are shown in Fig. 4. Then $d_\mathrm{TBR} (T_k,T_k')=k$.

6 Discussion and Future Work

To further lower the $11k-9$ bound using the approach described in this article requires reduction rules to prohibit generator sides from having 4 leaves (and 1 or 2 breakpoints), or 3 leaves (and 0 breakpoints). However, in such situations it is clear neither how to reduce the TBR distance by 1 nor how to reduce the number of taxa without altering the TBR distance. Hence, new techniques are required which do not just look “locally” at individual sides of the generator, but at the way multiple sides of the generator interact. We hope to return to this issue in future work. Interestingly, although it is not yet clear how to eliminate these cases in the context of kernelization, the analysis in our paper does convey additional structural information. For example, the argument behind the (3, 3)- and (3, 2)-reduction directly identifies an edge, in one of the trees, that can safely be deleted if we wish to progressively transform that tree into a maximum agreement forest. These edges can sometimes still be identified even in situations when our new reduction rules do not apply. Such insights, together with Theorem 5, can potentially be used by FPT branching algorithms that compute the TBR distance by iteratively deleting edges to obtain agreement forests (see e.g. [5]). Could the unrooted generator approach, coupled with the reduction rules described in this article, be used to reduce the branching factor of such algorithms?

Notes

1.
In fact, up to relabeling of the leaves, the trees shown here are obtained by repeatedly applying the (3, 3)-reduction to the trees shown in [12, Figure 2], whose TBR distance is there proven to be exactly k. Given that the (3, 3)-reduction is TBR-preserving, the claim follows.
2.
Essentially we are deleting two edges in $T|B = T'|B$. These two edges induce what in standard phylogenetic terminology are called “compatible splits”; they are compatible because the two edges are drawn from the same tree. Two splits are compatible if and only if at most 3 of the 4 described intersections are non-empty [13].

References

B. Allen and M. Steel. Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics, 5:1–15, 2001.
R. Atkins and C. McDiarmid. Extremal distances for subtree transfer operations in binary trees. Annals of Combinatorics, 23(1):1–26, 2019.
M. Baroni, C. Semple, and M. Steel. Hybrids in real time. Systematic Biology, 55(1):46–56, 2006.
M. Bordewich, C. Scornavacca, N. Tokac, and M. Weller. On the fixed parameter tractability of agreement-based phylogenetic distances. Journal of Mathematical Biology, 74(1-2):239–257, 2017.
J. Chen, J-H. Fan, and S-H. Sze. Parameterized and approximation algorithms for maximum agreement forest in multifurcating trees. Theoretical Computer Science, 562:496–512, 2015.
M. Cygan, F. Fomin, Ł. Kowalik, D. Lokshtanov, D. Marx, M. Pilipczuk, M. Pilipczuk, and S. Saurabh. Parameterized algorithms, volume 3. Springer, 2015.
J. Felsenstein. Inferring Phylogenies. Sinauer Associates, Incorporated, 2004.
M. Fischer and S. Kelk. On the Maximum Parsimony distance between phylogenetic trees. Annals of Combinatorics, 20(1):87–113, 2016.
F. Fomin, D. Lokshtanov, S. Saurabh, and M. Zehavi. Kernelization: Theory of Parameterized Preprocessing. Cambridge University Press, 2019.
J. Hein, T. Jiang, L. Wang, and K. Zhang. On the complexity of comparing evolutionary trees. Discrete Applied Mathematics, 71(1-3):153–169, 1996.
D. Huson, R. Rupp, and C. Scornavacca. Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press, 2011.
S. Kelk and S. Linz. A tight kernel for computing the tree bisection and reconnection distance between two phylogenetic trees. SIAM Journal on Discrete Mathematics, 33(3):1556–1574, 2019.
C. Semple and M. Steel. Phylogenetics. Oxford University Press, 2003.
F. Shi, J. Chen, Q. Feng, and J. Wang. A parameterized algorithm for the maximum agreement forest problem on multiple rooted multifurcating trees. Journal of Computer and System Sciences, 97:28–44, 2018.
L. van Iersel, S. Kelk, G. Stamoulis, L. Stougie, and O. Boes. On unrooted and root-uncertain variants of several well-known phylogenetic network problems. Algorithmica, 80(11):2993–3022, 2018.
C. Whidden, R. G. Beiko, and N. Zeh. Fixed-parameter algorithms for maximum agreement forests. SIAM Journal on Computing, 42(4):1431–1466, 2013.

Acknowledgements

The second author was supported by the New Zealand Marsden Fund. Both authors would also like to thank the reviewers for their insightful comments.

Author information

Authors and Affiliations

Department of Data Science and Knowledge Engineering (DKE), Maastricht University, Maastricht, The Netherlands
Steven Kelk
School of Computer Science, University of Auckland, Auckland, New Zealand
Simone Linz

Corresponding author

Correspondence to Steven Kelk.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Proof of Theorem 5

Proof

Let F be an arbitrary maximum agreement forest of T and $T'$. Let $C \in K$ be a chain as described in the statement of the theorem (i.e. it has length at least 3, or it has length 2 and it is pendant in at least one of T and $T'$). For shorthand we call these eligible chains. Suppose that C is split in F. We will show how to transform F into a new agreement forest $F'$, without increasing the number of components, such that C is preserved in $F'$ and such that all eligible chains that were preserved in F are also preserved in $F'$. Iterating this process will eventually bring us to a maximum agreement forest $F'$ with the desired property, and the proof will be complete. It is helpful to recall that all the chains in K are mutually taxa disjoint, and thus (by Observation 2) their embeddings are mutually vertex disjoint in both T and $T'$.

Let $J = \{B \in F : C \cap B \ne \emptyset \}$. We have assumed that C is split, so $|J| \ge 2$. If $B \in J$ and, additionally, $B \cap (X {\setminus } C) \ne \emptyset $, we call B an inside-outside component with respect to C. There can be at most 2 such components because a chain connects to the surrounding tree in (at most) 2 places. If C is not pendant in T, then deleting C from T naturally partitions $X {\setminus } C$ into two disjoint non-empty sets $L_T(C)$ and $R_T(C)$. Informally these are the taxa in T that are to the “left” and “right” of C. For the purpose of this proof it does not matter which side we designate as left and right. If C is not pendant in T, then we say that a component $B \in F$ straddles C in T if $L_T(C) \cap B \ne \emptyset $ and $R_T(C) \cap B \ne \emptyset $. The straddling relation only applies to non-pendant chains: if C is pendant in T then, by definition, it is not possible for a component of any agreement forest to straddle it in T.

We say that a component $B \in F$ is a bypass component in T with respect to C, if $B \not \in J$ (i.e. $B \cap C = \emptyset $) and B straddles C in T. Informally, B passes “through” C in T without including any of its taxa. If in a given context it does not matter whether B is a bypass component in T and/or $T'$, we simply say that B is a bypass component with respect to C. Figures 5 and 6 illustrate a number of these concepts. Note that, in the main proof below, we will need to consider the possibility that B (resp. $B'$) is a bypass component in T (resp. $T'$) with respect to C, but that $B \ne B'$, or that a component B is a bypass component in T and $T'$ with respect to C. The following observations will also be useful, for which we omit proofs.

Observation 8

Let C, F and J be as defined above.

(a)
F can contain at most two bypass components with respect to C, i.e. one per tree. If $B \in F$ is a bypass component in T and $T'$ with respect to C, then B is the only bypass component in F with respect to C.
(b)
If F contains at least one bypass component with respect to C, then C is atomized in F.
(c)
If F contains at least one inside-outside component with respect to C, it cannot contain any bypass components with respect to C.
(d)
A component $B \in J$ that is not an inside-outside component with respect to C has the property $B \subseteq C$.
(e)
If C is pendant in T and/or $T'$, then F contains at most one bypass component with respect to C, and at most one inside-outside component with respect to C.
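The terminology introduced above (the set J, inside-outside components, straddling and bypass components) can be sketched in code for a single tree T. This is a minimal illustration assuming that C is not pendant in T and that the forest, the chain and the sets $L_T(C)$ and $R_T(C)$ are given as sets of taxa; the function name, the labels and the toy data are assumptions, and the code does not check that the input is a genuine agreement forest.

```python
def classify_components(forest, chain, left_T, right_T):
    """Label each component of the forest relative to the chain C in T."""
    labels = {}
    for block in forest:
        touches_chain = bool(block & chain)       # membership in J
        touches_outside = bool(block - chain)
        straddles = bool(block & left_T) and bool(block & right_T)
        if touches_chain and touches_outside:
            labels[block] = "inside-outside"
        elif touches_chain:
            labels[block] = "inside"              # B is a subset of C
        elif straddles:
            labels[block] = "bypass"
        else:
            labels[block] = "unrelated"
    return labels


# Toy data (hypothetical taxa): chain C with L_T(C) to the left, R_T(C) to the right.
chain = frozenset({"c1", "c2", "c3"})
left_T = frozenset({"l1", "l2"})
right_T = frozenset({"r1", "r2"})
forest = [
    frozenset({"l1", "r1"}),  # passes "through" C without using its taxa
    frozenset({"c1"}), frozenset({"c2"}), frozenset({"c3"}),
    frozenset({"l2"}), frozenset({"r2"}),
]
labels = classify_components(forest, chain, left_T, right_T)
J = [block for block in forest if block & chain]
```

In this toy forest the bypass component coexists with an atomized chain and with no inside-outside component, which is the pattern described in Observation 8(b) and (c).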

We now start with the main proof. We distinguish several (sub)cases.

1.
C is pendant in neither T nor $T'$. In this case, $|C| \ge 3$, because chains of length 2 are assumed to be pendant in at least one tree.
1.1.
Suppose that F contains at least one bypass component with respect to C. Then C is atomized, by Observation 8(b). Now, recall Observation 8(a). We start by splitting the bypass component(s), as follows. If F contains a bypass component B such that B is a bypass component in T (but not in $T'$) with respect to C, we replace B by two components $L_T(C) \cap B$ and $R_T(C) \cap B$. Next, if F contains a bypass component $B'$ such that $B'$ is a bypass component in $T'$ (but not in T) with respect to C, we replace $B'$ by two components $L_{T'}(C) \cap B'$ and $R_{T'}(C) \cap B'$. A third possibility (which can only hold if neither of the two previous possibilities holds – see the second part of Observation 8(a)) is that F contains a bypass component B that is a bypass component in both T and $T'$ with respect to C. In this case, we replace B with non-empty components from the following list.
- $(L_T(C) \cap B) \cap (L_{T'}(C) \cap B)$,
- $(L_T(C) \cap B) \cap (R_{T'}(C) \cap B)$,
- $(R_T(C) \cap B) \cap (L_{T'}(C) \cap B)$,
- $(R_T(C) \cap B) \cap (R_{T'}(C) \cap B)$.
This captures the situation when we split the same component twice, because it bypasses C in both trees, rather than splitting two distinct components each once. Crucially, at most 3 of these sets can be non-empty. (If all four were non-empty, then this would contradict the $T|B = T'|B$ property of agreement forests^{Footnote 2}). Having split the bypass component(s), we next remove all components $\{x\}$ where $x \in C$, and introduce C as a single component. Splitting the bypass component(s) increases the number of components by at most 2, but replacing the singleton components with C reduces the number of components by at least 2, because $|C| \ge 3$, so we still have an optimal agreement forest. Now, suppose for the sake of a contradiction that a previously preserved eligible chain D is split by the modifications described above. Then, at least one of the following holds: (i) $D \cap L_T(C) \ne \emptyset $ and $D \cap R_T(C) \ne \emptyset $, (ii) $D \cap L_{T'}(C) \ne \emptyset $ and $D \cap R_{T'}(C) \ne \emptyset $. However, if (i) holds then $T[D] \cap T[C] \ne \emptyset $, and if (ii) holds then $T'[D] \cap T'[C] \ne \emptyset $, both of which contradict Observation 2.
1.2.
Suppose that F does not contain any bypass components with respect to C. We now look at the number of inside-outside components with respect to C. If there are 0 inside-outside components, then by Observation 8(d) all $B \in J$ have the property $B \subseteq C$. We remove all the $\ge 2$ components in J, and replace them with C, yielding a valid agreement forest with strictly fewer components than F, and thus a contradiction. If there are exactly 2 inside-outside components $B_1, B_2$, then we do the following to F: remove $B_1$ and $B_2$, discard any components $B_i$ such that $B_i \subseteq C$ and then finally add the single component $B_1 \cup B_2 \cup C$. This yields a smaller agreement forest and thus also a contradiction. The only subcase that remains is that there is exactly one inside-outside component $B_1 \in J$. We will illustrate this subcase with a number of additional figures.
1.2.1.
Suppose that $|B_1 \cap C| \ge 2$. Observe that if $B_1$ straddles C in at least one of T and $T'$, then (i) the taxa in $C {\setminus } B_1$ are all singleton components in F and (ii) $B_1$ actually straddles C in both T and $T'$ (because $B_1 \cap C$ is not pendant in $B_1$). This situation is illustrated in Fig. 7(i). We remove $B_1$, discard all the singleton components $\{x\}$ such that $x \in C {\setminus } B_1$, then add the component $B_1 \cup C$. This is a valid agreement forest because the (at least) two taxa in $B_1 \cap C$, combined with the fact that $B_1$ straddles C in both trees, ensure that $T | (B_1 \cup C) = T' | (B_1 \cup C)$. Noting that $| C {\setminus } B_1 | \ge 1$ (because $|J| \ge 2$), we thus obtain a smaller agreement forest and thus a contradiction to the optimality of F. Continuing, suppose that $B_1$ straddles C in neither T nor $T'$; this situation is illustrated in Fig. 7(ii). Informally this means that in both T and $T'$ the inside-outside component $B_1$ enters the chain from only one side. We replace $B_1$ with $B_1 {\setminus } C$, delete all components $B_i \subseteq C$ (there is at least one such component, because $|J| \ge 2$ and $B_1$ is the only inside-outside component), and introduce component C. The overall size of the agreement forest does not increase. This cannot split any previously preserved eligible chain D, because any such chain D would have taxa in both C and $B_1 {\setminus } C$, yielding $T[C] \cap T[D] \ne \emptyset $ and a contradiction to Observation 2.
1.2.2.
Suppose that $|B_1 \cap C| = 1$. Let x be the unique taxon in $B_1 \cap C$. Suppose $B_1$ straddles C in at least one of T and $T'$; assume without loss of generality that it is in T (see Fig. 8(i)). Then the $\ge 2$ taxa in $C {\setminus } \{x\}$ must be singleton components in F. We delete $B_1$, delete the at least 2 singleton components formed by taxa in $C {\setminus } \{x\}$, and introduce components $L_T(C), R_T(C), C$. This does not increase the number of components in the agreement forest, so it is still a maximum agreement forest. As usual, the only way that a previously preserved eligible chain could be split is if it contains taxa from both $L_T(C)$ and $R_T(C)$, which contradicts Observation 2. Finally, suppose that $B_1$ straddles C in neither T nor $T'$ (see Fig. 8(ii)). We replace $B_1$ with $B_1 {\setminus } C$, delete all components $B_i \subseteq C$ (there will be at least one such component because $B_1$ is the only inside-outside component with respect to C), and introduce C. The overall size of the agreement forest does not increase. This cannot split any previously preserved eligible chain D, because then $T[D] \cap T[C] \ne \emptyset $, contradicting Observation 2.
2.
C is pendant in at least one of T and $T'$. In this case, we have $|C|\ge 2$. Assume without loss of generality that C is pendant in $T'$. Recall from Observation 8(e) that F contains at most one inside-outside component with respect to C, at most one bypass component with respect to C, and that from Observations 8(b) and (c) at most one of these two situations can hold. Suppose that $B \in F$ is a bypass component with respect to C; this is necessarily in T, since B cannot be a bypass component in $T'$ with respect to C (due to pendancy). Figure 9(i) illustrates this situation. By Observation 8(b), C is atomized. Now, consider the construction in Case 1.1. Here we only have one bypass component to split, but on the other hand we are only allowed to use the weaker assumption $|C| \ge 2$. These two cancel each other out, so Case 1.1 still goes through. So henceforth we can assume that there are no bypass components with respect to C. If there are no inside–outside components with respect to C then replacing the components in J (which are all subsets of C) with C reduces the overall number of components, because $|J| \ge 2$, immediately yielding a contradiction to the optimality of F. So let $B \in F$ be the unique inside-outside component in J. If $|B \cap C| \ge 2$ (note that if $|C|=2$ then this cannot happen, because it would imply $|J|=1$) then, due to the pendancy of C in $T'$, B straddles C in neither T nor $T'$; this is the situation shown in Fig. 9(ii). (The fact that B does not straddle C in $T'$ is automatic, since pendant chains cannot be straddled, by definition. The same holds if C is pendant in T. If C is not pendant in T, observe that if B straddled C in T, then $B \cap C$ would not be pendant in T|B. However, the fact that C is pendant in $T'$ means that $B \cap C$ must be pendant in $T'|B$. Taken together we would have $T|B \ne T'|B$, contradicting the assumption that B is a component of an agreement forest of T and $T'$.) 
We replace B with $B {\setminus } C$, delete all components $B_i \subseteq C$ (there must be at least one such component), and introduce C as a component. Thus, the overall size of the agreement forest does not increase. This cannot split any previously preserved eligible chain D, because (by the usual argument) any such chain D would have taxa in both C and $B{\setminus } C$, and this contradicts Observation 2. So assume that $|B \cap C| = 1$. If $|C| \ge 3$, Case 1.2.2 goes through unchanged. If $|C|=2$, then Case 1.2.2 mostly still holds, except for one situation: when $B_1$ straddles C in (say) T. This is illustrated in Fig. 9(iii). The problem here is that $C {\setminus } \{x\}$ (where x is as defined in that case) contains only one taxon, so the construction would introduce three components while deleting only two. However, we can modify the argument as follows. Let y be the unique taxon in C that is not equal to x. We delete B and $\{y\}$, and introduce the components $L_T(C) \cup \{x,y\}$ and $R_T(C)$. Two components are removed and two are added, so the agreement forest does not increase in size. A previously preserved eligible chain D cannot be split by this modification, since it would imply that D contains taxa from both $L_T(C)$ and $R_T(C)$, yielding the usual contradiction to Observation 2.

$\square $
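
The component counting used in the last two modifications of the proof can be checked mechanically. The following is a minimal sketch, not part of the original argument: it treats a forest purely as a partition of the taxon set and verifies that a swap of components keeps it a partition without increasing its size. The helper `swap_components`, the taxon names, and the concrete sets chosen for $L_T(C)$ and $R_T(C)$ are all hypothetical, and the sketch does not test whether the components actually form an agreement forest of T and $T'$.

```python
from typing import FrozenSet, Iterable, Set


def swap_components(
    forest: Iterable[Iterable[str]],
    removed: Iterable[Iterable[str]],
    added: Iterable[Iterable[str]],
) -> Set[FrozenSet[str]]:
    """Replace `removed` components by `added` ones (hypothetical helper).

    Only partition bookkeeping is checked: the added components must be
    non-empty, pairwise disjoint, and cover exactly the taxa that were freed.
    """
    forest_set = {frozenset(c) for c in forest}
    removed_set = {frozenset(c) for c in removed}
    added_list = [frozenset(c) for c in added]
    if not removed_set <= forest_set:
        raise ValueError("a removed component is not in the forest")
    freed = set().union(*removed_set)
    covered: Set[str] = set()
    for comp in added_list:
        if not comp or covered & comp:
            raise ValueError("added components must be non-empty and disjoint")
        covered |= comp
    if covered != freed:
        raise ValueError("added components must cover exactly the freed taxa")
    return (forest_set - removed_set) | set(added_list)


# Assumed toy data: taxa of B to the left and right of the chain C in T.
left, right = {"a", "b"}, {"c", "d"}

# |C| = 2 with C = {x, y}: delete B and {y}, add L_T(C) + {x, y} and R_T(C).
before_2 = [left | {"x"} | right, {"y"}, {"e", "f"}]
after_2 = swap_components(
    before_2,
    removed=[left | {"x"} | right, {"y"}],
    added=[left | {"x", "y"}, right],
)

# |C| = 3 with C = {x, y, z}: delete B_1, {y}, {z}; add L_T(C), R_T(C), C.
before_3 = [left | {"x"} | right, {"y"}, {"z"}, {"e", "f"}]
after_3 = swap_components(
    before_3,
    removed=[left | {"x"} | right, {"y"}, {"z"}],
    added=[left, right, {"x", "y", "z"}],
)
```

In both toy instances the number of components is unchanged. Applying the $|C| \ge 3$ construction to a chain of length 2 would add three components while removing only two, which is why the proof switches to the modified construction in that case.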

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Kelk, S., Linz, S. New Reduction Rules for the Tree Bisection and Reconnection Distance. Ann. Comb. 24, 475–502 (2020). https://doi.org/10.1007/s00026-020-00502-7

Received: 26 July 2019
Accepted: 04 June 2020
Published: 01 July 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s00026-020-00502-7
