1 Introduction

A phylogenetic tree is a tree whose leaves are bijectively labelled by a set of species (or, more generically, a set of taxa) X [13]. These trees are ubiquitous in the systematic study of evolution: the leaves represent contemporary species and the internal vertices of the tree represent hypothetical common ancestors. Over the years many techniques have been developed for inferring phylogenetic trees from (incomplete) biological data and under a range of different objective functions [7]. Here we are not concerned with inferring phylogenetic trees, but rather with quantifying the “distance” between two phylogenetic trees. Such a goal is well-motivated, since different methodologies for inferring phylogenetic trees sometimes yield trees with differing topologies, and reticulate evolutionary phenomena such as hybridization can cause different genes in the same genome to have different evolutionary histories [11].

We focus on the Tree Bisection and Reconnection (TBR) distance, which is NP-hard to compute [1, 10]. Informally, the TBR distance between two trees T and \(T'\), denoted \(d_\mathrm{TBR}(T,T')\), is the minimum number of topological rearrangement moves that need to be applied to transform T into \(T'\), where such a move involves detaching a subtree and attaching it elsewhere. It was proven in 2001 [1] that the question “Is \(d_\mathrm{TBR}(T,T') \le k?\)” can be answered in time \(f(k) \cdot \text {poly}(|X|)\), where f is a computable function that depends only on k. In other words, the problem is fixed-parameter tractable [6]. Specifically, the authors proved that the two polynomial-time subtree and chain reduction rules preserve the TBR distance and reduce the number of taxa to at most \(28 \cdot d_\mathrm{TBR}(T,T')\) for any two unrooted phylogenetic trees T and \(T'\). The reduced instance, known as a kernel, can then be solved with any exact algorithm, yielding the \(f(k) \cdot \text {poly}(|X|)\) running time [9]. The analysis in [1] made heavy use of a powerful abstraction known as an agreement forest, which in a nutshell partitions the two trees into smaller, non-overlapping fragments that have the same topology in both trees (see e.g. [14, 16] for overviews of algorithmic results and [2] for extremal results). Via a different technique, based on bounded search trees, running times of \(O( 4^k \cdot \text {poly}(|X|) )\) [16] and then \(O( 3^k \cdot \text {poly}(|X|) )\) were later obtained [5]. However, the question remained whether a kernel with fewer than \(28 \cdot d_\mathrm{TBR}(T,T')\) taxa could be obtained.

Recently, in [12], it was shown that the subtree and chain reduction rules actually reduce the instance to size \(15\cdot d_\mathrm{TBR}(T,T')-9\), and that there are instances for which this bound is tight. Interestingly, the sharpened analysis does not leverage agreement forests at all, but instead recasts the computation of TBR distance as a phylogenetic network inference problem, where phylogenetic networks are essentially the generalization of phylogenetic trees to graphs. Namely, the TBR distance of T and \(T'\) is equal to the minimum value of \(|E|-(|V|-1)\), ranging over all phylogenetic networks \(N=(V,E)\) that embed T and \(T'\) [15]. The backbone topology of such minimal networks can be represented by unrooted generators, and when viewed from this static perspective it becomes much easier to analyze the role of common chains in the trees.

In the present article, we combine the agreement forest perspective of [1] with the network/generator perspective of [12], to obtain a new suite of five polynomial-time reduction rules. When applied alongside the subtree and chain reduction rules, these reduce the size of the kernel to \(11\cdot d_\mathrm{TBR}(T,T')-9\). To leverage agreement forests, we first prove a general theorem which states that, given any disjoint set of common chains, there exists an optimal agreement forest in which all the chains are simultaneously preserved, i.e. none of the chains are divided across two or more components of the forest. Crucially, this also holds for chains containing only 2 taxa, as long as the two taxa in the chain have a common parent in at least one of the trees. Such very short chains have not received a lot of attention in the literature, since the standard chain reduction rule, which truncates long common chains to length 3, does not always preserve the TBR distance if the chains are truncated to length 2. Nevertheless, the weaker “simultaneous preservation” property that we prove in this article (see Theorem 5), and which we believe to be of independent interest, turns out to be quite powerful when combined with networks/generators. The fact that chains are preserved allows us to determine specific situations when it is actually safe to reduce a chain to length 2 (and sometimes to length 1), or even to identify an entire component of an optimal agreement forest (which can then be deleted, reducing the TBR distance by exactly 1). These insights directly inspire the new reduction rules presented in this article. To the best of our knowledge these new reduction rules are the first reduction rules for a phylogenetic distance problem which strictly improve upon the reductive power of the subtree and chain reduction rules. Other reduction rules, such as the cluster reduction [3, 4], tend to be very effective in practice, but do not yield improved (i.e. smaller) bounds on kernel size [12].

After presenting the main results, we show a family of tight examples, i.e. tree pairs that, after applying all seven reduction rules, have exactly \(11\cdot d_\mathrm{TBR}(T,T')-9\) taxa. We then conclude with a short reflection on potential avenues for further improving the \(11\cdot d_\mathrm{TBR}(T,T')-9\) bound, and discuss a number of insights flowing from our analysis which might be useful when considering non-kernelization approaches for computing the TBR distance.

2 Preliminaries

Throughout this paper, X denotes a finite set (of taxa) with \(|X|\ge 4\).

Unrooted phylogenetic trees and networks: An unrooted binary phylogenetic network N on X is a simple, connected, and undirected graph whose leaves are bijectively labeled with X and whose other vertices all have degree 3. Let E and V be the edge and vertex set of N, respectively. We define the reticulation number of N as the number of edges in E that need to be deleted from N to obtain a spanning tree. More formally, we have \(r(N) = |E|-(|V|-1)\). If \(r(N)=0\), then N is called an unrooted binary phylogenetic tree on X. An example of two unrooted binary phylogenetic trees is shown in Fig. 1. Now, let T be an unrooted binary phylogenetic tree on X. Two leaves, say a and b, of T are called a cherry \(\{a,b\}\) of T if they are adjacent to a common vertex. We say that a vertex v is the (unique) parent of a leaf a in N if v is adjacent to a. For \(X' \subset X\), we write \(T[X']\) to denote the unique, minimal subtree of T that connects all elements in \(X'\). For brevity we call \(T[X']\) the embedding of \(X'\) in T. If \(X''\) is also a subset of X, we denote by \(T[X']\cap T[X'']\) the set of vertices in T that are contained in \(T[X']\) and \(T[X'']\). Furthermore, we refer to the unrooted phylogenetic tree on \(X'\) obtained from \(T[X']\) by suppressing degree-2 vertices as the restriction of T to \(X'\) and we denote this by \(T|X'\).
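The restriction \(T|X'\) can be made concrete with a small sketch. The adjacency-map representation, the helper names `make_tree` and `restrict`, and the five-taxon example tree are assumptions made for illustration only; they are not taken from the paper.

```python
def make_tree(edges):
    """Build an adjacency map {vertex: set_of_neighbours} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def restrict(tree, keep):
    """Return T|keep: prune all leaves outside `keep`, then suppress degree-2 vertices."""
    adj = {v: set(nb) for v, nb in tree.items()}  # work on a copy
    keep = set(keep)
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if v in keep:
                continue
            degree = len(adj[v])
            if degree <= 1:
                # v is a leaf (or isolated vertex) that is not wanted: delete it
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
                changed = True
            elif degree == 2:
                # v lies in the middle of a path: suppress it
                u, w = adj[v]
                adj[u].discard(v)
                adj[w].discard(v)
                adj[u].add(w)
                adj[w].add(u)
                del adj[v]
                changed = True
    return adj


# Hypothetical caterpillar tree on X = {a, b, c, d, e}; internal vertices are 1, 2, 3.
T = make_tree([("a", 1), ("b", 1), (1, 2), ("c", 2), (2, 3), ("d", 3), ("e", 3)])
T_restricted = restrict(T, {"a", "b", "c", "e"})
```

In this example, deleting d leaves vertex 3 with degree 2, so it is suppressed and e becomes adjacent to vertex 2.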

Fig. 1 Two unrooted binary phylogenetic trees on \(X=\{a,b,c,d,e\}\)

Let T be an unrooted binary phylogenetic tree on X. A quartet is an unrooted binary phylogenetic tree with exactly four leaves. For example, if \(\{a,b,c,d\}\subseteq X\), we say that ab|cd is a quartet of T if the path from a to b does not intersect the path from c to d. Note that, if ab|cd is not a quartet of T, then either ac|bd or ad|bc is a quartet of T.
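One possible way to determine which quartet a tree induces on four leaves is the four-point condition: with unit edge lengths, ab|cd is a quartet of T exactly when \(d(a,b)+d(c,d)\) is strictly smaller than the other two pairwise sums. The sketch below assumes an adjacency-map representation; the function names and the example tree are hypothetical.

```python
from collections import deque


def distances_from(tree, source):
    """Breadth-first search distances (in edges) from `source` to every vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in tree[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def quartet(tree, a, b, c, d):
    """Return the quartet of `tree` on {a, b, c, d} as a set of two leaf pairs."""
    dist = {x: distances_from(tree, x) for x in (a, b, c, d)}
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    # In a binary tree exactly one pairing minimises the sum of the two path lengths.
    (p, q), (r, s) = min(
        pairings, key=lambda pr: dist[pr[0][0]][pr[0][1]] + dist[pr[1][0]][pr[1][1]]
    )
    return frozenset({frozenset({p, q}), frozenset({r, s})})


# Hypothetical caterpillar tree on {a, b, c, d, e}; internal vertices are 1, 2, 3.
T = {
    "a": {1}, "b": {1}, "c": {2}, "d": {3}, "e": {3},
    1: {"a", "b", 2}, 2: {1, "c", 3}, 3: {2, "d", "e"},
}
ab_ce = quartet(T, "a", "b", "c", "e")
```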

Finally, let N be an unrooted binary phylogenetic network on X and let T be an unrooted binary phylogenetic tree on X. We say that N displays T, if T can be obtained from a subtree of N by suppressing degree-2 vertices.

Tree bisection and reconnection: Let T be an unrooted binary phylogenetic tree on X. Apply the following three-step operation to T:

  1. Delete an edge in T and suppress any resulting degree-2 vertex. Let \(T_1\) and \(T_2\) be the two resulting unrooted binary phylogenetic trees.

  2. If \(T_1\) (resp. \(T_2\)) has at least one edge, subdivide an edge in \(T_1\) (resp. \(T_2\)) with a new vertex \(v_1\) (resp. \(v_2\)) and otherwise set \(v_1\) (resp. \(v_2\)) to be the single isolated vertex of \(T_1\) (resp. \(T_2\)).

  3. Add a new edge \(\{v_1,v_2\}\) to obtain a new unrooted binary phylogenetic tree \(T'\) on X.

We say that \(T'\) has been obtained from T by a single tree bisection and reconnection (TBR) operation. An example of a TBR operation is illustrated in Fig. 2. Furthermore, we define the TBR distance between two unrooted binary phylogenetic trees T and \(T'\) on X, denoted by \(d_\mathrm{TBR}(T,T')\), to be the minimum number of TBR operations that are required to transform T into \(T'\). It is well known that \(d_\mathrm{TBR}\) is a metric [1]. By building on an earlier result by Hein et al. [10, Theorem 8], Allen and Steel [1] showed that computing the TBR distance is an NP-hard problem.

Fig. 2 A single TBR operation that transforms T into \(T'\). First, \(T_1\) and \(T_2\) are obtained from T by deleting the edge \(\{u_1,u_2\}\) in T. Second, \(T'\) is obtained from \(T_1\) and \(T_2\) by subdividing an edge in both trees as indicated by the open circles \(v_1\) and \(v_2\) and adding a new edge \(\{v_1,v_2\}\)

Unrooted minimum hybridization: In [15], it was shown that computing the TBR distance for a pair of unrooted binary phylogenetic trees T and \(T'\) on X is equivalent to computing the minimum number of extra edges required to simultaneously explain T and \(T'\). More precisely, we set

$$\begin{aligned} h^u(T, T') = \min _ N\{r(N)\}, \end{aligned}$$

where the minimum is taken over all unrooted binary phylogenetic networks N on X that display T and \(T'\). The value \(h^u(T, T')\) is known as the (unrooted) hybridization number of T and \(T'\) [15].
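The quantity \(r(N)\) that is minimized here is straightforward to evaluate for a given network. The following is a minimal sketch, assuming that N is connected and simple and is given as a list of edges in which every edge appears once; the function name and the two example graphs are hypothetical.

```python
def reticulation_number(edges):
    """Return r(N) = |E| - (|V| - 1) for a connected simple graph given by its edges."""
    vertices = {v for edge in edges for v in edge}
    return len(edges) - (len(vertices) - 1)


# A tree on {a, b, c, d, e}: no edge has to be deleted to obtain a spanning tree.
tree_edges = [("a", 1), ("b", 1), (1, 2), ("c", 2), (2, 3), ("d", 3), ("e", 3)]

# A network on {a, b, c, d}: a 4-cycle 1-2-3-4 with one leaf attached to each cycle vertex.
network_edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, "a"), (2, "b"), (3, "c"), (4, "d")]

r_tree = reticulation_number(tree_edges)
r_network = reticulation_number(network_edges)
```

Computing \(h^u(T,T')\) itself requires minimizing over all networks that display both trees, which this sketch does not attempt.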

The aforementioned equivalence is given in the next theorem that was established in [15, Theorem 3].

Theorem 1

Let T and \(T'\) be two unrooted binary phylogenetic trees on X. Then

$$\begin{aligned} d_\mathrm{TBR}(T,T')=h^u(T,T'). \end{aligned}$$

Subtrees and chains: Let N be an unrooted binary phylogenetic network on X. A pendant subtree of N is an unrooted binary phylogenetic tree on a proper subset of X that can be obtained from N by deleting a single edge. For \(n\ge 1\), let \(C=(\ell _1,\ell _2,\ldots ,\ell _n)\) be a sequence of distinct taxa in X and, for each \(i\in \{1,2,\ldots ,n\}\), let \(p_i\) denote the unique parent of \(\ell _i\) in N. We call C an n-chain of N, where n is referred to as the length of C, if there exists a walk \(p_1,p_2,\ldots ,p_n\) in N and the elements in \(\{p_2,p_3,\ldots ,p_{n-1}\}\) are all pairwise distinct. Hence \(\ell _1\) and \(\ell _2\) may have a common parent (i.e. \(p_1 = p_2\)) and/or \(\ell _{n-1}\) and \(\ell _{n}\) may have a common parent in N (i.e. \(p_{n-1} = p_n\)). Note that, if one of \(p_1 = p_2\) and \(p_{n-1}=p_n\) holds, then C is pendant in N. Since we require that X contains at least four elements, note that a 3-chain of N cannot consist of three leaves that all have the same parent. Furthermore, by definition, each element in X is a chain of length 1 in N. To ease reading, we sometimes write C to denote the set \(\{\ell _1,\ell _2,\ldots ,\ell _n\}\). It will always be clear from the context whether C refers to the associated sequence or set of leaves.
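The chain definition can be checked mechanically. The sketch below follows the definition literally for a tree given as an adjacency map: consecutive parents must be adjacent, with equality permitted only for the first and the last pair, and the interior parents must be pairwise distinct. The names `parent` and `is_chain` and the example tree are illustrative assumptions.

```python
def parent(tree, leaf):
    """Return the unique neighbour of a leaf."""
    (p,) = tree[leaf]
    return p


def is_chain(tree, seq):
    """Check whether the sequence of leaves `seq` is an n-chain of `tree`."""
    n = len(seq)
    if len(set(seq)) != n:
        return False  # the taxa of a chain must be distinct
    parents = [parent(tree, leaf) for leaf in seq]
    for i in range(n - 1):
        adjacent = parents[i + 1] in tree[parents[i]]
        same = parents[i] == parents[i + 1]
        at_an_end = i == 0 or i == n - 2  # p_1 = p_2 or p_{n-1} = p_n is allowed
        if not (adjacent or (same and at_an_end)):
            return False
    interior = parents[1:n - 1]
    return len(set(interior)) == len(interior)


# Hypothetical caterpillar tree on {a, b, c, d, e}; internal vertices are 1, 2, 3.
T = {
    "a": {1}, "b": {1}, "c": {2}, "d": {3}, "e": {3},
    1: {"a", "b", 2}, 2: {1, "c", 3}, 3: {2, "d", "e"},
}
```

For this tree, (b, c, d) and (a, b, c, d, e) are chains, whereas (a, d) is not, because the parents of a and d are neither equal nor adjacent.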

If a pendant subtree S (resp. chain C) exists in two unrooted binary phylogenetic trees T and \(T'\), we say that S (resp. C) is a common subtree (resp. chain) of T and \(T'\). To illustrate, the 3-chain (bcd) is a common chain of the two phylogenetic trees shown in Fig. 1. Moreover, if C is a common n-chain of T and \(T'\), reducing C to a chain of length k with \(1\le k< n\) yields the two new trees \(T_r = T|X {\setminus } \{ \ell _{k+1}, \ldots , \ell _{n} \}\) and \(T'_r = T'|X {\setminus } \{ \ell _{k+1}, \ldots , \ell _{n} \}\).

We will later make use of the following simple, but fundamental, observation, which holds due to the definition of an n-chain and the fact that we are working with unrooted binary trees.

Observation 2

Let T be an unrooted binary phylogenetic tree on X, and let \(\{C_1,C_2,\ldots ,C_m\}\) be a set of mutually taxa-disjoint chains of T. Then the embeddings

$$\begin{aligned} \{T[C_1],T[C_2],\ldots ,T[C_m]\} \end{aligned}$$

are mutually vertex disjoint in T.

For two unrooted binary phylogenetic trees T and \(T'\), we next state two reduction rules that were first introduced in [1] and were crucial in establishing fixed-parameter tractability of computing the TBR distance (see Sect. 3 for details).

Subtree reduction: Replace a maximal pendant subtree with at least two leaves that is common to T and \(T'\) by a single leaf with a new label.

Chain reduction: Reduce a maximal n-chain with \(n\ge 4\) that is common to T and \(T'\) to a chain of length three.

The next lemma shows that the subtree and chain reduction are both TBR-preserving, i.e. applying one of the two reductions to T and \(T'\) results in a pair of new trees whose TBR distance is equal to \(d_\mathrm{TBR}(T,T')\).

Lemma 1

Let T and \(T'\) be two unrooted binary phylogenetic trees on X, and let \(T_r\) and \(T_r'\) be two trees obtained from T and \(T'\), respectively, by applying a single subtree or chain reduction. Then \(d_\mathrm{TBR}(T,T')=d_\mathrm{TBR}(T_r,T_r')\).

Finally, if T and \(T'\) cannot be reduced any further under the subtree (resp. chain) reduction, we say that T and \(T'\) are subtree (resp. chain) reduced.

Agreement forests: Let T and \(T'\) be two unrooted binary phylogenetic trees on X. Furthermore, let \(F = \{B_0, B_1,B_2,\ldots ,B_k\}\) be a partition of X, where each block \(B_i\) with \(i\in \{0,1,2,\ldots ,k\}\) is referred to as a component of F. We say that F is an agreement forest for T and \(T'\) if the following hold.

  1. For each \(i\in \{0,1,2,\ldots ,k\}\), we have \(T|B_i = T'|B_i\).

  2. For each pair \(i,j\in \{0,1,2,\ldots ,k\}\) with \(i \ne j\), we have that \(T[B_i]\) and \(T[B_j]\) are vertex-disjoint in T, and \(T'[B_i]\) and \(T'[B_j]\) are vertex-disjoint in \(T'\).

Let \(F=\{B_0,B_1,B_2,\ldots ,B_k\}\) be an agreement forest for T and \(T'\). The size of F is simply its number of components; i.e. \(k+1\). Moreover, an agreement forest with the minimum number of components (over all agreement forests for T and \(T'\)) is called a maximum agreement forest for T and \(T'\). The number of components of a maximum agreement forest for T and \(T'\) is denoted by \(d_\mathrm{MAF}(T,T')\). The following theorem is well known.

Theorem 3

[1, Theorem 2.13] Let T and \(T'\) be two unrooted binary phylogenetic trees on X. Then

$$\begin{aligned} d_\mathrm{TBR}(T,T') = d_\mathrm{MAF}(T,T') - 1. \end{aligned}$$

Again, let F be an agreement forest for two unrooted binary phylogenetic trees T and \(T'\) on X, and let \(C=(\ell _1,\ell _2,\ldots ,\ell _n)\) be a common n-chain of T and \(T'\). We say that C is split in F if there exist (at least) two components, say \(B_j\) and \(B_{j'}\), with \(j\ne j'\) such that \(B_j\cap C\ne \emptyset \) and \(B_{j'}\cap C\ne \emptyset \). Furthermore, we say that C is atomized in F if each taxon \(\ell _i \in C\) with \(i\in \{1,2,\ldots ,n\}\) occurs as a singleton component \(\{\ell _i\}\) in F. Finally, we say that C is preserved in F if there exists a component \(B_j \in F\) such that \(C \subseteq B_j\). Taking the last three definitions together we have that C is split in F if and only if it is not preserved in F.
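The two conditions in the definition of an agreement forest can be verified directly for a candidate partition. The sketch below is one possible check, not a method from the paper: it tests \(T|B_i = T'|B_i\) by comparing the quartets induced on every four leaves of a block (a binary phylogenetic tree is determined by its quartets), and it computes embeddings as unions of paths. All function names and both example trees are assumptions made for illustration.

```python
from collections import deque
from itertools import combinations


def bfs(tree, source):
    """Return (distance, predecessor) maps of a breadth-first search from `source`."""
    dist, pred = {source: 0}, {source: None}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in tree[v]:
            if u not in dist:
                dist[u], pred[u] = dist[v] + 1, v
                queue.append(u)
    return dist, pred


def embedding(tree, block):
    """Vertex set of T[block]: the union of the paths from one leaf to all others."""
    leaves = sorted(block)
    _, pred = bfs(tree, leaves[0])
    vertices = set()
    for leaf in leaves:
        v = leaf
        while v is not None:  # walk back towards leaves[0]
            vertices.add(v)
            v = pred[v]
    return vertices


def quartet(tree, four):
    """Quartet induced on four leaves, found via the four-point condition."""
    a, b, c, d = four
    dist = {x: bfs(tree, x)[0] for x in four}
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    (p, q), (r, s) = min(
        pairings, key=lambda pr: dist[pr[0][0]][pr[0][1]] + dist[pr[1][0]][pr[1][1]]
    )
    return frozenset({frozenset({p, q}), frozenset({r, s})})


def is_agreement_forest(t1, t2, forest):
    """Check whether `forest` (a list of leaf sets) is an agreement forest for t1 and t2."""
    taxa = sorted(v for v, nb in t1.items() if len(nb) == 1)
    if sorted(x for block in forest for x in block) != taxa:
        return False  # the blocks do not partition X
    for block in forest:  # condition 1: equal restrictions
        for four in combinations(sorted(block), 4):
            if quartet(t1, four) != quartet(t2, four):
                return False
    for tree in (t1, t2):  # condition 2: vertex-disjoint embeddings
        embeddings = [embedding(tree, block) for block in forest]
        if any(e1 & e2 for e1, e2 in combinations(embeddings, 2)):
            return False
    return True


# Two hypothetical caterpillar trees on {a, b, c, d, e}.
T1 = {"a": {1}, "b": {1}, "c": {2}, "d": {3}, "e": {3},
      1: {"a", "b", 2}, 2: {1, "c", 3}, 3: {2, "d", "e"}}
T2 = {"e": {1}, "b": {1}, "c": {2}, "d": {3}, "a": {3},
      1: {"e", "b", 2}, 2: {1, "c", 3}, 3: {2, "d", "a"}}
```

For these two trees, the partition \(\{\{a\},\{e\},\{b,c,d\}\}\) passes the check, whereas \(\{\{a\},\{b,c,d,e\}\}\) fails condition 1 and \(\{\{a,b\},\{c,d,e\}\}\) fails condition 2 in the second tree.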

3 A New Suite of Reduction Rules

In 2001, Allen and Steel [1] showed that computing the TBR distance between two unrooted binary phylogenetic trees T and \(T'\) is fixed-parameter tractable, when parameterized by \(k=d_\mathrm{TBR}(T,T')\). To this end, they established a linear kernel of size at most 28k. Recently, this result was improved by Kelk and Linz [12] who showed with a new analysis that the following superior bound actually holds.

Theorem 4

Let T and \(T'\) be two subtree and chain reduced unrooted binary phylogenetic trees on X. If \(d_\mathrm{TBR}(T,T')\ge 2\), then \(|X|\le 15 d_\mathrm{TBR}(T,T')-9\).

Noting that the subtree and chain reduction can be applied in time that is polynomial in the size of X, it immediately follows that computing the TBR distance is fixed-parameter tractable when parameterized by k. In what follows, we will develop five new reduction rules that complement the subtree and chain reduction as introduced by Allen and Steel [1] and further improve the \(15\cdot d_\mathrm{TBR}(T,T') -9\) bound.

Although not directly addressed by [1], reducing a chain of length 3 to length 2 can strictly lower the TBR distance, which is why their chain reduction only allows reductions to length 3. An explicit example of this phenomenon is given by the two phylogenetic trees that are shown in Fig. 1. The TBR distance of these two trees is 2. However, if we reduce the common 3-chain (bcd) to the 2-chain (bc), we obtain the two quartets ab|ce and eb|ca whose TBR distance is 1. Nevertheless, as we shall see, there do exist special circumstances when common 3-chains can be further reduced without altering the TBR distance. This is an important building block of our new reduction rules which ultimately will yield a kernel of size at most \(11\cdot d_\mathrm{TBR}(T,T')-9\). To obtain this bound, we combine the generator approach introduced in [12] with a careful analysis of agreement forests.

The next theorem, which is formally established in the appendix to avoid disrupting the flow of the paper, will repeatedly be used in establishing our new kernel result.

Theorem 5

Let T and \(T'\) be two unrooted binary phylogenetic trees on X. Let K be an (arbitrary) set of mutually taxa-disjoint chains that are common to T and \(T'\). Then there exists a maximum agreement forest \(F'\) of T and \(T'\) such that

  1. every n-chain in K with \(n\ge 3\) is preserved in \(F'\), and

  2. every 2-chain in K that is pendant in at least one of T and \(T'\) is preserved in \(F'\).

Theorem 5 is somewhat more general than we need in this article (for us, \(|K|\le 2\) is sufficient), but we include full details because we consider it to be of independent interest and anticipate future applications beyond this article. Furthermore, we remark that a weaker version of Theorem 5 was already presented in [1], where it forms the foundation of the proof of Lemma 1. However, the authors of [1] did not prove any results about chains of length 2, and their proof mainly focuses on the case of a single common chain of length 3 that is pendant in neither of the two trees.

Throughout the next four subsections, we detail five new reduction rules. We will see that, similar to Lemma 1, each of these new reductions either preserves the TBR distance or reduces the parameter by 1. To simplify the exposition, we assume that the new reductions are always applied to two unrooted binary phylogenetic trees that are subtree and chain reduced. Finally, while the reduction names appear to be cryptic, they will be further explained in Sect. 4, where we tie the new reductions and a careful unrooted generator analysis together to establish an improved kernel result. A generic example for each of the five new reductions is shown in Fig. 3.

3.1 \((*,3,*)\)-Reduction

Let T and \(T'\) be two unrooted binary phylogenetic trees on X, and let \(C = (a,b,c)\) be a 3-chain that is common to T and \(T'\). For example, C may be the result of a previously applied chain reduction. If T has cherry \(\{b,c\}\) and \(T'\) has cherry \(\{a,b\}\), then a \((*,3,*)\)-reduction is the operation of deleting a, b, and c from T and \(T'\). Formally, we set \(T_r = T|X {\setminus } C\) and \(T'_r = T'|X {\setminus } C\).

Lemma 2

Let T and \(T'\) be two unrooted binary phylogenetic trees on X, and let \(T_r\) and \(T_r'\) be two trees obtained from T and \(T'\), respectively, by a single application of the \((*,3,*)\)-reduction. Then \(d_\mathrm{TBR}(T_r, T'_r) = d_\mathrm{TBR}(T,T') - 1\).

Proof

Without loss of generality, we establish the lemma using the same notation as in the definition of a \((*,3,*)\)-reduction. Let \(F_r\) be a maximum agreement forest for \(T_r\) and \(T_r'\), and let F be a maximum agreement forest for T and \(T'\). Then \(F_r\cup \{\{a,b,c\}\}\) is an agreement forest for T and \(T'\) which implies that \(|F_r| \ge |F|-1\). Hence,

$$\begin{aligned} {d_\mathrm{TBR}(T_r, T'_r) = |F_r|-1 \ge |F|-2 = d_\mathrm{TBR}(T,T') - 1.} \end{aligned}$$

By Theorem 5, we may assume that C is preserved in F. (Formally, we apply the theorem to the set \(K=\{ C \}\).) Hence, there exists a component \(B_{abc}\) in F such that \(C\subseteq B_{abc}\). Towards a contradiction, assume that \(C\subset B_{abc}\) and let \(x\in B_{abc} {\setminus } C\). Then, as \(\{b,c\}\) is a cherry in T and \(\{a,b\}\) is a cherry of \(T'\), it follows that bc|ax is a quartet of \(T|B_{abc}\) and ab|cx is a quartet of \(T'|B_{abc}\); thereby contradicting that F is an agreement forest for T and \(T'\). Now, since \(B_{abc}=C\), we have that \(F{\setminus } \{B_{abc}\}\) is an agreement forest for \(T_r\) and \(T_r'\). Hence, \(|F_r| \le |F|-1\), which yields

$$\begin{aligned} d_\mathrm{TBR}(T, T') -1= |F|-2 \ge |F_r|-1 = d_\mathrm{TBR}(T_r,T_r'). \end{aligned}$$

\(\square \)
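Testing whether a \((*,3,*)\)-reduction applies amounts to a local search around cherries. The sketch below is one possible way to do this for trees stored as adjacency maps: it looks for a leaf b that forms a cherry \(\{b,c\}\) in T and a cherry \(\{a,b\}\) in \(T'\), and then checks that (a, b, c) is a chain in both trees. The function names and the two example trees are hypothetical.

```python
def parent(tree, leaf):
    """Return the unique neighbour of a leaf."""
    (p,) = tree[leaf]
    return p


def leaf_set(tree):
    """All vertices of degree 1."""
    return {v for v, nb in tree.items() if len(nb) == 1}


def cherry_partners(tree, leaf):
    """Leaves that share their parent with `leaf`."""
    return (tree[parent(tree, leaf)] & leaf_set(tree)) - {leaf}


def find_star_3_star(t1, t2):
    """Return (a, b, c) to which the (*,3,*)-reduction applies, or None."""
    for b in sorted(leaf_set(t1)):
        for c in sorted(cherry_partners(t1, b)):      # {b, c} is a cherry of T
            for a in sorted(cherry_partners(t2, b)):  # {a, b} is a cherry of T'
                if a == c:
                    continue  # a common cherry, handled by the subtree reduction
                # (a, b, c) is a chain of T if the parent of a is adjacent to the
                # common parent of b and c; symmetrically for c in T'.
                chain_in_t1 = parent(t1, a) in t1[parent(t1, b)]
                chain_in_t2 = parent(t2, c) in t2[parent(t2, b)]
                if chain_in_t1 and chain_in_t2:
                    return (a, b, c)
    return None


# Hypothetical trees on {a, b, c, d, e}: T has cherry {b, c}, T' has cherry {a, b}.
T = {"b": {1}, "c": {1}, "a": {2}, "d": {3}, "e": {3},
     1: {"b", "c", 2}, 2: {1, "a", 3}, 3: {2, "d", "e"}}
T_prime = {"a": {1}, "b": {1}, "c": {2}, "d": {3}, "e": {3},
           1: {"a", "b", 2}, 2: {1, "c", 3}, 3: {2, "d", "e"}}
found = find_star_3_star(T, T_prime)
```

If a triple is found, Lemma 2 states that deleting a, b, and c from both trees lowers the TBR distance by exactly one.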

Fig. 3 The five reductions that are described in Sects. 3.1–3.4: (i) \((*,3,*)\)-reduction, (ii) \((3,1,*)\)-reduction, (iii) (2, 1, 2)-reduction, (iv) (3, 3)-reduction, (v) (3, 2)-reduction. Triangles indicate subtrees. For (iii)–(v), we have omitted some parts of the trees. Note that the reductions do not require the sets P, Q, \(P'\), and \(Q'\) to all be non-empty

3.2 \((3,1,*)\)-Reduction

Let T and \(T'\) be two unrooted binary phylogenetic trees on X, and let \(C = (a,b,c)\) be a 3-chain that is common to T and \(T'\). If \(T'\) has cherry \(\{b,c\}\) and T has cherry \(\{c,x\}\) for some element \(x\in X{\setminus }{C}\), then a \((3,1,*)\)-reduction is the operation of deleting x from T and \(T'\), i.e. we set \(T_r = T|X {\setminus } \{x\}\) and \(T'_r = T'|X {\setminus } \{x\}\). Informally, x prevents C from being a common pendant subtree of T and \(T'\).

Lemma 3

Let T and \(T'\) be two unrooted binary phylogenetic trees on X, and let \(T_r\) and \(T_r'\) be two trees obtained from T and \(T'\), respectively, by a single application of the \((3,1,*)\)-reduction. Then \(d_\mathrm{TBR}(T_r, T'_r) = d_\mathrm{TBR}(T,T') - 1.\)

Proof

Again without loss of generality, we establish the lemma using the same notation as in the definition of a \((3,1,*)\)-reduction. Let \(F_r\) be a maximum agreement forest for \(T_r\) and \(T_r'\), and let F be a maximum agreement forest for T and \(T'\). To show that \(d_\mathrm{TBR}(T_r, T'_r) \ge d_\mathrm{TBR}(T,T') - 1,\) we apply the same argument as in the first part of the proof of Lemma 2 with the modification of considering \(F_r\cup \{\{x\}\}\) (instead of \(F_r\cup \{C\}\)) as an agreement forest for T and \(T'\). Now, by Theorem 5, we may assume that C is preserved in F. Let \(B_x\) be the component of F that contains x, and let \(B_{abc}\) be the component of F such that \(C\subseteq B_{abc}\). By the choice of F, \(B_{abc}\) exists. Clearly, \(B_x\ne B_{abc}\) since, otherwise, ab|cx is a quartet of \(T|B_x\) but not a quartet of \(T'|B_x\). Moreover, if \(|B_x|\ge 2\), then \(T[B_x]\) and \(T[B_{abc}]\) are not vertex-disjoint in T. It now follows that \(B_x=\{x\}\) and that \(F{\setminus } \{B_x\}\) is an agreement forest for \(T_r\) and \(T_r'\). Hence, \(|F_r| \le |F|-1\), so we have

$$\begin{aligned} d_\mathrm{TBR}(T, T') -1= |F|-2 \ge |F_r|-1 = d_\mathrm{TBR}(T_r,T_r'). \end{aligned}$$

\(\square \)

Following on from the proof of Lemma 3, note that \(\{a,b,c\}\) is the leaf set of a pendant subtree that is common to \(T_r\) and \(T_r'\). The two reduced trees can, therefore, be further reduced by the subtree reduction.

3.3 (2, 1, 2)-Reduction

Let T and \(T'\) be two unrooted binary phylogenetic trees on X. Furthermore, let \(\{a,b,c,d,x\}\subseteq X\) such that \(C_1=(a,b)\) and \(C_2=(c,d)\) are two 2-chains that are common to T and \(T'\). If T has cherries \(\{b,x\}\) and \(\{c,d\}\), and if \(T'\) has cherries \(\{a,b\}\) and \(\{d,x\}\), then a (2, 1, 2)-reduction is the operation of obtaining \(T_r\) and \(T_r'\) from T and \(T'\), respectively, by deleting x from T and \(T'\), i.e. \(T_r = T|X {\setminus } \{x\}\) and \(T'_r = T'|X {\setminus } \{x\}\).

Lemma 4

Let T and \(T'\) be two unrooted binary phylogenetic trees on X, and let \(T_r\) and \(T_r'\) be two trees obtained from T and \(T'\), respectively, by a single application of the (2, 1, 2)-reduction. Then \(d_\mathrm{TBR}(T_r, T'_r) = d_\mathrm{TBR}(T,T') - 1.\)

Proof

Without loss of generality, we may assume that the two common 2-chains \(C_1\) and \(C_2\) and their respective configurations in T and \(T'\) are exactly as described in the definition of a (2, 1, 2)-reduction. Then, \(d_\mathrm{TBR}(T_r, T'_r) \ge d_\mathrm{TBR}(T,T') - 1\) follows as described in the proof of Lemma 3. We establish the lemma by showing that \(d_\mathrm{TBR}(T_r, T'_r) \le d_\mathrm{TBR}(T,T') - 1\). Let F be a maximum agreement forest for T and \(T'\), and let \(F_r\) be a maximum agreement forest for \(T_r\) and \(T_r'\). By Theorem 5, we may assume that \(C_1\) and \(C_2\) are preserved in F. (Formally, we apply the theorem to the set of chains \(K=\{ C_1, C_2 \}\), noting that each chain is pendant in one of the two trees.) Let \(B_{ab}\) be the element in F that contains \(C_1\) and, similarly, let \(B_{cd}\) be the element in F that contains \(C_2\). Towards showing that \(\{x\}\in F\), first assume that there exists an element \(B_x\in F\) such that \(|B_x|\ge 2\) and \(B_x\cap \{a,b,c,d,x\}=\{x\}\). Then, it is straightforward to check that \(T[B_x]\) and \(T[B_{ab}]\) are not vertex-disjoint in T; a contradiction. Thus, x is either contained in \(B_{ab}\) or \(B_{cd}\), or \(\{x\}\) is an element in F. Now, if \(B_{ab}=B_{cd}\) and \(x\in B_{ab}\), then ax|cd is a quartet of \(T|B_{ab}\) while ac|dx is a quartet of \(T'|B_{ab}\); a contradiction. Otherwise, if \(B_{ab}\ne B_{cd}\) and \(x\in B_{ab}\), then \(T'[B_{ab}]\) and \(T'[B_{cd}]\) are not vertex-disjoint in \(T'\). Symmetrically, if \(B_{ab}\ne B_{cd}\) and \(x\in B_{cd}\), then \(T[B_{ab}]\) and \(T[B_{cd}]\) are not vertex-disjoint in T. It now follows that \(\{x\}\in F\) and that \(F{\setminus } \{\{x\}\}\) is an agreement forest for \(T_r\) and \(T_r'\). Hence, we have

$$\begin{aligned} d_\mathrm{TBR}(T, T') -1= |F|-2 \ge |F_r|-1= d_\mathrm{TBR}(T_r,T_r'). \end{aligned}$$

\(\square \)

After performing a (2, 1, 2)-reduction, note that the two reduced trees \(T_r\) and \(T_r'\) have common pendant subtrees on leaf sets \(\{a,b\}\) and \(\{c,d\}\), respectively, that can be reduced further under the subtree reduction.

3.4 (3, 3)- and (3, 2)-Reduction

Let T and \(T'\) be two unrooted binary phylogenetic trees on X. The next reduction can be applied in two slightly different situations. The first situation considers two 3-chains while the second situation considers one 3-chain and one 2-chain.

We start by formally describing the first situation. Let \(\{a,b,c,x,y,z\}\subseteq X\) such that \(C_1=(a,b,c)\) and \(C_2=(x,y,z)\) are two 3-chains that are common to T and \(T'\). If T has cherries \(\{b,c\}\) and \(\{x,y\}\), and if \(T'\) has a 6-chain (abcxyz), then a (3, 3)-reduction is the operation of obtaining \(T_r\) and \(T_r'\) from T and \(T'\), respectively, by deleting x and y from T and \(T'\), i.e. \(T_r = T|X {\setminus } \{x,y\}\) and \(T'_r = T'|X {\setminus } \{x,y\}\).

We now turn to the second situation. Let \(\{a,b,c,y,z\}\subseteq X\) such that \(C_1=(a,b,c)\) and \(C_2=(y,z)\) are two chains that are common to T and \(T'\). If T has cherries \(\{b,c\}\) and \(\{y,z\}\), and if \(T'\) has a 5-chain (abcyz), then a (3, 2)-reduction is the operation of obtaining \(T_r\) and \(T_r'\) from T and \(T'\), respectively, by deleting y from T and \(T'\), i.e. \(T_r = T|X {\setminus } \{y\}\) and \(T'_r = T'|X {\setminus } \{y\}\).

Lemma 5

Let T and \(T'\) be two unrooted binary phylogenetic trees on X, and let \(T_r\) and \(T_r'\) be two trees obtained from T and \(T'\), respectively, by a single application of the (3, 3)- or the (3, 2)-reduction. Then \(d_\mathrm{TBR}(T_r, T'_r) = d_\mathrm{TBR}(T,T').\)

Proof

Without loss of generality, we may assume that the two common chains \(C_1\) and \(C_2\) and their respective configurations in T and \(T'\) are exactly as described in the paragraph prior to the statement of this lemma. Let \(Y=\{a,b,c,x,y,z\}\) if \(T_r\) and \(T_r'\) have been obtained from T and \(T'\) by a (3, 3)-reduction and, otherwise, let \(Y=\{a,b,c,y,z\}\).

Given that \(T_r\) and \(T'_r\) are induced subtrees of T and \(T'\) respectively (i.e. obtained from T and \(T'\) using the “|” operator), it follows from Lemma 2.11 of [1] that

$$\begin{aligned} {d_\mathrm{TBR}(T_r, T'_r)\le d_\mathrm{TBR}(T,T').} \end{aligned}$$

To establish the other direction, let \(F_r\) be a maximum agreement forest for \(T_r\) and \(T_r'\), and let F be a maximum agreement forest for T and \(T'\). By Theorem 5, we may assume that \(C_1\) is preserved in \(F_r\). Let \(B_{abc}\) be the element in \(F_r\) such that \(C_1\subseteq B_{abc}\). Similarly, let \(B_z\) be the element in \(F_r\) such that \(z\in B_z\). We have \(B_{abc}\ne B_z\) since, otherwise, bc|az is a quartet of \(T_r|B_{abc}\) while ab|cz is a quartet of \(T_r'|B_{abc}\); a contradiction. Next, observe that if \(B_{abc}\) contains some taxon \(d \not \in \{a,b,c,z\}\), then d is a leaf in the subtree Q of \(T_r'\), as depicted in Fig. 3(iv) and (v). If this was not so, then ad|bc would be a quartet of \(T_r|B_{abc}\) while cd|ab would be a quartet of \(T_r'|B_{abc}\); a contradiction. Combining these facts yields the insight that, in \(T_r'\), the edge between the parent of z and the parent of c (if such an edge exists) is not on the embedding of any component in \(F_r\). Since \(F_r\) is an agreement forest for \(T_r\) and \(T_r'\), it now follows that

$$\begin{aligned} (F_r{\setminus }\{B_z\})\cup \{B_z\cup \{x,y\}\} \end{aligned}$$

is an agreement forest for T and \(T'\) if a (3, 3)-reduction has been applied and that

$$\begin{aligned} (F_r{\setminus }\{B_z\})\cup \{B_z\cup \{y\}\} \end{aligned}$$

is an agreement forest for T and \(T'\) if a (3, 2)-reduction has been applied. Hence,

$$\begin{aligned} d_\mathrm{TBR}(T_r, T'_r) = |F_r|-1 \ge |F|-1 = d_\mathrm{TBR}(T,T'). \end{aligned}$$

\(\square \)

We end this section by noting that it takes \(O( \text {poly}(|X|))\) time to test if any of the new reductions presented in Sects. 3.1–3.4 can be applied. While the (3, 2)- and (3, 3)-reductions preserve the TBR distance, each of the other three new reductions reduces the parameter by exactly one, i.e. the TBR distance for the unreduced trees can be calculated by computing the TBR distance for the reduced trees and adding one to the result.
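The bookkeeping implied by Lemmas 1 to 5 can be sketched as follows: the TBR distance of the original trees equals the distance of the fully reduced trees plus the number of applied reductions that lower the parameter. The dictionary keys and the function name are labels chosen for this illustration only.

```python
# Amount by which each reduction lowers the TBR distance (Lemmas 1 to 5).
PARAMETER_DROP = {
    "subtree": 0,   # Lemma 1
    "chain": 0,     # Lemma 1
    "(*,3,*)": 1,   # Lemma 2
    "(3,1,*)": 1,   # Lemma 3
    "(2,1,2)": 1,   # Lemma 4
    "(3,3)": 0,     # Lemma 5
    "(3,2)": 0,     # Lemma 5
}


def original_distance(reduced_distance, applied_reductions):
    """Recover d_TBR(T, T') from the distance of the reduced trees and the reduction log."""
    return reduced_distance + sum(PARAMETER_DROP[name] for name in applied_reductions)


# Hypothetical log of reductions applied before an exact solver reported distance 3.
log = ["chain", "(3,1,*)", "subtree", "(3,3)", "(2,1,2)"]
recovered = original_distance(3, log)
```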

4 A New Kernel for Computing the TBR Distance

In this section, we establish the main result of this paper. To formally state it, we require a new definition. Let T and \(T'\) be two unrooted binary phylogenetic trees on X. We say that T and \(T'\) are exhaustively reduced if they are subtree and chain reduced, and none of the five reductions presented in Sect. 3 can be applied to T and \(T'\).

Theorem 6

Let T and \(T'\) be two exhaustively reduced unrooted binary phylogenetic trees on X. If \(d_\mathrm{TBR}(T,T')\ge 2\), then \(|X|\le 11d_\mathrm{TBR}(T,T')-9\).

To establish this theorem, we analyze the maximum size of two exhaustively reduced phylogenetic trees with the help of an unrooted binary phylogenetic network N that displays the two trees and the unrooted generator that underlies N. Next, we define unrooted generators.

Let k be a positive integer. For \(k\ge 2\), a k-generator (or simply generator when k is clear from the context) is a connected cubic multigraph with edge set E and vertex set V such that \(k=|E|-(|V|-1)\). The edges of a generator are called its sides. Intuitively, given an unrooted binary phylogenetic network N with \(r(N)=k\), we can obtain a k-generator by repeatedly deleting all (labeled and unlabeled) leaves and suppressing any resulting degree-2 vertices. We say that the generator obtained in this way underlies N. Now, let G be a k-generator, let \(\{u,v\}\) be a side of G, and let Y be a set of leaves. The operation of subdividing \(\{u,v\}\) with |Y| new vertices and, for each such new vertex w, adding a new edge \(\{w,\ell \}\), where \(\ell \in Y\) and Y bijectively labels the new leaves, is referred to as attaching Y to \(\{u,v\}\). Finally, if at least one new leaf is attached to each loop and to each pair of parallel edges in G, then the resulting graph is an unrooted binary phylogenetic network N with \(r(N)=k\). Note that N has no pendant subtree with more than a single leaf. Hence, we have the following observation.

Observation 7

Let N be an unrooted binary phylogenetic network that has no pendant subtree with at least two leaves, and let G be a generator. Then G underlies N if and only if N can be obtained from G by attaching a (possibly empty) set of leaves to each side of G.
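As a quick sanity check of the definition of a k-generator and of the attaching operation, the following Python sketch computes k for a cubic multigraph given as an edge list and subdivides one side. The function names and the tuple encoding of the new vertices are illustrative assumptions, and connectivity of the input is assumed rather than verified.

```python
from collections import Counter


def generator_k(edges):
    """Return k = |E| - (|V| - 1) for a cubic multigraph.

    edges: list of (u, v) pairs; loops (u, u) and parallel edges are allowed.
    The graph is assumed to be connected (this is not checked).
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1  # a loop (u, u) therefore adds 2 to the degree of u
    if any(d != 3 for d in degree.values()):
        raise ValueError("every vertex of a generator must have degree 3")
    return len(edges) - (len(degree) - 1)


def attach_leaves(side, leaves):
    """Attach the leaves in `leaves` to the side {u, w}.

    Returns (path_edges, pendant_edges): the edges of the subdivided side
    and the edges joining each new vertex to its leaf.
    """
    u, w = side
    # Hypothetical naming scheme for the new subdivision vertices.
    inner = [("v", u, w, i) for i in range(1, len(leaves) + 1)]
    path = [u] + inner + [w]
    path_edges = list(zip(path, path[1:]))
    pendant_edges = list(zip(inner, leaves))
    return path_edges, pendant_edges
```

For instance, three parallel edges between two vertices, or two loops joined by a single edge, both yield \(k=2\) with three sides, and attaching two leaves to a side turns it into a path with three edges.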

Now let T and \(T'\) be two subtree and chain reduced unrooted binary phylogenetic trees on X, and let N be an unrooted binary phylogenetic network on X that displays T and \(T'\). Let S and \(S'\) be spanning trees of N obtained by greedily extending a subdivision of T (respectively, \(T'\)) to become a spanning tree, if it is not that already. Since N displays T and \(T'\), S and \(S'\) exist. Furthermore, let G be the generator that underlies N. Since T and \(T'\) are subtree and chain reduced, N does not have a pendant subtree of size at least two. Hence, by Observation 7, we can obtain N from G by attaching leaves to G. Let \(s=\{u,w\}\) be a side of G. Let \(Y=\{\ell _1,\ell _2,\ldots ,\ell _m\}\) be the set of leaves that are attached to s in obtaining N from G. Recall that \(m \ge 0\). Then there exists a path

$$\begin{aligned} u=v_0,v_1,v_2,\ldots ,v_m,v_{m+1}=w \end{aligned}$$

of vertices in N such that, for each \(i\in \{1,2,\ldots ,m\}\), \(v_i\) is the unique parent of \(\ell _i\). We refer to this path as the path associated with s and denote it by \(P_s\). Importantly, for a path \(P_s\) in N that is associated with a side s of G, there is at most one edge in \(P_s\) that is not contained in S, and there is at most one (not necessarily distinct) edge in \(P_s\) that is not contained in \(S'\). We make this precise in the following definition and say that s has b breakpoints relative to S and \(S'\), where

  1. \(b=0\) if S and \(S'\) both contain all edges of \(P_s\),

  2. \(b=1\) if one element in \(\{S,S'\}\) contains all edges of \(P_s\) while the other element contains all but one edge of \(P_s\), and

  3. \(b=2\) if each of S and \(S'\) contains all but one edge of \(P_s\).

Since S and \(S'\) span N, note that s cannot have more than 2 breakpoints relative to S and \(S'\).
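The definition of breakpoints translates directly into a small counting routine. The Python sketch below is only an illustration: the function name is hypothetical, and edges are encoded as `frozenset` pairs of endpoints, which is an assumption of this sketch and cannot distinguish parallel edges.

```python
def breakpoints(path_vertices, s_edges, s_prime_edges):
    """Number of breakpoints of a side relative to spanning trees S and S'.

    path_vertices: the vertices v_0, v_1, ..., v_{m+1} of the path P_s.
    s_edges, s_prime_edges: sets of frozenset({x, y}) edges of S and S'.
    """
    path_edges = [frozenset(pair) for pair in zip(path_vertices, path_vertices[1:])]
    missing_s = sum(edge not in s_edges for edge in path_edges)
    missing_s_prime = sum(edge not in s_prime_edges for edge in path_edges)
    # A spanning tree can omit at most one edge of P_s.
    if missing_s > 1 or missing_s_prime > 1:
        raise ValueError("a spanning tree omits at most one edge of P_s")
    return missing_s + missing_s_prime
```

Note that the two omitted edges need not be distinct: if S and \(S'\) omit the same edge of \(P_s\), the side still has two breakpoints.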

In the language of this paper, Kelk and Linz [12] have established the following result.

Lemma 6

Let N be an unrooted binary phylogenetic network on X that displays two subtree and chain reduced unrooted binary phylogenetic trees T and \(T'\). Let S (resp. \(S')\) be a spanning tree of N obtained by extending a subdivision of T (resp. \(T')\). Furthermore, let G be the generator that underlies N, and let s be a side of G. Suppose that s has b breakpoints relative to S and \(S'\) for some \({b}\in \{0,1,2\}\). Then,

  (i) if \(b=0\), then N can be obtained from G by attaching at most 3 leaves to s;

  (ii) if \(b=1\), then N can be obtained from G by attaching at most 6 leaves to s; and

  (iii) if \(b=2\), then N can be obtained from G by attaching at most 9 leaves to s.

Since Lemma 6 only considers the subtree and chain reduction, a natural question is whether or not the five reductions presented in Sect. 3 improve the bounds on the number of leaves that are attached to a side of a generator. We answer this question positively in the next lemma.

Lemma 7

Let N be an unrooted binary phylogenetic network on X that displays two exhaustively reduced unrooted binary phylogenetic trees T and \(T'\). Let S (resp. \(S')\) be a spanning tree of N obtained by extending a subdivision of T (resp. \(T')\). Furthermore, let G be the generator that underlies N, and let \(s=\{u,v\}\) be a side of G. Suppose that s has b breakpoints relative to S and \(S'\) for some \({b}\in \{0,1,2\}\). Then,

  (i) if \(b=0\), then N can be obtained from G by attaching at most 3 leaves to s;

  (ii) if \(b=1\), then N can be obtained from G by attaching at most 4 leaves to s; and

  (iii) if \(b=2\), then N can be obtained from G by attaching at most 4 leaves to s.

Proof

By Lemma 6(i), (i) follows immediately.

To establish (ii), we show that neither 5 nor 6 leaves are attached to s and note that, by Lemma 6(ii), no more than 6 leaves are attached to s. Without loss of generality, we may assume that S contains all edges of \(P_s\) and that \(S'\) contains all but one edge of \(P_s\). Let e be the edge of \(P_s\) that \(S'\) does not contain. First, assume that 6 leaves are attached to s. Let \(P_s=v_0,v_1,v_2,\ldots ,v_{6},v_7\). Recall that \(u=v_0\) and \(v=v_7\). For each \(i\in \{1,2,\ldots ,6\}\), let \(\ell _i\) be the leaf adjacent to \(v_i\) in N. If \(e\ne \{v_3,v_4\}\), then T and \(T'\) have a common chain of length at least 4; a contradiction since T and \(T'\) are chain reduced. On the other hand, if \(e=\{v_3,v_4\}\), then T and \(T'\) have two common 3-chains \((\ell _1,\ell _2,\ell _3)\) and \((\ell _4,\ell _5,\ell _6)\) such that \((\ell _1,\ell _2,\ldots ,\ell _6)\) is a chain of T, and both of \(\{\ell _2,\ell _3\}\) and \(\{\ell _4,\ell _5\}\) are cherries of \(T'\). Hence, T and \(T'\) can be further reduced under a (3, 3)-reduction; again a contradiction. Second, assume that 5 leaves are attached to s. Let \(P_s=v_0,v_1,v_2,\ldots ,v_{5},v_6\). Since T and \(T'\) are chain reduced, we use an argument analogous to the previous 6-leaf case to show that \(e\in \{\{v_2,v_3\},\{v_3,v_4\}\}\). If \(e=\{v_2,v_3\}\), then T and \(T'\) have common chains \((\ell _1,\ell _2)\) and \((\ell _3,\ell _4,\ell _5)\) and T has a chain \((\ell _1, \ell _2, \ell _3,\ell _4,\ell _5)\), where \(\ell _i\) is again the leaf adjacent to \(v_i\) in N for each \(i\in \{1,2,\ldots ,5\}\). Furthermore, \(T'\) has cherries \(\{\ell _1,\ell _2\}\) and \(\{\ell _3,\ell _4\}\). It now follows that T and \(T'\) can be further reduced under a (3, 2)-reduction; a contradiction to the fact that both trees are exhaustively reduced. If \(e=\{v_3,v_4\}\), we use a symmetric argument to get the same contradiction; thereby establishing (ii).

We complete the proof by showing that (iii) holds. Throughout this part of the proof, suppose that at least 5 leaves are attached to s in the process of obtaining N from G since, otherwise, (iii) holds trivially. Again, consider the path

$$\begin{aligned} P_s=v_0,v_1,v_2,\ldots ,v_{m+1} \end{aligned}$$

that is associated with s in N. Recall that m is the number of leaves that are attached to s. Hence \(m\ge 5\). Let \(\ell _i\) be the leaf adjacent to \(v_i\) in N for each \(i\in \{1,2,\ldots ,m\}\). Furthermore, let \(e=\{v_i,v_j\}\) be the edge of \(P_s\) that is not contained in S, and let \(f=\{v_{i'},v_{j'}\}\) be the edge of \(P_s\) that is not contained in \(S'\). Without loss of generality, we may assume that \(i=j-1\), \(i' = j'-1\), and \(i\le i'\). Moreover, note that if \(i < i'\) then \(C=(\ell _{i+1},\ell _{i+2},\ldots ,\ell _{i'})\) is an \((i'-i)\)-chain that is common to T and \(T'\). Considering four cases and deriving a contradiction for each, we next show that \(i'-i=1\).

Case 1. If \(i'-i>3\), then C has length at least 4 and T and \(T'\) are not chain reduced.

Case 2. If \(i'-i=3\), then \(C=(\ell _{i+1},\ell _{i+2},\ell _{i+3})\) is a maximal common 3-chain of T and \(T'\). Moreover, as \(\{\ell _{i+1},\ell _{i+2}\}\) is a cherry of T and \(\{\ell _{i+2},\ell _{i+3}\}\) is a cherry of \(T'\), it follows that a \((*,3,*)\)-reduction can be applied to T and \(T'\); thereby contradicting that T and \(T'\) are exhaustively reduced.

Case 3. If \(i'-i=2\), then \(C=(\ell _{i+1},\ell _{i+2})\) is a maximal common 2-chain of T and \(T'\). In particular, C is the leaf set of a pendant subtree that is common to T and \(T'\) and that can be further reduced under the subtree reduction.

Case 4. If \(i'-i=0\), then \(\{\ell _1,\ell _2,\ldots ,\ell _i\}\) and \(\{\ell _j,\ell _{j+1},\ldots ,\ell _{m}\}\) are the leaf sets of two pendant subtrees that are common to T and \(T'\). Since \(m\ge 5\), one of these subtrees has size at least two and, so, T and \(T'\) can be further reduced under the subtree reduction.

All four cases contradict the fact that T and \(T'\) are exhaustively reduced. Thus, we may assume for the remainder of the proof that, if \(m\ge 5\), then \(i'-i=1\).

We next establish a maximum for i and minimum for \(i'\). Clearly, if \(i>3\), then \((\ell _1,\ell _2,\ldots ,\ell _i)\) is a chain of length at least 4 that is common to T and \(T'\) that can be reduced by applying a chain reduction. Moreover, if \(i=3\), first recall that \(i'=i+1\). It then follows that \((\ell _1,\ell _2,\ell _3)\) is a chain that is common to T and \(T'\), \(\{\ell _2,\ell _3\}\) is a cherry of T, and \(\{\ell _3,\ell _4\}\) is a cherry of \(T'\). Hence, T and \(T'\) can be further reduced by applying a \((3,1,*)\)-reduction, where \(\ell _4\) takes on the role of x in the definition of this reduction. Hence \(i\le 2\). By symmetry and applying an analogous argument, we derive that \(i'\ge m-2\). In summary, under the assumption that \(m\ge 5\), we have established the following three restrictions on the indices i and \(i'\):

$$\begin{aligned}&i\le 2;\\&i'= i+1, \text { and}\\&i'\ge m-2. \end{aligned}$$

Taking all three together, it follows that \(m\le 5\). So suppose that \(m=5\). Then the three restrictions imply that \(e=\{v_2,v_3\}\) and \(f=\{v_3,v_4\}\). Furthermore, \((\ell _1,\ell _2)\) and \((\ell _4,\ell _5)\) are two 2-chains that are common to T and \(T'\) such that T has cherries \(\{\ell _1,\ell _2\}\) and \(\{\ell _3,\ell _4\}\), and \(T'\) has cherries \(\{\ell _2,\ell _3\}\) and \(\{\ell _4,\ell _5\}\). With \(\ell _3\) taking on the role of x in the definition of a (2, 1, 2)-reduction, it now follows that T and \(T'\) can be further reduced under this reduction, which contradicts the fact that T and \(T'\) are exhaustively reduced. Hence \(m\le 4\); thereby establishing (iii). \(\square \)
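The final step of the proof of (iii) is a small integer feasibility argument, which can be checked by brute force. The Python sketch below enumerates the positions \((i,i')\) allowed by the three restrictions for a given m; the function name is hypothetical, and the range \(0\le i\le i'\le m\) reflects the indexing of the edges of \(P_s\) used in the proof.

```python
def feasible_positions(m):
    """All pairs (i, i') with i <= 2, i' = i + 1 and i' >= m - 2.

    The breakpoints are the edges {v_i, v_{i+1}} and {v_{i'}, v_{i'+1}}
    of the path v_0, ..., v_{m+1}, so both indices lie between 0 and m.
    """
    positions = []
    for i in range(0, m + 1):
        i_prime = i + 1
        if i <= 2 and i_prime <= m and i_prime >= m - 2:
            positions.append((i, i_prime))
    return positions
```

Under these restrictions the only possibility for \(m=5\) is \((i,i')=(2,3)\), and no pair exists for larger m.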

We can now clarify the rather cryptic names of the new reduction rules. From the proof of Lemma 7 we can see that a side s that has 2 breakpoints, indexed by i and \(i'\), respectively (where \(i < i'\)), induces three common chains of length i, \(i'-i\) and \(m-i'\). We can summarize these three lengths in a vector \((i, i'-i, m-i')\). Then the \((*,3,*)\)-reduction can be applied when \(i'-i = 3\), irrespective of the values of i and \(m-i'\), and we denote this indifference using wildcard symbols. The same idea applies to the \((3,1,*)\)- and the (2, 1, 2)-reduction. For sides with a single breakpoint at position i, the vector of common chain lengths induced is given by \((i, m-i)\). Then essentially, the (3, 3)- and the (3, 2)-reduction capture the situation when a 6-chain (resp. 5-chain) in \(T'\) is split into two shorter chains in T by a breakpoint at \(i=3\).
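The naming scheme can be made concrete with a short Python sketch that builds the vector of common chain lengths and compares it with the five names. This is an illustration only: the function names are hypothetical, the comparison looks at chain lengths alone and does not test the full applicability conditions of the reductions in Sect. 3, and treating a vector and its reversal alike is an assumption based on the symmetry arguments in the proof of Lemma 7.

```python
def chain_vector(m, i, i_prime=None):
    """Common chain lengths induced on a side with m leaves.

    One breakpoint at position i gives (i, m - i); two breakpoints at
    positions i < i' give (i, i' - i, m - i').
    """
    if i_prime is None:
        return (i, m - i)
    return (i, i_prime - i, m - i_prime)


def matching_reduction_name(vector):
    """Name of the reduction whose pattern fits the vector, or None."""
    for v in (vector, vector[::-1]):  # reversal: assumed symmetry
        if len(v) == 2 and v in ((3, 3), (3, 2)):
            return f"({v[0]},{v[1]})"
        if len(v) == 3:
            if v[1] == 3:
                return "(*,3,*)"
            if v[0] == 3 and v[1] == 1:
                return "(3,1,*)"
            if v == (2, 1, 2):
                return "(2,1,2)"
    return None
```

For example, a side with \(m=5\) leaves and breakpoints at \(i=2\) and \(i'=3\) has vector (2, 1, 2), whereas a side with \(m=4\) and the same breakpoints has vector (2, 1, 1), which fits none of the five patterns.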

We are now in a position to establish Theorem 6.

Proof of Theorem 6

Let N be an unrooted binary phylogenetic network on X that displays T and \(T'\) such that

$$\begin{aligned} r(N)=h^u(T,T')=d_\mathrm{TBR}(T,T')=k\ge 2, \end{aligned}$$

where the second equality follows from Theorem 1. Let S and \(S'\) be spanning trees of N that are obtained by extending subdivisions of T and \(T'\), respectively. Furthermore, let G be the generator that underlies N. To establish the theorem, we use Lemma 7 to bound from above the number of leaves that can collectively be attached to G over all of its sides. The following approach is similar to the one used in [12, Lemma 3]. By [12, Lemma 1], G has \(3(k-1)\) sides. Furthermore, N contains exactly k edges that are not contained in S, and exactly k edges that are not contained in \(S'\). Each of these edges induces a breakpoint on a corresponding side of G, so each side of G can have 0, 1 or 2 breakpoints and there are 2k breakpoints in total. Let q be the number of sides in G that have two breakpoints relative to S and \(S'\). Noting that \(0\le q\le k\), it follows that there are \(2(k-q)\) sides in G whose number of breakpoints is one relative to S and \(S'\). Hence, there are \(3(k-1)-(2k-q)\) sides in G that have zero breakpoints relative to S and \(S'\). Since T and \(T'\) are exhaustively reduced, we now apply Lemma 7 to derive the following inequality:

$$\begin{aligned} |X|\le 4q+4(2(k-q))+3(3(k-1)-(2k-q))=-q+11k-9. \end{aligned}$$

Clearly, \(-q+11k-9\) is maximum for \(q=0\) and, so, we have

$$\begin{aligned} |X|\le -q+11k-9\le 11k-9=11d_\mathrm{TBR}(T,T')-9 \end{aligned}$$

which establishes the theorem. \(\square \)
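The counting argument in this proof is easy to replay numerically. The following Python sketch evaluates the same expression for arbitrary per-side limits; the function name and the tuple `limits` are hypothetical. It evaluates the algebraic expression only and does not check that the number of sides with zero breakpoints is non-negative, which can fail for very small k and q.

```python
def leaf_bound(k, q, limits):
    """Upper bound on |X| obtained from the counting argument.

    k: TBR distance, so the generator has 3(k - 1) sides.
    q: number of sides with two breakpoints.
    limits: (a0, a1, a2), the maximum number of leaves on a side with
        zero, one and two breakpoints, respectively.
    """
    a0, a1, a2 = limits
    sides = 3 * (k - 1)
    two = q                     # sides with two breakpoints
    one = 2 * (k - q)           # sides with one breakpoint
    zero = sides - (two + one)  # sides with zero breakpoints
    return a2 * two + a1 * one + a0 * zero
```

With the limits (3, 4, 4) of Lemma 7 the sketch returns \(11k-9-q\), in agreement with the inequality above. With the limits (3, 6, 9) of Lemma 6 the same expression evaluates to \(15k-9\) for every q.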

We finish the section with an additional kernel result that establishes an even smaller kernel for particular trees.

Corollary 1

Let T and \(T'\) be two unrooted binary phylogenetic trees on X. If \(d_\mathrm{TBR}(T,T')\ge 2\), T and \(T'\) are subtree reduced and do not have any common n-chain with \(n\ge 2\), then \(|X|\le 5d_\mathrm{TBR}(T,T') - 3\).

Proof

As previously, let N be an unrooted binary phylogenetic network on X that displays T and \(T'\) and has the property that \(r(N)=h^u(T,T')\). Let S and \(S'\) be spanning trees of N obtained by extending subdivisions of T and \(T'\), respectively, and let G be the generator underlying N. For a side s of G, it follows that we can attach at most 3 leaves to s if s has two breakpoints relative to S and \(S'\). Similarly, we can attach at most 2 leaves (resp. 1 leaf) to s if s has one breakpoint (resp. zero breakpoints) relative to S and \(S'\). Interestingly, and in comparison with the proof of Lemma 7, these upper bounds can be easily established using arguments that only rely on the (ordinary) subtree and chain reduction, but make no use of the five reductions presented in Sect. 3. Now, using the same counting argument as in the proof of Theorem 6, we have

$$\begin{aligned} |X|\le 5k-3=5d_\mathrm{TBR}(T,T')-3. \end{aligned}$$
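Spelled out, with q again denoting the number of sides of G that have two breakpoints relative to S and \(S'\), and with the per-side limits 3, 2 and 1 stated above, the count reads

$$\begin{aligned} |X|&\le 3q+2(2(k-q))+1\cdot (3(k-1)-(2k-q))\\&=3q+4k-4q+k-3+q\\&=5k-3. \end{aligned}$$

In contrast to the proof of Theorem 6, the terms in q cancel, so the bound does not depend on q.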

\(\square \)

Fig. 4

Two exhaustively reduced unrooted binary phylogenetic trees \(T_k\) and \(T'_k\) as well as the generator \(G_k\) (and \(G_4\)) that provide a family of trees to show that the kernel presented in Theorem 6 is tight for each \(k=d_\mathrm{TBR}(T_k,T_k')\ge 4\). Blocks A and B indicate a (common) 1-chain and 3-chain, respectively. For details see the main text and [12, Section 4]

5 Tightness of the Kernel Under the New Reductions

In this section, we show that, for two exhaustively reduced trees, the kernel result presented in Theorem 6 is tight. For each \(k\ge 4\), we do this by providing two exhaustively reduced unrooted binary phylogenetic trees \(T_k\) and \(T_k'\) whose leaf sets have size \(11k-9\), such that \(d_\mathrm{TBR}(T_k,T_k')=k\). To illustrate, \(T_k\) and \(T_k'\) are shown in Fig. 4. It is straightforward to check that \(T_k\) and \(T_k'\) are exhaustively reduced. While we do not go into detail about justifying that \(T_k\) and \(T_k'\) indeed provide a tight example, i.e. \(d_\mathrm{TBR}(T_k,T_k')=k\), we point the interested reader to [12, Section 4], where a very similar family of constructions is given to show that the kernel result presented in [12] is tight for phylogenetic trees that are subtree and chain reduced, and do not contain any common so-called cluster. The approach taken there, which uses unrooted generators to argue that \(d_\mathrm{TBR}(T_k,T_k') \le k\) and maximum parsimony distance [8] to prove \(d_\mathrm{TBR}(T_k,T_k') \ge k\), can be easily adapted to establish the following proposition from which tightness of the kernel presented in Theorem 6 immediately follows (see Footnote 1).

Proposition 1

For \(k\ge 4\), let \(T_k\) and \(T_k'\), be the two exhaustively reduced unrooted binary phylogenetic trees on X that are shown in Fig. 4. Then \(d_\mathrm{TBR} (T_k,T_k')=k\).

6 Discussion and Future Work

To further lower the \(11k-9\) bound using the approach described in this article requires reduction rules to prohibit generator sides from having 4 leaves (and 1 or 2 breakpoints), or 3 leaves (and 0 breakpoints). However, in such situations it is clear neither how to reduce the TBR distance by 1 nor how to reduce the number of taxa without altering the TBR distance. Hence, new techniques are required which do not just look “locally” at individual sides of the generator, but at the way multiple sides of the generator interact. We hope to return to this issue in future work. Interestingly, although it is not yet clear how to eliminate these cases in the context of kernelization, the analysis in our paper does convey additional structural information. For example, the argument behind the (3, 3)- and (3, 2)-reduction directly identifies an edge, in one of the trees, that can safely be deleted if we wish to progressively transform that tree into a maximum agreement forest. These edges can sometimes still be identified even in situations when our new reduction rules do not apply. Such insights, together with Theorem 5, can potentially be used by FPT branching algorithms that compute the TBR distance by iteratively deleting edges to obtain agreement forests (see e.g. [5]). Could the unrooted generator approach, coupled with the reduction rules described in this article, be used to reduce the branching factor of such algorithms?

Fig. 5

Two unrooted binary phylogenetic trees T and \(T'\) on X that have a common 3-chain \(C=(a,b,c)\) that is not pendant in T or \(T'\). The leaf sets of the subtrees indicated by the left and right solid grey triangle of T (resp. \(T'\)) are \(L_T(C)\) and \(R_T(C)\) (resp. \(L_{T'}(C)\) and \(R_{T'}(C)\)). (i) An inside-outside component B with respect to C that straddles C in T and \(T'\), and with \(c\in B\). (ii) An inside-outside component \(B'\) with respect to C that does not straddle C in T or \(T'\), and with \(b,c\in B'\). (iii) A bypass component \(B''\) in T and \(T'\) with respect to C. The components B, \(B'\), and \(B''\) are indicated by their embeddings in T and \(T'\) (thick black lines)

Fig. 6

Two unrooted binary phylogenetic trees T and \(T'\) on X that have a common 2-chain \(C=(a,b)\) that is pendant in \(T'\). The leaf sets of the subtrees indicated by the left and right grey solid triangle of T are \(L_T(C)\) and \(R_T(C)\), respectively, whereas the leaf set of the subtree indicated by the solid grey triangle of \(T'\) is \(X{\setminus }\{a,b\}\). (i) An inside-outside component B with respect to C that does not straddle C in T, and with \(a\in B\). (ii) A bypass component \(B'\) in T with respect to C, where the leaf sets of the subtrees indicated by the thick black triangles are such that \(Q=P\cup P'\). The components B and \(B'\) are indicated by their embeddings in T and \(T'\) (thick black lines)

Fig. 7

Setting as described in Case 1.2.1 in the proof of Theorem 5, where we consider a chain \(C=(1,2,\ldots ,n)\) with \(n\ge 3\) that is not pendant in T or \(T'\), and an agreement forest F for T and \(T'\) that does not contain a bypass component and does contain exactly one inside-outside component \(B_1\) with respect to C such that \(|B_1\cap C|\ge 2\): (i) \(B_1\) straddles C in T or \(T'\) (and hence \(B_1\) straddles C in both T and \(T'\)); (ii) \(B_1\) does not straddle C in T or \(T'\). The last column shows an agreement forest \(F^*\) for T and \(T'\) with \(|F^*|\le |F|\) such that C is preserved in \(F^*\). Note that only those elements of F and \(F^*\) are shown that contain leaves labeled by elements in C while all other elements are omitted since they are the same in \(F^*\) and F. Furthermore, thick black lines in T and \(T'\) indicate the embedding of \(B_1\)

Fig. 8

Setting as described in Case 1.2.2 in the proof of Theorem 5, where we consider a chain \(C=(1,2,\ldots ,n)\) with \(n\ge 3\) that is not pendant in T or \(T'\), and an agreement forest F for T and \(T'\) that does not contain a bypass component and does contain exactly one inside-outside component \(B_1\) with respect to C such that \(|B_1\cap C|= 1\): (i) \(B_1\) straddles C in one of T and \(T'\), say T; (ii) \(B_1\) does not straddle C in T or \(T'\) which implies that there exists at least one component \(B_i\) with \(B_i\subseteq C\). The last column shows an agreement forest \(F^*\) for T and \(T'\) with \(|F^*|\le |F|\) such that C is preserved in \(F^*\). Note that only those elements of F and \(F^*\) are shown that contain leaves labeled by elements in C while all other elements are omitted since they are the same in \(F^*\) and F. Furthermore, thick black lines in T and \(T'\) indicate the embedding of \(B_1\)