Extremal Distances for Subtree Transfer Operations in Binary Trees

Atkins, Ross; McDiarmid, Colin

doi:10.1007/s00026-018-0410-4

Open access
Published: 13 December 2018

Volume 23, pages 1–26, (2019)

Annals of Combinatorics

Abstract

Three standard subtree transfer operations for binary trees, used in particular for phylogenetic trees, are: tree bisection and reconnection (TBR), subtree prune and regraft (SPR), and rooted subtree prune and regraft (rSPR). We show that for a pair of leaf-labelled binary trees with n leaves, the maximum number of such moves required to transform one into the other is $n-\Theta (\sqrt{n})$, extending a result of Ding, Grünewald, and Humphries, and this holds also if one of the trees is fixed arbitrarily. If the pair is chosen uniformly at random, then the expected number of moves required is $n-\Theta (n^{2/3})$. These results may be phrased in terms of agreement forests: we also give extensions for more than two binary trees.

1 Introduction

A standard way to transform one binary tree into another is by performing ‘subtree-prune-and-regraft’ (SPR) moves (definitions are given below). These operations are of particular interest for phylogenetic trees. For a pair of binary trees with n labelled leaves, it was shown by Allen and Steel [2] in 2001 that the maximum number of SPR moves required to change one into the other, $D_{SPR}(n)$, is between $\frac{n}{2}-o(n)$ and $n-3$ (for $n \ge 3$). Martin and Thatte [11] show the existence of a common (unrooted) subtree of size $\Omega \left( \sqrt{\log n}\right) $, which brings the upper bound for $D_{SPR}(n)$ down to $n - \Omega \big (\sqrt{ \log n}\big )$.

Ding, Grünewald, and Humphries [6] show that $D_{SPR}(n)$ is actually $n-\Theta \left( \sqrt{n}\right) $. Our Theorem 1.9 is an extended version of this result for the rooted as well as the unrooted cases. It says also that if one of the trees is fixed arbitrarily and the other is chosen to maximise the number of SPR moves required to turn one into the other, then still $n-\Theta (\sqrt{n})$ moves are required. We prove the last result by showing that for a fixed pair of binary trees, we can label the leaves in such a way that $n - \Theta (\sqrt{n})$ moves are required to turn one into the other. This result is set in a more general context of agreement forests for sets of $k \ge 2$ trees.

We also consider the case when two trees are chosen independently, with an exchangeable distribution (for example, the uniform or the Yule–Harding distribution). We show that, if X is the number of moves required to transform one into the other, then $\mathbb {E}[X] = n-\Theta \left( n^{2/3}\right) $ and indeed $X = n-\Theta \big (n^{2/3}\big )$ with high probability; see Corollary 1.11. This implies that the expected size of a largest common (unrooted) subtree of two random trees is $\mathcal {O}\big (n^{2/3}\big )$, where we measure size by the number of leaves. It is known, see [3] and [5], that this expected size is $\Omega \big (n^{1/8}\big )$ in the uniform case and $\Omega \big (n^{0.344}\big )$ in the Yule–Harding case; and it is $\mathcal {O}\big (n^{1/2}\big )$ in both cases. Our results generalise the upper bound to random sets of $k \ge 2$ trees with n leaves in the following way. The expected size of a minimum agreement forest of such a random set of k trees is $n - \Theta \big (n^{2/(k+1)}\big )$, and the expected size of a maximum agreement subtree is $\mathcal {O}\big (n^{2/(k+1)}\big )$; see Eqs. (5.2) and (5.3).

Subtree transfer operations are subtly different when acting on rooted trees as opposed to unrooted trees. Some papers have considered SPR and rSPR moves as acting on different types of binary trees. We give a unified treatment here, and consider TBR, SPR, and rSPR to be three different classes of moves which act on the same class of trees. We work on the class $\mathcal {B}(n)$ of unrooted trees defined below, where always one of the leaf labels is 0. The leaf labelled 0 acts as the root of the tree for rooted-subtree-prune-and-regraft (rSPR) moves, and it behaves in exactly the same manner as the other leaves for SPR and TBR moves.^{Footnote 1}

In the subsections below, we define binary trees, the three subtree transfer operations TBR, SPR, and rSPR, and the corresponding metrics they induce over the class of binary trees, and present our main theorems.

1.1 Binary Trees

Definition 1.1

Let S be a non-empty, finite set. If $|S|>1$, then a binary leaf-labelled tree with leaf-label set S is a finite tree which has 2 types of vertices:

tree vertices of degree 3, and
leaf vertices of degree 1, labelled with the set S (there is a bijection between the leaves and S).

If $0 \in S$, the leaf labelled 0 is called the root of the tree. In the trivial case, when $S = \{ \alpha \}$ is a singleton, the trivial graph containing only one vertex, labelled $\alpha $, is considered to be a binary leaf-labelled tree, and its vertex is called a leaf.

Definition 1.2

For any non-empty set S, let $\mathcal {B}(S)$ be the set of all binary leaf-labelled trees, with label set S. For a leaf-labelled tree A, let L(A) denote the label set of A; so if $A \in \mathcal {B}(S),$ then $L(A) = S$. To simplify notation, for any positive integer n, let $[n]_0 = \{ 0,\,1,\,2, \ldots ,\,n-1 \}$ and let $\mathcal {B}(n)$ be given by

$$\begin{aligned} \mathcal {B}(n) = \mathcal {B}([n]_0) = \mathcal {B}( \{0,\,1,\,2,\,3, \ldots ,\,n-1\}). \end{aligned}$$

So, a tree in $\mathcal {B}(n)$ has n leaves and, for $n \ge 2$, it has $n-2$ tree-vertices (giving $2n-2$ vertices in total) and $2n-3$ edges.

For trees T and $T^\prime $ in $\mathcal {B}(S)$, we say T is isomorphic to $T^\prime $ (denoted $T \equiv T^\prime $) if there is a graph isomorphism preserving leaf labels; that is, if there is a bijection $\psi :V(T) \rightarrow V(T^\prime )$ such that uv is an edge in T, if and only if $\psi (u)\psi (v)$ is an edge in $T^\prime $, and for each leaf v of T, v and $\psi (v)$ are labelled with the same element of S. We identify isomorphic trees, as is customary. For non-empty $S\,^\prime \subseteq L(A)$, let $A | S\,^\prime $ be the minimal subtree of A that contains all the leaves with labels in $S\,^\prime $. Let $A / S\,^\prime $ be the tree formed by suppressing any vertices of degree 2 from $A | S\,^\prime $. So,

$$\begin{aligned} A | S\,^\prime \text{ is } \text{ a } \text{ subgraph } \text{ of } A \qquad \text{ and } \qquad A / S\,^\prime \in \mathcal {B}( S\,^\prime ). \end{aligned}$$

For a given tree $T \in \mathcal {B}(n)$ and a permutation $\pi \in S_{n-1}$, let $T^\pi $ denote the tree in $\mathcal {B}(n)$ obtained from T by using $\pi $ to permute the non-root leaf labels.
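The restriction $A / S\,^\prime $ defined above can be sketched in code. This is a minimal illustration under assumptions that are not from the paper: a tree is stored as a dictionary mapping each vertex to the set of its neighbours, leaves are identified with their labels, and the name `restrict` is hypothetical.

```python
def restrict(adj, labels):
    """Sketch of A / S' for a tree given as {vertex: set of neighbours}.

    Assumption (not from the paper): each leaf vertex is its own label.
    """
    keep = set(labels)
    sub = {v: set(ns) for v, ns in adj.items()}  # copy, so adj is untouched

    # Step 1 (A | S'): repeatedly delete vertices of degree <= 1 outside S'.
    stack = [v for v in sub if len(sub[v]) <= 1 and v not in keep]
    while stack:
        v = stack.pop()
        if v not in sub:
            continue
        for u in sub.pop(v):
            sub[u].discard(v)
            if len(sub[u]) <= 1 and u not in keep:
                stack.append(u)

    # Step 2 (A / S'): suppress every remaining vertex of degree 2.
    for v in [v for v in sub if len(sub[v]) == 2 and v not in keep]:
        a, b = sub.pop(v)
        sub[a].discard(v)
        sub[b].discard(v)
        sub[a].add(b)
        sub[b].add(a)
    return sub


# A caterpillar in B(5): leaves 0..4, tree vertices "u", "v", "w".
edges = [(0, "u"), (1, "u"), ("u", "v"), (2, "v"), ("v", "w"), (3, "w"), (4, "w")]
A = {}
for p, q in edges:
    A.setdefault(p, set()).add(q)
    A.setdefault(q, set()).add(p)

star = restrict(A, {0, 2, 4})  # the unique tree in B({0, 2, 4})
```

Restricting to a single label returns the trivial one-vertex tree, in line with Definition 1.1.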

1.2 Tree Bisection and Reconnection

Definition 1.3

A tree-bisection-and-reconnection (TBR) move on a tree $T \in \mathcal {B}(n)$, for $n \ge 3$, is a two-step process:

(1) (Bisection step) Delete an edge e from T (this edge is called the bisection edge) and then suppress all degree 2 vertices (the edges created by these suppressions are called new edges) to obtain a pair $T_1,\,T_2$ of binary trees with $L(T_1) \sqcup L(T_2) = [n]_0$. (The number of new edges is 2, unless the bisection edge was incident with a leaf. In the latter case, there is only one new edge and either $T_1$ or $T_2$ is an isolated vertex.)
(2) (Reconnection step) Connect $T_1$ and $T_2$ by creating an edge between the midpoint of an edge in $T_1$ and the midpoint of an edge in $T_2$. This edge is called the reconnecting edge. (In the case that $T_1$ or $T_2$ is an isolated vertex, the reconnecting edge is incident to this vertex.)

See Fig. 1: on the left, the dark vertical edge is the bisection edge, and the diagonal grey edges will form the new edges; and on the right, the dark vertical edge is the reconnecting edge, and the horizontal grey edges are the new edges. If $A \in \mathcal {B}(n)$, and B is the result of performing a TBR on A, then $B \in \mathcal {B}(n)$ because B is connected and acyclic, and all of its non-leaf vertices have degree 3.

Definition 1.4

Let $\mathcal {TBR}= \mathcal {TBR}(n)$ denote the set of pairs $(A,\,B) \in \mathcal {B}(n)^2$ such that B can be obtained from A by performing a single TBR move.

It is well known that TBR moves are reversible [14]; that is, if B can be obtained from A via a TBR move, then A can be obtained from B via a TBR move. Thus,

$$\begin{aligned} (A,\,B) \in \mathcal {TBR}\quad \text{ if } \text{ and } \text{ only } \text{ if } \quad (B,\,A) \in \mathcal {TBR}. \end{aligned}$$

1.3 Subtree Prune and Regraft

An SPR move (Definition 1.5) is an operation on binary trees, whereby a subtree is removed from one part of the tree and regrafted to another part of the tree. SPR moves are studied to measure similarity between different trees [2, 4, 10].

Definition 1.5

A subtree-prune-and-regraft (SPR) move taking xy from ab to cd is an operation which can be performed on a tree $T \in \mathcal {B}(n)$, as long as $xy,\,ax,\,xb,\,cd$ are distinct edges (vertices $a,\,b,\,c,\,d$ need not be distinct, possibly $a=c$, $b=c$, $a=d$, or $b=d$) and the path from c to y contains x. An SPR move is a two-step process:

(prune step) delete edges ax, xb and replace them with ab, then
(regraft step) delete edge cd and replace it with cx and xd.

Let $\mathcal {SPR}= \mathcal {SPR}(n)$ denote the set of pairs $(A,\,B)$ such that $A \in \mathcal {B}(n)$ and B is the result of an SPR move on A. If the path from 0 to y passes through x, then this is called a rooted-subtree-prune-and-regraft (rSPR) move. Let $r\mathcal {SPR}= r\mathcal {SPR}(n)$ denote the set of pairs $(A,\,B) \in \mathcal {SPR}(n)$ such that B is the result of an rSPR move on A.

See Fig. 2. By definition, an rSPR move is a special case of an SPR move. It is not difficult to check that an SPR move is a special case of a TBR move. Informally, an SPR move is a TBR move in which one of $T_1$ or $T_2$ is an isolated vertex or one of the end points of the reconnecting edge is the midpoint of one of the new edges. Therefore, any valid SPR move performed on a binary tree in $\mathcal {B}(n)$ will result in a tree in $\mathcal {B}(n)$, and moreover:

$$\begin{aligned} r\mathcal {SPR}\subseteq \mathcal {SPR}\subseteq \mathcal {TBR}. \end{aligned}$$

These inclusions have been noted in many publications [2, 4, 14]. If we think of the edges as being oriented away from the root, then an rSPR move is an SPR move that preserves these orientations. It is also known that SPR moves and rSPR moves are reversible [14]. Thus,

$$\begin{aligned} (A,\,B) \in \mathcal {SPR}\Longleftrightarrow (B,\,A) \in \mathcal {SPR}\quad \text{ and } \quad (A,\,B) \in r\mathcal {SPR}\Longleftrightarrow (B,\,A) \in r\mathcal {SPR}. \end{aligned}$$
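The prune and regraft steps of Definition 1.5 can be sketched as follows. The adjacency-set representation and the names `tree_path`, `spr`, and `is_rooted_move` are assumptions made for illustration and do not come from the paper.

```python
def tree_path(adj, s, t):
    """Vertices on the unique path from s to t in a tree."""
    parent = {s: None}
    stack = [s]
    while stack:
        v = stack.pop()
        if v == t:
            break
        for u in adj[v]:
            if u not in parent:
                parent[u] = v
                stack.append(u)
    path = [t]
    while path[-1] != s:
        path.append(parent[path[-1]])
    return path[::-1]


def spr(adj, x, y, a, b, c, d):
    """Return the tree obtained by the SPR move taking xy from ab to cd."""
    pairs = ((x, y), (a, x), (x, b), (c, d))
    if any(v not in adj[u] for u, v in pairs):
        raise ValueError("xy, ax, xb, cd must all be edges")
    if len({frozenset(e) for e in pairs}) != 4:
        raise ValueError("xy, ax, xb, cd must be distinct edges")
    if x not in tree_path(adj, c, y):
        raise ValueError("the path from c to y must contain x")

    new = {v: set(ns) for v, ns in adj.items()}

    def remove(u, v):
        new[u].discard(v)
        new[v].discard(u)

    def add(u, v):
        new[u].add(v)
        new[v].add(u)

    # Prune step: delete ax and xb, replace them with ab.
    remove(a, x); remove(x, b); add(a, b)
    # Regraft step: delete cd, replace it with cx and xd.
    remove(c, d); add(c, x); add(x, d)
    return new


def is_rooted_move(adj, x, y, root=0):
    """True if the path from the root to y passes through x (rSPR condition)."""
    return x in tree_path(adj, root, y)


# Caterpillar in B(5): move leaf 1 so that it becomes a neighbour of leaf 4.
edges = [(0, "u"), (1, "u"), ("u", "v"), (2, "v"), ("v", "w"), (3, "w"), (4, "w")]
A = {}
for p, q in edges:
    A.setdefault(p, set()).add(q)
    A.setdefault(q, set()).add(p)
B = spr(A, x="u", y=1, a=0, b="v", c="w", d=4)
```

In this example the pruned subtree is the single leaf 1, and the move is also an rSPR move because the path from 0 to 1 passes through the vertex `"u"`.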

Not all SPR moves can be achieved with a single rSPR move. Indeed, we may need an arbitrarily large number of rSPR moves to simulate one SPR move [4]. For example, the trees in Fig. 3 differ by a single SPR move: prune off the single leaf labelled 0, and then regraft it appropriately. In Example 2.3, we will show that at least $\frac{n-3}{2}$ rSPR moves are required to transform one into the other.

The following lemma appears as [2, Lemma 2.7 (2b)].

Lemma 1.6

[2]. Any TBR move can be achieved using at most 2 SPRs; that is, if $(A,\,B) \in \mathcal {TBR}$, then there exists some C such that $(A,\,C),\,(C,\,B) \in \mathcal {SPR}$.

1.4 Tree Metrics

The SPR distance (Definition 1.7) has been the subject of several publications, for example, [2, 8]. The problem of determining the rSPR distance between a pair of trees was shown to be NP-hard by Bordewich and Semple [4] in 2005. We shall see later that all of the values in the following definition are finite.

Definition 1.7

Let $n \ge 2$ be a fixed integer. For $A,\,B \in \mathcal {B}(n)$ and any subtree transfer operation $\chi \in \{ TBR ,\, SPR ,\, rSPR \}$, define the $\chi $-distance between A and B (denoted $d_\chi (A,\,B)$) to be the minimum number of $\chi $ moves required to change A into B. Let $D_\chi (n)$ denote the $\chi $-diameter of the class $\mathcal {B}(n)$,

$$\begin{aligned} D_\chi (n) \; := \; \max _{A,\,B \in \mathcal {B}(n)} \; d_\chi (A,\,B); \end{aligned}$$

and moreover, let $R_\chi (n)$ denote the $\chi $-radius of the class $\mathcal {B}(n)$,

$$\begin{aligned} R_\chi (n) \; := \; \min _{C \in \mathcal {B}(n)} \; \max _{A \in \mathcal {B}(n)} \; d_\chi (A,\,C). \end{aligned}$$
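As a small illustration of the max and min-max in these two definitions, the sketch below computes the diameter and radius of an arbitrary finite metric. The function name `diameter_and_radius` and the three-point toy metric are assumptions chosen for the example; they are not tree distances from the paper.

```python
def diameter_and_radius(points, dist):
    """Diameter = max over pairs; radius = min over centres C of max_A dist(A, C)."""
    eccentricity = {c: max(dist(a, c) for a in points) for c in points}
    return max(eccentricity.values()), min(eccentricity.values())


# Toy metric (hypothetical): three points on a line at positions 0, 1, 2.
position = {"a": 0, "b": 1, "c": 2}
D, R = diameter_and_radius(position, lambda p, q: abs(position[p] - position[q]))
```

The middle point is the best centre here, so the radius is strictly smaller than the diameter; in general one only has $D \ge R$.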

The following is an immediate consequence of $r\mathcal {SPR}\subseteq \mathcal {SPR}\subseteq \mathcal {TBR}$.

Proposition 1.8

For any $\chi \in \{ TBR,\,SPR,\,rSPR \}$, and any integer $n>1$, we have $D_{\chi }(n) \ge R_{\chi }(n)$. Moreover,

$$\begin{aligned} D_{{\text {r}}SPR}(n) \ge D_{SPR}(n) \ge D_{TBR}(n) \quad \text{ and } \quad R_{{\text {r}}SPR}(n) \ge R_{SPR}(n) \ge R_{TBR}(n). \end{aligned}$$

1.5 Main Theorems

We give both the worst-case and the average-case results on numbers of moves needed by each of the three subtree transfer operations, starting with the worst-case results. The values of the radius and diameter for $n \le 9$ are given in Fig. 4. The asymptotic values of the radius and diameter are given in the following theorem.

Theorem 1.9

For each $\chi \in \{ TBR,\,SPR,\,rSPR \}$, both $D_\chi (n)$ and $R_\chi (n)$ are $n-\Theta (\sqrt{n})$.

By Proposition 1.8, to prove Theorem 1.9, it suffices to give an upper bound for the rSPR diameter and a lower bound for the TBR radius, both of the form $n-\Theta (\sqrt{n})$. These bounds are given explicitly in the inequalities (2.2) and (4.3), respectively. One of the results contained within Theorem 1.9 is that $D_{TBR}(n) = n - \Theta (\sqrt{n})$. This result is not new: it was shown recently in [6], where the proof uses a similar approach to ours, based on agreement forests. Beyond treating the other types of move, we also show for TBR moves that $R_{TBR}(n) = n - \Theta (\sqrt{n})$, and we generalise agreement forest results to arbitrary sets of trees.

Next, let us consider distances between random trees. The fundamental random distance to consider is that between one fixed tree and a second fixed tree with its leaves randomly relabelled. We shall see that the expected distance is $n-\Theta \big (n^{2/3}\big )$, and the same estimate holds with high probability for random distances. Recall that $T^\pi $ denotes the tree obtained from T by using $\pi $ to permute the non-root leaf labels.

Theorem 1.10

Let $\chi \in \{ TBR,\,SPR,\,rSPR \}$ be fixed. There exist constants $c_1>c_2>0$ and $c_3>0$ such that the following holds. Let $n \ge 3$ be an integer, let $T_1$ and $T_2$ be trees in $\mathcal {B}(n)$, and let the random permutation $\pi $ be uniformly distributed over $S_{n-1}$. Then

$$\begin{aligned} n - c_1 n^{2/3}< \mathbb {E}\left[ d_{\chi }\left( T_1,\,T_2^{\pi }\right) \right] < n - c_2n^{2/3}, \end{aligned}$$

(1.1)

and further

$$\begin{aligned} \mathbb {P}\left( n - c_1 n^{2/3}< d_{\chi }\left( T_1,\,T_2^{\pi }\right) < n - c_2n^{2/3}\right) \ge 1 - e^{-c_3 n^{1/3}}. \end{aligned}$$

(1.2)

Again by Proposition 1.8, we will prove the statement (1.1) by demonstrating that the expected rSPR distance is at most $n - \Omega \big (n^{2/3}\big )$ and the expected TBR distance is at least $n-\mathcal {O}\big (n^{2/3}\big )$; see the inequalities (3.3) and (5.1). The statement (1.2) will be proved from (1.1) and a lemma on concentration around the mean. In fact, since $0 \le d_{\chi }(A,\,B) \le n$, the second statement (1.2) easily implies the first (1.1) (with $c_1$ and $c_2$ replaced by $2c_1$ and $\tfrac{1}{2} c_2$, respectively).

Define an equivalence relation $\sim $ on $\mathcal {B}(n)$ by setting $T_1 \sim T_2$ when $T_1=T_2^{\sigma }$ for some permutation $\sigma \in S_{n-1}$. A random tree T in $\mathcal {B}(n)$ has an exchangeable distribution [1] if $\mathbb {P}(T=T_1)=\mathbb {P}(T=T_2)$ whenever $T_1 \sim T_2$. Exchangeability is also known as leaf-invariance [13]. The uniform distribution and the Yule–Harding distribution [15] are both examples of exchangeable distributions. We shall deduce the following corollary quickly from Theorem 1.10. It shows in particular that

$$\begin{aligned} \mathbb {E}[d_\chi (A,\,B)] = n - \Theta \big (n^{2/3}\big ). \end{aligned}$$

Corollary 1.11

Let $\chi \in \{ TBR,\,SPR,\,rSPR \}$ be fixed. There exist constants $c_1>c_2>0$ and $c_3>0$ such that the following holds. Let $n \ge 3$ be an integer, and let the random trees A and B in $\mathcal {B}(n)$ be sampled independently, where B has an exchangeable distribution (and A may have any distribution). Then,

$$\begin{aligned} n - c_1 n^{2/3}< \mathbb {E}\left[ d_{\chi }(A,\,B)\right] < n - c_2n^{2/3}, \end{aligned}$$

and further

$$\begin{aligned} \mathbb {P}\left( n - c_1 n^{2/3}< d_{\chi }(A,\,B) < n - c_2n^{2/3}\right) \ge 1 - e^{-c_3 n^{1/3}}. \end{aligned}$$

The rest of this paper is devoted to proving the above results, or amplified versions of them.

2 Upper Bound for the rSPR Diameter

The upper bound in Theorem 1.9 is proved in this section using rooted agreement forests. The definition of a rooted agreement forest given here is equivalent to that used by Bordewich and Semple [4].

Definition 2.1

Let $\mathcal {A}\subseteq \mathcal {B}(n)$ be a set of $k \ge 2$ trees. A rooted agreement forest of $\mathcal {A}$ is a partition $L_1,\,L_2, \ldots ,\, L_m$ of the leaf-label set $[n]_0 = \{ 0,\,1,\,2, \ldots ,\,n-1 \}$, such that for all A, B in $\mathcal {A}$:

the subtrees $A|L_1,\, A|L_2, \ldots ,\, A|L_m$ are pairwise disjoint, and
$A/(L_j \cup \{ 0 \}) \equiv B/(L_j \cup \{ 0 \})$ for $j = 1,\,2, \ldots ,\,m$.

Let $M_{\text {r}}(\mathcal {A})$ be the minimal possible value of m, the minimal number of parts in a rooted agreement forest of $\mathcal {A}$. When $k=2$, we write $M_{\text {r}}(A,\,B)$ for $M_{\text {r}}(\{ A,\,B \})$.

If $|L_j {\setminus } \{0\}| \le 2$ for each j, then the ‘isomorphism’ condition above must be satisfied, though not necessarily the ‘disjointness’ condition. We can always construct a trivial rooted agreement forest with $n-2$ parts (when $n \ge 3$) by setting $L_1 = \{ 0 ,\, 1 ,\, 2 \}$ and $L_{i-1} = \{ i \}$ for $i = 3,\,4,\,5, \ldots ,\,n-1$: thus, we always have $M_{\text {r}}(\mathcal {A}) \le n-2$. Three minimal agreement forests are depicted in Fig. 5.

The following result is proved by Bordewich and Semple [4], built on the work of Hein et al. [9].

Lemma 2.2

(Rooted Agreement Forest Lemma). If A and B are any two trees in $\mathcal {B}(n)$, then

$$\begin{aligned} M_{\text {r}}(A,\,B) = d_{{\text {r}}SPR}(A,\,B) + 1. \end{aligned}$$

The rSPR distance between a pair of binary trees is often used as a measure of the discrepancy between them. So $M_{\text {r}}(\mathcal {A})$ can be seen as a generalisation of this notion from pairs of trees to arbitrary sets of trees.

Example 2.3

Let A and B be the two caterpillar trees in Fig. 3 (with A on the right), which differ only by a single SPR move changing the location of the root. The other leaves in A are ordered in the usual ($1,\,2,\,3, \ldots ,\,n-1$) order from the root, while in B they are in the reverse order. Notice that $A/\{ 0,\,a,\,b,\,c \}$ is never isomorphic to $B/\{ 0,\,a,\,b,\,c \}$ because the leaf nearest to 0 in $A/\{ 0,\,a,\,b,\,c \}$ is the smallest of $\{ a,\,b,\,c \},$ while the leaf nearest to 0 in $B/\{ 0,\,a,\,b,\,c \}$ is the largest. Therefore if $\{ L_1,\,L_2, \ldots ,\,L_m\}$ is a rooted agreement forest for $\{ A,\,B \}$, with $0 \in L_1$, then $|L_1| \le 3$ and $|L_i| \le 2$ for all $i>1$. So, $n \le 3+2(m-1)$ and therefore $m \ge \frac{n-1}{2}$. By the rooted agreement forest lemma (Lemma 2.2), this means $d_{{\text {r}}SPR}(A,\,B) \ge \frac{n-3}{2}$ and so

$$\begin{aligned} D_{{\text {r}}SPR}(n) \ge \frac{n-3}{2}. \end{aligned}$$
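The counting in Example 2.3 can be written out as a short chain of inequalities. The block below only restates the steps of the example; the instance $n = 9$ is an illustrative choice.

```latex
\begin{aligned}
n \le 3 + 2(m-1) = 2m + 1
  &\;\Longrightarrow\; m \ge \tfrac{n-1}{2}, \\
d_{\mathrm{r}SPR}(A,B) = M_{\mathrm{r}}(A,B) - 1
  &\;\Longrightarrow\; d_{\mathrm{r}SPR}(A,B) \ge \tfrac{n-1}{2} - 1 = \tfrac{n-3}{2}, \\
n = 9
  &\;\Longrightarrow\; m \ge 4 \quad\text{and}\quad d_{\mathrm{r}SPR}(A,B) \ge 3 .
\end{aligned}
```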

The next two lemmas (Lemmas 2.4, 2.5) are used in the proof of Theorem 1.9 and the proof of (1.1) in Theorem 1.10. Essentially, the lemmas demonstrate that any binary tree can be divided into disjoint subtrees, each having roughly the same number of leaves.

Lemma 2.4

For an integer $n \ge 2$, let $|S|=n$, let $T \in \mathcal {B}(S)$ be a binary tree with root $0 \in S$, and let a be a real number with $1 < a \le 2(n-1)$. It is possible to remove a single edge from T, so that the number of leaves in the component not containing the root is in the interval $\left[ \tfrac{a}{2} ,\, a \right) $.

Proof

For each vertex $x \in T$, let g(x) be the set of leaves y such that x is on the path between the root and y. For every non-leaf vertex p, there is a natural partition of g(p), namely,

$$\begin{aligned} g(p) = g(c_1) \sqcup g(c_2), \end{aligned}$$

where $c_1,\,c_2 \in \Gamma (p)$ are the neighbours of p which are not on the path from p to the root. Moreover, if v is a leaf then $|g(v)|=1$ (unless v is the root, in which case $g(v)=S$). Therefore, for each tree node $v_i$, there exists a neighbour $v_{i+1}$ such that

$$\begin{aligned} |g(v_i)| > |g(v_{i+1})| \ge \tfrac{1}{2} |g(v_i)|. \end{aligned}$$

(2.1)

Now, construct a path $(v_0,\,v_1, \ldots ,\,v_m)$, such that $v_0$ is the root, $v_1$ is the neighbour of the root and Eq. (2.1) holds for each $i=1,\,2, \ldots ,\,m-1$. This terminates when $v_m$ is a leaf. So the sequence $|g(v_i)|$ decreases from

$$\begin{aligned} |g(v_1)| = n-1 \ge \frac{a}{2} \quad \text{ to } \quad |g(v_m)| = 1 < a, \end{aligned}$$

and, by (2.1), never falls to less than half of its previous value in a single step. Hence, there must be some $1 \le j \le m$ such that

$$\begin{aligned} \frac{a}{2} \le |g(v_j)| < a. \end{aligned}$$

If edge $v_{j-1}v_j$ is deleted, then $g(v_j)$ is precisely the set of leaves in the component containing $v_j$. $\square $
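The walk used in this proof can be sketched directly. The code below is an illustration under assumptions that are not from the paper: trees are adjacency-set dictionaries, the root is the leaf 0, and `caterpillar`, `leaf_counts`, and `cut_edge` are hypothetical helper names.

```python
def caterpillar(n):
    """Adjacency sets of a caterpillar in B(n), n >= 3 (illustrative helper)."""
    edges = [(0, ("u", 1)), (n - 1, ("u", n - 2))]
    edges += [(i, ("u", i)) for i in range(1, n - 1)]
    edges += [(("u", i), ("u", i + 1)) for i in range(1, n - 2)]
    adj = {}
    for p, q in edges:
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)
    return adj


def leaf_counts(adj, root=0):
    """Parent map and |g(v)| for every non-root vertex v."""
    parent, order = {root: None}, [root]
    for v in order:  # breadth-first search from the root
        for u in adj[v]:
            if u not in parent:
                parent[u] = v
                order.append(u)
    size = {}
    for v in reversed(order):  # children are handled before their parent
        kids = [u for u in adj[v] if parent[u] == v]
        size[v] = sum(size[u] for u in kids) if kids else 1
    return parent, size


def cut_edge(adj, a, root=0):
    """Find an edge (v_{j-1}, v_j) with a/2 <= |g(v_j)| < a, as in Lemma 2.4."""
    n = sum(1 for v in adj if len(adj[v]) == 1)
    if not (1 < a <= 2 * (n - 1)):
        raise ValueError("need 1 < a <= 2(n-1)")
    parent, size = leaf_counts(adj, root)
    (v,) = adj[root]  # v_1, the unique neighbour of the root
    while not (a / 2 <= size[v] < a):
        # Step to the child carrying the most leaves, so (2.1) holds.
        v = max((u for u in adj[v] if parent[u] == v), key=size.get)
    return parent[v], v, size[v]


p, v, m = cut_edge(caterpillar(6), a=4)
```

For the caterpillar with 6 leaves and $a = 4$, the walk passes vertices with 5 and 4 leaves below them before stopping at one with 3, which lies in $[2,\,4)$.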

Lemma 2.5 is a direct consequence of Lemma 2.4.

Lemma 2.5

For an integer $n \ge 2$, let $T \in \mathcal {B}(n)$ and let $a>1$. It is possible to find disjoint subtrees of T, such that: each leaf lies in a subtree, each subtree contains less than a leaves and every subtree that does not contain the root contains at least $\frac{a}{2}$ leaves.

Proof

The case $a>n$ is trivial. If $n \ge a$, then we can use Lemma 2.4 iteratively to prune off subtrees with less than a (but at least $\frac{a}{2}$) leaves until the remaining tree has less than a leaves left (this remainder might have less than $\frac{a}{2}$ leaves). This final remaining tree contains the root. $\square $
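The iterative pruning in this proof can be sketched as follows. This is a minimal illustration, not the authors' implementation: the adjacency-set representation, the name `split_into_parts`, and the choice to return only the leaf-label set of each subtree are assumptions.

```python
def split_into_parts(adj, a, root=0):
    """Leaf-label sets of disjoint subtrees as in Lemma 2.5 (requires a > 1)."""
    if a <= 1:
        raise ValueError("need a > 1")
    parent, order = {root: None}, [root]
    for v in order:  # breadth-first search from the root
        for u in adj[v]:
            if u not in parent:
                parent[u] = v
                order.append(u)
    children = {v: [u for u in adj[v] if parent[u] == v] for v in adj}
    # below[v] = labels of the non-root leaves whose path to the root uses v.
    below = {v: ({v} if not children[v] else set()) for v in adj}
    for v in reversed(order):
        if parent[v] is not None:
            below[parent[v]] |= below[v]

    parts = []
    while len(below[root]) + 1 >= a:  # remaining tree still has >= a leaves
        v = children[root][0]
        while not (a / 2 <= len(below[v]) < a):
            v = max(children[v], key=lambda u: len(below[u]))
        part = set(below[v])
        parts.append(part)
        p = parent[v]
        children[p].remove(v)  # prune the subtree hanging from v
        while p is not None:   # update leaf sets of all ancestors
            below[p] -= part
            p = parent[p]
    parts.append(below[root] | {root})  # the remainder contains the root
    return parts


# Caterpillar with 10 leaves: leaf 0 is the root, "u1".."u8" are tree vertices.
n = 10
edges = [(0, "u1"), (n - 1, f"u{n - 2}")]
edges += [(i, f"u{i}") for i in range(1, n - 1)]
edges += [(f"u{i}", f"u{i + 1}") for i in range(1, n - 2)]
adj = {}
for p, q in edges:
    adj.setdefault(p, set()).add(q)
    adj.setdefault(q, set()).add(p)
parts = split_into_parts(adj, a=4)
```

Every returned part has fewer than $a$ leaves, and every part other than the one containing the root has at least $\frac{a}{2}$ leaves, so there are at most $\frac{2n}{a}$ such parts.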

The following lemma is the crux of this section. In the proof, for any set of k binary trees, we find a large set of pairs of labels such that the paths between each pair are disjoint in each of the k trees. The rooted agreement forest which we construct in the proof contains only parts of size 1 or 2.

Lemma 2.6

For $k \ge 2$, and any set $\mathcal {A}= \{ A_1,\,A_2, \ldots ,\,A_k \}$ of trees in $\mathcal {B}(n)$ where $n \ge 3$,

$$\begin{aligned} M_{\text {r}}(\mathcal {A}) < n - \frac{1}{2(k+1)} \left( \frac{n}{k+1} \right) ^{1/k}. \end{aligned}$$

Proof

Since $M_{\text {r}}(\mathcal {A}) \le n-2$ for $n \ge 3$, we may assume, without loss of generality, that $n \ge k+1$. With foresight, we define $t = \big \lfloor \big ( \tfrac{n}{k+1} \big )^{1/k} \big \rfloor $. Divide each tree $A_i$ into t (possibly empty) disjoint connected parts,

$$\begin{aligned}A_i = A_{i,\,1} \sqcup A_{i,\,2} \sqcup \cdots \sqcup A_{i,\,t},\end{aligned}$$

such that each part contains less than $\frac{2n}{t}$ leaves. This is possible by letting $a = \frac{2n}{t}$ in Lemma 2.5. Now, let us define a token to be a map $\tau :[k] \rightarrow [t]$. Thus, a token specifies a part for each tree $A_i$. For any token $\tau $, define

$$\begin{aligned} I(\tau ) := \bigcap _{i=1}^k L\left( A_{i,\,\tau (i)}\right) \qquad \text{ and } \qquad U(\tau ) := \bigcup _{i=1}^k L\left( A_{i,\,\tau (i)}\right) . \end{aligned}$$

For each label x, let $\tau _x$ be the token where, for each $i \in [k]$, we set $\tau _x(i)= j$ when $x \in L(A_{i,\,j})$. Observe that $\tau _x$ is the unique token $\tau $ such that $x \in I(\tau )$. Thus, the non-empty sets $I(\tau )$ partition the label set $[n]_0$. We say that the token $\tau $ is nice if $|I(\tau )| \ge 2$, and we say a pair $\tau ,\,\tau ^\prime $ are friendly if $\tau (i) \not = \tau ^\prime (i)$ for all i. Now, if F is a set of pairwise friendly, nice tokens, then we can construct an agreement forest of $n-|F|$ parts in the following manner:

for each $\tau \in F$, construct a part of size 2 using labels in $I(\tau )$,
all other labels are put in their own part of size 1.

In each tree in $\mathcal {A}$, the paths between labels in each part of this partition must be disjoint because the tokens are friendly. Hence, this is a valid agreement forest. So for any such set F, we have $M_{\text {r}}(\mathcal {A}) \le n-|F|$. Now, let F be any maximal set of pairwise friendly nice tokens (i.e. every nice token that is not in F is unfriendly with a token in F). If the label x is not in $U(\tau )$ for any $\tau \in F$, then $\tau _x(i) \ne \tau (i)$ for all i, so the token $\tau _x$ is friendly with every $\tau \in F$; and so by the maximality of F, $\tau _x$ must not be nice (so $I(\tau _x)=\{x\}$). Thus for each label $x \in [n]_0$ either $x \in I(\sigma )$ for some not-nice token $\sigma $, or $x \in U(\tau )$ for some $\tau \in F$. So, let us count: there are

n labels in total,
at most $t^k$ not-nice tokens $\sigma $, each with at most 1 label in $I(\sigma )$, and
less than ka labels in $U(\tau )$ for each token $\tau \in F$.

Now, $F \ne \emptyset $ since $t^k<n$, so $t^k + ka|F| > n$. Substituting the value of t, we obtain

$$\begin{aligned} |F| > \frac{n-t\,^k}{ka} \ge \frac{n - \left( \tfrac{n}{k+1} \right) }{ka} = \frac{t}{2(k+1)}. \end{aligned}$$

Since the above inequality is strict and $|F|$, $t$, and $k$ are all integers, we can tighten the bound to:

$$\begin{aligned} |F| \ge \frac{t+1}{2(k+1)} > \frac{1}{2(k+1)} \left( \frac{n}{k+1} \right) ^{1/k}. \end{aligned}$$

Finally, since $M_{\text {r}}(\mathcal {A}) \le n-|F|$, we obtain the required inequality. $\square $
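The token bookkeeping in this proof can be sketched as below. The sketch covers only the combinatorial step: it assumes the parts $A_{i,\,j}$ have already been produced (for example via Lemma 2.5) and are supplied as one dictionary per tree. The names `friendly` and `forest_from_parts` and the input format are hypothetical, and a greedy scan is used as one simple way to obtain a maximal set F.

```python
from collections import defaultdict


def friendly(t1, t2):
    """Two tokens are friendly if they differ in every coordinate."""
    return all(p != q for p, q in zip(t1, t2))


def forest_from_parts(part_of):
    """part_of[i][x] is the index j of the part A_{i,j} containing label x.

    Returns a maximal set F of pairwise friendly nice tokens and the
    partition with one pair per token in F and singletons elsewhere.
    """
    labels = sorted(part_of[0])
    I = defaultdict(list)
    for x in labels:
        I[tuple(p[x] for p in part_of)].append(x)  # tau_x is the key

    F = []
    for tau in sorted(I):  # greedy scan yields a maximal friendly set
        if len(I[tau]) >= 2 and all(friendly(tau, t) for t in F):
            F.append(tau)

    pairs = [set(I[tau][:2]) for tau in F]
    used = set().union(*pairs)
    singles = [{x} for x in labels if x not in used]
    return F, pairs + singles


# Two trees (k = 2) on labels 0..5, each split into parts 0 and 1.
part_of = [
    {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1},
    {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1},
]
F, forest = forest_from_parts(part_of)
```

Here the nice tokens are $(0,0)$ and $(1,1)$, which are friendly, so the resulting partition has $n - |F| = 6 - 2 = 4$ parts.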

Set $k=2$ in Lemma 2.6 and then apply the rooted agreement forest lemma (Lemma 2.2). We obtain

$$\begin{aligned} D_{{\text {r}}SPR}(n) < n - \tfrac{1}{6 \sqrt{3}} \sqrt{n} -1 \end{aligned}$$

(2.2)

for each $n \ge 2$, which is a suitable upper bound in Theorem 1.9.
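For reference, the substitution $k=2$ that leads to (2.2) can be written out step by step. The decimal value of the constant is an approximation added here for orientation.

```latex
\begin{aligned}
M_{\mathrm{r}}(A,B)
  &< n - \frac{1}{2(2+1)} \left( \frac{n}{2+1} \right)^{1/2}
   = n - \frac{1}{6\sqrt{3}} \sqrt{n}, \\
d_{\mathrm{r}SPR}(A,B)
  &= M_{\mathrm{r}}(A,B) - 1
   < n - \frac{1}{6\sqrt{3}} \sqrt{n} - 1,
  \qquad \frac{1}{6\sqrt{3}} \approx 0.096 .
\end{aligned}
```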

3 Upper Bound for the Expectation

In this section, we prove the upper bound in statement (1.1) of Theorem 1.10. The main step is provided by the following lemma.

Lemma 3.1

Let $k \ge 2$ be a fixed integer, and let $\mathcal {T}$ be an arbitrary set of k trees in $\mathcal {B}(n)$ (where $n \ge 3$). Let $\mathcal {A}$ be obtained by performing a random permutation of the non-root leaf labels of each $T \in \mathcal {T}$, where the k permutations are chosen independently and uniformly from $S_{n-1}$. Then,

$$\begin{aligned} \mathbb {E}[M_{\text {r}}(\mathcal {A})] \le n - \Omega \big ( n^{2/(k+1)} \big ) \end{aligned}$$

as $n \rightarrow \infty $.

Proof

Since $M_{\text {r}}(\mathcal {A}) \le n-2$ when $n \ge 3$, we may assume, without loss of generality, that n is sufficiently large where needed. Set $b = \big \lfloor \big ( \frac{n^{k-1}}{k} \big )^{1/(k+1)}\!\big \rfloor $ and note that $b \ge 1$; and set $s = \left\lfloor \frac{n}{2b} \right\rfloor $. For each $T \in \mathcal {T}$, let $T_1,\, T_2, \ldots ,\,T_s$ be disjoint subtrees of T such that the number of leaves in each $T_i$ is exactly b and none of them contain the root. This is possible by Lemma 2.5 (with $a=2b$) and taking subtrees as necessary. Let $A \in \mathcal {A}$ be the tree obtained by the random labelling of T, and let $A_1,\,A_2, \ldots ,\,A_s$ be the subtrees of A corresponding to $T_1,\,T_2, \ldots ,\,T_s$, respectively.
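The parameters b and s can be computed in exact integer arithmetic, which avoids rounding problems when taking roots. The function name `block_parameters` is a hypothetical choice for this sketch; it uses the fact that $b = \lfloor (n^{k-1}/k)^{1/(k+1)} \rfloor$ is the largest integer with $k\,b^{k+1} \le n^{k-1}$.

```python
def block_parameters(n, k):
    """Return (b, s) with b = floor((n^(k-1)/k)^(1/(k+1))) and s = floor(n/(2b))."""
    if k * 1 ** (k + 1) > n ** (k - 1):
        raise ValueError("n is too small: b would be 0")
    b = 1
    while k * (b + 1) ** (k + 1) <= n ** (k - 1):
        b += 1
    return b, n // (2 * b)


b, s = block_parameters(1000, 2)
```

By construction $k\,b^{k+1} \le n^{k-1}$ and $sb \le \frac{n}{2}$, so the s subtrees with b leaves each use at most half of the labels.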

Let $x_1, \ldots ,\,x_j$ be distinct non-root labels (i.e. distinct elements of $\{1,\ldots ,\,n-1\}$). Call the set $\{ x_1, \ldots ,\,x_j \}$ good if for each $A \in \mathcal {A}$, the leaves labelled $x_1, \ldots ,\,x_j$ all lie in the same subtree $A_i$ of A. We are most interested in good pairs, when $j=2$. Observe that if $\{x_1,\,x_2\}$ and $\{x_1,\,x_3\}$ are good pairs, then the set $\{x_1,\,x_2,\,x_3\}$ is good. For each $A \in \mathcal {A}$, we say that a good pair $\{ x_1,\,x_2 \}$ is spoiled in A if there is another good pair in the same subtree $A_i$ of A as $x_1$ and $x_2$. We say that a good pair $\{ x_1,\,x_2 \}$ is spoiled if it is spoiled in A for some $A \in \mathcal {A}$. Note that a good pair $\{x_1,\,x_2\}$ might be spoiled by a good pair $\{x_1,\,x_3\}$ (with $x_2 \not = x_3$) or by a disjoint good pair $\{ x_3,\,x_4 \}$. In the first case, we say that $\{x_1,\,x_2\}$ is triply spoiled, and in the second case we say that it is disjointly spoiled. We call a pair super if it is good and not spoiled.

For distinct non-root labels $x_1$ and $x_2$, we have

$$\begin{aligned} \mathbb {P}\big (\{ x_1,\,x_2 \} \text{ is } \text{ good } \big ) = \left( \frac{sb}{n-1} \times \frac{b-1}{n-2} \right) ^k, \end{aligned}$$

(3.1)

because for any $A \in \mathcal {A}$ the probability that $x_1 \in \bigcup _{i=1}^s L(A_i)$ is $\frac{sb}{n-1}$, and given that $x_1 \in L(A_i)$ the probability that $x_2 \in L(A_i)$ is $\frac{b-1}{n-2}$. Similarly, if $x_1$, $x_2$, and $x_3$ are distinct non-root labels, we have

$$\begin{aligned} \mathbb {P}\big ( \{ x_1,\,x_2,\,x_3 \} \text{ is } \text{ good }\big ) = \left( \frac{sb}{n-1} \times \frac{b-1}{n-2} \times \frac{b-2}{n-3} \right) ^k,\, \end{aligned}$$

because the probability that $\{ x_1,\,x_2\} \subseteq L(A_i)$ for some i is $\frac{sb}{n-1} \times \frac{b-1}{n-2}$, and given that $\{ x_1,\,x_2 \} \subseteq L(A_i)$ the probability that $x_3 \in L(A_i)$ is $\frac{b-2}{n-3}$. Observe that $\frac{b}{n} \le \frac{2}{3}$ for n sufficiently large, and then $\frac{b-2}{n-3} \le \frac{b}{n}$.
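The single-tree factors in these two probabilities can be checked exactly on a small instance by enumerating all labellings. This sketch makes assumptions that are not from the paper: the s subtrees are modelled as consecutive blocks of b leaf positions, and the names `good_probability` and `formula` are hypothetical. For k independently labelled trees, the single-tree value is raised to the power k.

```python
from fractions import Fraction
from itertools import permutations


def good_probability(n, s, b, j):
    """Exact probability, for ONE randomly labelled tree, that the labels
    1..j all land in the same block of b leaf positions (s blocks in total)."""
    m = n - 1  # number of non-root leaf positions
    block = {pos: pos // b for pos in range(s * b)}  # other positions: no block
    hits = total = 0
    for perm in permutations(range(m)):  # perm[i] = position of label i + 1
        total += 1
        blocks = {block.get(perm[i]) for i in range(j)}
        hits += len(blocks) == 1 and None not in blocks
    return Fraction(hits, total)


def formula(n, s, b, j):
    """sb/(n-1) * (b-1)/(n-2) * ... with j factors, as in the text."""
    p = Fraction(s * b, n - 1)
    for i in range(1, j):
        p *= Fraction(b - i, n - 1 - i)
    return p


pair = good_probability(8, s=2, b=3, j=2)
triple = good_probability(8, s=2, b=3, j=3)
```

With $n = 8$, $s = 2$, and $b = 3$, the enumeration runs over $7! = 5040$ labellings and agrees with the product formulas for both pairs and triples.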

Let $x_1$ and $x_2$ be distinct non-root labels. Let $\mathcal {B}$ be the collection of subsets of $\{1,\ldots ,\,n-1\}$ of size b which contain $\{x_1,\,x_2\}$. For each $A \in \mathcal {A}$ and $B \in \mathcal {B}$, let $E_{A,\,B}$ be the event that one of the subtrees $A_i$ of A has $L(A_i)=B$. We shall need to condition on these events to control dependence.

Now, let $A \in \mathcal {A}$ and $B \in \mathcal {B}$. Arguing as for (3.1) we may see that

$$\begin{aligned} \mathbb {P}\big ( \{ x_1,\,x_2 \} \text{ good } \mid E_{A,\,B}\big )= \left( \frac{sb}{n-1} \times \frac{b-1}{n-2} \right) ^{k-1}. \end{aligned}$$

Let $x_3$ and $x_4$ be distinct elements of $B {\setminus } \{x_1,\,x_2\}$. Condition on $E_{A,\,B}$ and on $\{x_1,\,x_2\}$ being good. Then, for each $A^\prime \ne A$ in $\mathcal {A}$, the probability that $x_3 \in \bigcup _{j=1}^s L\big (A_j^\prime \big )$ is $\frac{sb-2}{n-3}$ and given that $x_3 \in L\big (A_j^\prime \big )$, the probability that $x_4 \in L\big (A_j^\prime \big )$ is at most $\frac{b-1}{n-4}$. Therefore,

$$\begin{aligned} \mathbb {P}\big ( \{ x_3,\,x_4 \} \text{ good } \mid \{ x_1,\,x_2 \} \text{ good },\, E_{A,\,B} \big ) \le \left( \frac{sb-2}{n-3} \times \frac{b-1}{n-4} \right) ^{k-1} \le \left( \frac{b}{n} \right) ^{k-1} \end{aligned}$$

(for sufficiently large n). Hence, using a union bound,

$$\begin{aligned}&\mathbb {P}\big ( \{ x_1,\,x_2 \} \text{ good } \text{ and } \text{ disjointly } \text{ spoiled } \text{ in }\,\, A \big )\\&\quad = \sum _{B \in \mathcal {B}} \mathbb {P}\big ( \{ x_1,\,x_2 \} \text{ good } \text{ and } \text{ disjointly } \text{ spoiled } \text{ in }\,\, A \mid E_{A,\,B}\big )\, \mathbb {P}(E_{A,\,B})\\&\quad \le \sum _{B \in \mathcal {B}}\Big ( \mathbb {P}\big ( \{ x_1,\,x_2 \} \text{ good } \mid E_{A,\,B}\big )\\&\quad \qquad \times \sum _{x_3 \ne x_4 \in B\backslash \{x_1,\,x_2\}}\!\!\!\mathbb {P}\big ( \{ x_3,\,x_4 \} \text{ good } \mid \{ x_1,\,x_2 \} \text{ good },\, E_{A,\,B} \big )\, \mathbb {P}(E_{A,\,B})\Big )\\&\quad \le \sum _{B \in \mathcal {B}} \left( \frac{sb}{n-1} \times \frac{b-1}{n-2} \right) ^{k-1} \left( {\begin{array}{c}b\!-\!2\\ 2\end{array}}\right) \left( \frac{b}{n}\right) ^{k-1} \mathbb {P}(E_{A,\,B})\\&\quad = \mathbb {P}\big ( \{ x_1,\,x_2 \} \text{ good } \big ) \, \left( {\begin{array}{c}b\!-\!2\\ 2\end{array}}\right) \left( \frac{b}{n}\right) ^{k-1}, \end{aligned}$$

by (3.1) and using $\sum _{B \in \mathcal {B}} \mathbb {P}(E_{A,\,B}) = \frac{sb}{n-1} \times \frac{b-1}{n-2}$. The sum above involving $x_3,\,x_4$ is over unordered pairs (Figs. 6, 7).

Similarly, we may see that for each $x_3 \in B\backslash \{x_1,\,x_2\},$

$$\begin{aligned} \mathbb {P}\big ( \{x_1,\,x_2,\, x_3 \} \text{ good } \mid \{ x_1,\,x_2 \} \text{ good },\, E_{A,\,B} \big ) \le \left( \frac{b}{n}\right) ^{k-1}, \end{aligned}$$

and

$$\begin{aligned}&\mathbb {P}\big ( \{ x_1,\,x_2 \} \text{ good } \text{ and } \text{ triply } \text{ spoiled } \text{ in }\,\, A \big )\\&\quad = \sum _{B \in \mathcal {B}} \mathbb {P}\big ( \{ x_1,\,x_2 \} \text{ good } \text{ and } \text{ triply } \text{ spoiled } \text{ in } \,\,A \mid E_{A,\,B}\big )\, \mathbb {P}(E_{A,\,B})\\&\quad \le \sum _{B \in \mathcal {B}}\Big ( \mathbb {P}\big ( \{ x_1,\,x_2 \} \text{ good } \mid E_{A,\,B}\big )\\&\quad \qquad \times \sum _{x_3 \in B\backslash \{x_1,\,x_2\}}\!\!\!\mathbb {P}\big ( \{x_1,\,x_2,\, x_3 \} \text{ good } \mid \{ x_1,\,x_2 \} \text{ good },\, E_{A,\,B} \big )\, \mathbb {P}(E_{A,\,B})\Big )\\&\quad \le \sum _{B \in \mathcal {B}} \left( \frac{sb}{n-1} \times \frac{b-1}{n-2} \right) ^{k-1} (b-2) \left( \frac{b}{n}\right) ^{k-1} \mathbb {P}(E_{A,\,B})\\&\quad = \mathbb {P}\big ( \{ x_1,\,x_2 \} \text{ good } \big ) \, (b-2) \left( \frac{b}{n}\right) ^{k-1}. \end{aligned}$$

Since $\left( {\begin{array}{c}b-2\\ 2\end{array}}\right) +(b-2)= \left( {\begin{array}{c}b-1\\ 2\end{array}}\right) $, using a union bound, we see that

$$\begin{aligned} \mathbb {P}\big ( \{ x_1,\,x_2 \} \text{ good } \text{ and } \text{ spoiled } \big ) \le \mathbb {P}\big ( \{ x_1,\,x_2 \} \text{ good } \big ) \, k \left( {\begin{array}{c}b\!-\!1\\ 2\end{array}}\right) \left( \frac{b}{n}\right) ^{k-1}. \end{aligned}$$

But by our choice of b,

$$\begin{aligned} k \left( {\begin{array}{c}b\!-\!1\\ 2\end{array}}\right) \left( \frac{b}{n}\right) ^{k-1} \le \; \frac{1}{2} \, k \frac{b^{k+1}}{n^{k-1}} \; \le \; \frac{1}{2}, \end{aligned}$$

and so

$$\begin{aligned} \mathbb {P}( \{ x_1,\,x_2 \} \text{ super } ) \ge \frac{1}{2} \, \mathbb {P}\big ( \{ x_1,\,x_2 \} \text{ good } \big ) = \Omega \big ( n^{-2k/(k+1)}\big ). \end{aligned}$$

Therefore, if S is the random number of super pairs, then

$$\begin{aligned} \mathbb {E}[S] = \left( {\begin{array}{c}n-1\\ 2\end{array}}\right) \mathbb {P}\big ( \{ x_1,\,x_2 \} \text{ super } \big ) = \Omega \big ( n^{2/(k+1)}\big ). \end{aligned}$$

Now, let us partition the labels such that each super pair forms a part of size two (super pairs are disjoint by definition, see Fig. 8) and all other labels are in individual parts. To prove that this partition is a valid agreement forest, it suffices to show that for any distinct super pairs ${\varvec{x}}$ and ${\varvec{y}}$, and each $A \in \mathcal {A}$, the paths $A|{\varvec{x}}$ and $A|{\varvec{y}}$ are disjoint. However, $A|{\varvec{x}}$ is contained within subtree $A_i$ and $A|{\varvec{y}}$ is contained within subtree $A_j$ for some $i \not = j$; and since $A_i$ and $A_j$ are disjoint, the partition must be a valid rooted agreement forest. It has $n-S$ components, so $M_{\text {r}}(\mathcal {A}) \le n-S$ and hence

$$\begin{aligned} \mathbb {E}[M_{\text {r}}(\mathcal {A})] \le n-\mathbb {E}[S] = n - \Omega \big ( n^{2/(k+1)} \big ),\end{aligned}$$

as required. $\square $

We shall deduce the required upper bound in statement (1.1) in Theorem 1.10 from Lemma 3.1. Initially, we phrase the argument generally, so we can reuse it later in the proof of the lower bound (Figs. 9, 10).

Fix $\chi \in \{ r\mathcal {SPR},\, \mathcal {SPR},\, \mathcal {TBR}\}$. Let $T_1,\,T_2 \in \mathcal {B}(n)$, and let $\pi _1,\,\pi _2$ be independently and uniformly distributed over $S_{n-1}$. If we apply the same permutation $\sigma $ to the non-root leaf labels of two trees, then the $\chi $-distance remains unchanged. Taking $\sigma $ as $\pi _1^{-1},$ we see that

$$\begin{aligned} d_\chi \left( T_1^{\pi _1},\,T_2^{\pi _2}\right) = d_\chi \left( T_1,\,\left( T_2^{\pi _2}\right) ^{\pi _1^{-1}}\right) = d_\chi \left( T_1,\,T_2^\pi \right) , \end{aligned}$$

(3.2)

where $\pi =\pi _2 \circ \pi _1^{-1}$ (with $\pi _2$ acting first). Further, $\pi $ is uniformly distributed over $S_{n-1}$. To see this, let $\tau \in S_{n-1}$ be fixed. Then by conditioning on $\pi _1$, using the independence of $\pi _1$ and $\pi _2$, and then the uniformity of $\pi _2$, we find that

$$\begin{aligned} \mathbb {P}(\pi =\tau )&= \mathbb {P}(\pi _2=\tau \circ \pi _1)\\&= \sum _{\tau \,^\prime } \mathbb {P}(\pi _2= \tau \circ \tau \,^\prime | \pi _1=\tau \,^\prime ) \, \mathbb {P}(\pi _1=\tau \,^\prime )\\&= \sum _{\tau \,^\prime } \mathbb {P}(\pi _2= \tau \circ \tau \,^\prime ) \, \mathbb {P}(\pi _1=\tau \,^\prime )\\&= \tfrac{1}{(n-1)!}, \end{aligned}$$

so indeed $\pi $ is uniformly distributed over $S_{n-1}$.
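The uniformity argument above can be checked by brute force for small m. The following is a minimal sketch, not part of the original argument: the helper names (`compose`, `inverse`, `composite_counts`) are illustrative, and the composition order used in `compose` is an assumption (the count comes out the same with the opposite convention).

```python
from collections import Counter
from itertools import permutations


def compose(f, g):
    """Return the permutation obtained by applying g first and then f."""
    return tuple(f[g[i]] for i in range(len(g)))


def inverse(f):
    """Return the inverse of the permutation f (given as a tuple of images)."""
    inv = [0] * len(f)
    for i, image in enumerate(f):
        inv[image] = i
    return tuple(inv)


def composite_counts(m):
    """Count how often each composite of pi2 with the inverse of pi1 occurs
    as (pi1, pi2) ranges over all ordered pairs of permutations of m points."""
    perms = list(permutations(range(m)))
    return Counter(compose(p2, inverse(p1)) for p1 in perms for p2 in perms)


counts = composite_counts(3)
# Each of the 3! = 6 permutations arises from exactly 6 of the 36 pairs.
print(sorted(counts.values()))  # [6, 6, 6, 6, 6, 6]
```

Since every value of the composite is hit equally often, sampling the two permutations independently and uniformly gives a uniform composite, as in the displayed calculation.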

By Lemma 3.1 with $k=2$ and the rooted agreement forest lemma (Lemma 2.2), we obtain

$$\begin{aligned} \mathbb {E}\left[ d_{{\text {r}}SPR}\left( T_1^{\pi _1},\, T_2^{\pi _2}\right) \right] \le n - \Omega \big ( n^{2/3} \big ). \end{aligned}$$

But by (3.2) above,

$$\begin{aligned} \mathbb {E}\left[ d_{{\text {r}}SPR}\left( T_1^{\pi _1},\, T_2^{\pi _2}\right) \right] = \mathbb {E}\left[ d_{{\text {r}}SPR}\left( T_1,\, T_2^{\pi }\right) \right] , \end{aligned}$$

where $\pi $ is uniformly distributed over $S_{n-1}$; and so we have

$$\begin{aligned} \mathbb {E}\left[ d_{{\text {r}}SPR}\left( T_1,\, T_2^{\pi }\right) \right] \le n - \Omega \big ( n^{2/3} \big ), \end{aligned}$$

(3.3)

which proves the upper bound in statement (1.1) of Theorem 1.10.

4 Lower Bound for the Radius

The lower bound in Theorem 1.9 is established in Lemma 4.5 at the end of this section. The proofs in this section use a notion similar to the rooted agreement forest from the previous section; Definition 4.1 and Lemma 4.2 below correspond, respectively, to Definition 2.1 and Lemma 2.2, for TBRs instead of rSPRs.

Definition 4.1

Let $\mathcal {A}\subset \mathcal {B}(n)$ be a set of $k \ge 2$ trees. An unrooted agreement forest of $\mathcal {A}$ is a partition $L_1, \ldots ,\, L_m$ of the label set $[n]_0 = \{ 0,\,1,\,2, \ldots ,\,n-1 \}$, such that for all $A,\,B \in \mathcal {A}$:

The subtrees $A|L_1,\, A|L_2, \ldots ,\, A|L_m$ are pairwise disjoint, and
$A/L_j \equiv B/L_j$ for $j = 1, \ldots ,\, m$.

Let $M_u(\mathcal {A})$ be the minimal possible value for m, the minimal number of parts in an unrooted agreement forest of $\mathcal {A}$ . When $k=2$, we write $M_u(A,\,B)$ for $M_u(\{ A,\,B \})$.

Any rooted agreement forest is also an unrooted agreement forest, so we always have $M_u(\mathcal {A}) \le M_{\text {r}}(\mathcal {A})$. The next result (the Unrooted Agreement Forest Lemma) appears in [2].

Lemma 4.2

(Unrooted Agreement Forest Lemma). If A and B are any two trees in $\mathcal {B}(n)$, then,

$$\begin{aligned} M_u(A,\,B) = d_{TBR}(A,\,B) + 1. \end{aligned}$$

We now present two preliminary lemmas, which we shall use in the proof of the main result of this section, Lemma 4.5. The first shows that by choosing k permutations, we may ensure that each pair of labels is spread well apart by at least one of the permutations.

Lemma 4.3

Let n, b, and k be integers such that $b,\, k \ge 2$ and $n \ge b^k$. Then there exist k permutations $\{ \phi _j :[n]_0 \rightarrow [n]_0 \}_{j=1}^k$ such that for any distinct $x,\,y \in [n]_0$,

$$\begin{aligned} \max _i \, |\phi _i(x)-\phi _i(y)| > b^{k-1} - 2b^{k-3}. \end{aligned}$$

(4.1)

Note that (4.1) gives $ \max _i \, |\phi _i(x)-\phi _i(y)| \ge b$ when $k=2$, and $\max _i | \phi _i(x) -\phi _i(y)|\ge b^2-1$ when $k=3$.

Table 1 For $n=8$, $k=3$, and $b=2$, the table shows $\phi _1$, $\phi _2$, and $\phi _3$ (listing the elements of $[8]_0$ in order for each $\phi _i$)


Proof

We may represent each $x \in [n]_0$ uniquely by a k-tuple $(x_1,\,x_2, \ldots ,\,x_k)$ of non-negative integers such that

$x = x_1b^{k-1} + x_2b^{k-2} + \cdots + x_k$, and
$0 \le x_i < b$ for all $1 < i \le k$.

If $x < b^k$, then this exactly represents x in base b. For larger x, we allow the coefficient of $b^{k-1}$ to exceed $b-1$. Let $\phi _1$ be the identity permutation, and for each $2 \le i \le k$ define $\phi _i$ so that

$$\begin{aligned} \phi _i(x)> \phi _i(y) \quad \text{ if } \text{ and } \text{ only } \text{ if } \quad \left\{ \begin{array}{ll} \text{ either }, &{}\quad x_i<y_i, \\ \text{ or }, &{}\quad x_i = y_i \;\; \text{ and } \;\; x>y. \end{array}\right. \end{aligned}$$

See Table 1 and Fig. 11 for illustrations. Given $x,\, y \in [n]_0$, let $d_i = x_i - y_i$ for each i. Now, let $2 \le i \le k$ and suppose that $\phi _i(x)>\phi _i(y)$ (so $d_i \le 0$). We claim that

$$\begin{aligned} \phi _i(x) - \phi _i(y) \ge -d_ib^{k-1} + \sum _{j=1}^{i-1} d_jb^{k-j-1} + \sum _{j=i+1}^k d_jb^{k-j}. \end{aligned}$$

(4.2)

We may establish this claim as follows. Define the function $f_i$ on $[n]_0$ by setting

$$\begin{aligned} f_i(z) = \sum _{j=1}^{i-1} z_j b^{k-j-1} + \sum _{j=i+1}^k z_j b^{k-j}. \end{aligned}$$

Thus, the claim (4.2) says that

$$\begin{aligned} \phi _i(x) - \phi _i(y) \ge -d_i b^{k-1} + f_i(x-y) = -d_i b^{k-1} + f_i(x)-f_i(y). \end{aligned}$$

Suppose first that $d_i=0$, and say $x_i=y_i =t$. Let $S=\{z \in [n]_0 :z_i = t \}$. Restricted to S, $f_i$ is a monotonic bijection onto the interval of integers $\{0,\,1,\ldots ,\, |S|-1\}$. Thus,

$$\begin{aligned} \phi _i(x) - \phi _i(y) = |\{z \in S :x >z \ge y\}| = f_i(x) - f_i(y). \end{aligned}$$

Now, suppose that $x_i< y_i$. Let $A_t= \{ z \in [n]_0:\phi _i(x) > \phi _i(z) \ge \phi _i(y) \text{ and } z_i=t \}$ and let $a_t=|A_t|$. Thus, $\phi _i(x) - \phi _i(y) = \sum _{t=x_i}^{y_i} a_t$. For any integer t with $x_i+1 \le t \le y_i-1$ we have $a_t \ge b^{k-1}$ (with equality if $n=b^k$), so the total contribution from such t is at least $(y_i-x_i-1)b^{k-1}$. Now consider $t=x_i$. Each $z \in A_{x_i}$ satisfies $\phi _i(z)>\phi _i(y)$, so $A_{x_i}= \{ z \in [n]_0:\phi _i(x) > \phi _i(z) \text{ and } z_i=x_i \}$; and thus $a_{x_i} = f_i(x)$. Finally, consider $t=y_i$. Each $z \in A_{y_i}$ satisfies $\phi _i(z)<\phi _i(x)$, so $A_{y_i}= \{ z \in [n]_0:\phi _i(z) \ge \phi _i(y) \text{ and } z_i=y_i\}$, and thus $a_{y_i} = f_i(n\!-\!1) - f_i(y)+1$. Thus,

$$\begin{aligned} \phi _i(x) - \phi _i(y) = \sum _{t=x_i}^{y_i} a_t \ge (y_i-x_i-1)b^{k-1} + f_i(n\!-\!1)+1 +f_i(x) -f_i(y) \end{aligned}$$

(with equality if $n=b^k$). Now suppose that $n=b^k$. Then, $f_i(n-1)+1=b^{k-1}$ and we have

$$\begin{aligned} \phi _i(x) - \phi _i(y) = (y_i-x_i)b^{k-1} +f_i(x) -f_i(y), \end{aligned}$$

so (4.2) holds with equality. But $f_i$ is an increasing function, so (4.2) holds for all $n \ge b^k$, as required to establish the claim.

Now, we can prove the inequality (4.1). Fix distinct $x,\,y \in [n]_0$. Suppose first that $|x_i - y_i| > 1$ for some $i \in \{1,\ldots ,\,k\}$. All $z \in [n]_0$ such that $z_i= \min \{x_i,\,y_i\}+1$ must have $\phi _i(z)$ strictly between $\phi _i(x)$ and $\phi _i(y)$. But there are at least $b^{k-1}$ such values z, so (4.1) holds.

Thus, we may assume from now on that $|x_i - y_i| \le 1$ for each i. Also, we may assume, without loss of generality, that $x_1 \ge y_1$. We now treat three cases separately:

(1) $x_1=y_1$.
(2) $x_1=y_1+1$ and $x_i \ge y_i$ for all i.
(3) $x_1=y_1+1$ and $x_i<y_i$ for some $i \ge 2$.

Case 1. We may assume, without loss of generality that $x < y$. Then there is an $i \ge 2$ such that $d_i =-1$ and $d_j=0$ for each $1 \le j <i$. Substituting this into Eq. (4.2) yields
$$\begin{aligned} \phi _i(x)-\phi _i(y) \; \ge \; b^{k-1} - \sum _{j=0}^{k-i-1} b^j \; \ge \; b^{k-1} - \sum _{j=0}^{k-3} b^j \; > \; b^{k-1} - 2 b^{k-3}. \end{aligned}$$
Case 2. Since $d_1=1$ and $d_j \ge 0$ for all j, we have $\phi _1(x) - \phi _1(y) = x-y \ge b^{k-1}$.
Case 3. We have $d_1=1$, and there is some $i \ge 2$ such that $d_i=-1$ and $d_j \ge 0$ for all $j<i$. Substituting this into Eq. (4.2) yields
$$\begin{aligned} \phi _i(x)-\phi _i(y) \; \ge \; b^{k-1} + b^{k-2} - \sum _{j=0}^{k-3} b^j \; > \; b^{k-1}. \end{aligned}$$

Thus the inequality (4.1) holds in each case, which completes the proof of the lemma. $\square $
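To make the construction in this proof concrete, here is a minimal sketch that builds the permutations from the digit representation and checks inequality (4.1) by brute force on small instances. The function names are illustrative assumptions, and the exhaustive check is only a sanity test for small n, not a substitute for the proof.

```python
from fractions import Fraction
from itertools import combinations


def digit(x, i, b, k):
    """Digit x_i (1-indexed) in x = x_1 b^(k-1) + ... + x_k.
    For i >= 2 the digit lies in [0, b); x_1 may exceed b - 1."""
    if i == 1:
        return x // b ** (k - 1)
    return (x // b ** (k - i)) % b


def spread_permutations(n, b, k):
    """Return [phi_1, ..., phi_k], where phi[x] is the image of x."""
    phis = [list(range(n))]  # phi_1 is the identity
    for i in range(2, k + 1):
        # phi_i(x) > phi_i(y) iff x_i < y_i, or x_i = y_i and x > y:
        # a larger i-th digit comes first, ties are broken by increasing x.
        order = sorted(range(n), key=lambda x: (-digit(x, i, b, k), x))
        phi = [0] * n
        for rank, x in enumerate(order):
            phi[x] = rank
        phis.append(phi)
    return phis


def min_spread(n, b, k):
    """Minimum over pairs x != y of max_i |phi_i(x) - phi_i(y)|."""
    phis = spread_permutations(n, b, k)
    return min(max(abs(p[x] - p[y]) for p in phis)
               for x, y in combinations(range(n), 2))


def bound(b, k):
    """Right-hand side of (4.1), kept exact (it is fractional when k = 2)."""
    return Fraction(b) ** (k - 1) - 2 * Fraction(b) ** (k - 3)


for n, b, k in [(8, 2, 3), (9, 3, 2), (30, 3, 3)]:
    print(n, b, k, min_spread(n, b, k) > bound(b, k))
```

Each printed line should end in `True` whenever $n \ge b^k$, in line with the lemma.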

The second preliminary lemma gives a property of a set $\mathcal {A}$ of trees which ensures that $M_u(\mathcal {A})$ is large. It will be used in the proofs of both Lemmas 4.5 and 5.1.

Lemma 4.4

Let n, k, and t be positive integers, and let $\mathcal {A}\subseteq \mathcal {B}(n)$ be a set of k trees. For each $A \in \mathcal {A}$, let $A = A_1 \sqcup A_2 \sqcup A_3 \sqcup \cdots $ be a partition of the vertices of A into at most t connected parts. Let $z \ge 0$, and suppose that for all but at most z pairs of distinct labels $x,\,y \in [n]_0$ there is at least one $A \in \mathcal {A}$ such that the leaves labelled with x and y are in different parts of A. Then,

$$\begin{aligned} M_u(\mathcal {A}) \ge n-k(t-1) -z. \end{aligned}$$

Proof

Let $\{ L_1,\,L_2, \ldots ,\,L_m \}$ be an unrooted agreement forest of $\mathcal {A}$. Let $\tilde{A}$ be a tree in $\mathcal {A}$, and let $F = \tilde{A}/L_1 \sqcup \tilde{A}/L_2 \sqcup \cdots \sqcup \tilde{A}/L_m$ be the corresponding forest with leaves labelled by $[n]_0$ (by Definition 4.1 it will not matter which tree in $\mathcal {A}$ we choose). Each tree $A \in \mathcal {A}$ can be divided into the given forest $ A = A_1 \sqcup A_2 \sqcup A_3 \sqcup \cdots $ by deleting at most $t-1$ edges. Let these edges be called hot edges. There is a total of at most $k(t-1)$ hot edges (at most $t-1$ in each $A \in \mathcal {A}$). An edge in A might not correspond to any edge in F; moreover, several edges of A may correspond to the same edge in F. In any case, each edge corresponds to at most one edge in F; and in particular this holds for hot edges. Now form the forest $F'$ by deleting from F all the edges in F corresponding to hot edges. Then we have deleted at most $k(t-1)$ edges (Fig. 12).

Consider two distinct leaves, labelled x and y, which are in the same component of $F'$. The unique path in $F'$ between these leaves must be contained in $\tilde{A}/L_j$ for some j. Since no edge corresponding to a hot edge was deleted from this path, the leaves labelled x and y must be in the same component of each tree $A \in \mathcal {A}$. Thus, the number p of pairs of distinct leaves which are in the same component of $F'$ satisfies $p \le z$. Suppose that $F'$ has j components, and pick a leaf in each component. The remaining $n-j$ leaves each contribute at least 1 to p, so $p \ge n-j$ and $j \ge n-p \ge n - z$; that is, $F'$ has at least $n-z$ components. Therefore, F must have had $m \ge n-z-k(t-1)$ components. $\square $

Lemma 4.5

Let $k \ge 2$ and $n \ge (3k)^k$ be integers. If $\mathcal {T}= \{ T_1,\,T_2, \ldots ,\,T_k \}$ is a set of binary trees with n leaves, then there exists $\mathcal {A}= \{ A_1,\,A_2, \ldots ,\,A_k \} \subseteq \mathcal {B}(n)$ such that $A_i$ is obtained by labelling the leaves of $T_i$ (for each i), and

$$\begin{aligned} M_u(\mathcal {A}) > n - 3k\, n^{1/k}. \end{aligned}$$

Proof

First set

$$\begin{aligned} b= \left\lfloor n^{1/k} \right\rfloor \; , \;\; t = \left\lceil \frac{2n}{b\,^{k-1}-2b\,^{k-3}} \right\rceil , \quad \text{ and } \quad a = \frac{2n}{t} \le b^{k-1} - 2b^{k-3}. \end{aligned}$$

Then, divide each tree $T_i$ into at most t connected parts, $T_i = T_{i\,1} \sqcup T_{i\,2} \sqcup T_{i\,3} \sqcup \cdots $ (using Lemma 2.5) so each part contains less than a leaves. We shall label the leaves of each $T_i$ in such a way that, for each pair of distinct labels, there is some i such that the leaves with these labels are in different parts of $T_i$, and then apply Lemma 4.4 (with $z=0$).

Since $n \ge b^k$, by Lemma 4.3, we can find k permutations $\phi _1,\ldots ,\, \phi _k$ of $[n]_0$ such that for any two distinct labels $x,\,y \in [n]_0$ there is some $1 \le i \le k$ such that $|\phi _i(x)-\phi _i(y)| \ge a$. For each tree $T_i$, assign labels to the leaves of $T_i$ in the order given by $\phi _i$, firstly to all the leaves of $T_{i\,1}$, then $T_{i\,2}$, then $T_{i\,3}$, and so on. For any distinct labels x and y, consider a permutation $\phi _i$ in which $|\phi _i(x)-\phi _i(y)| \ge a$. If in the tree $T_i$ some $T_{ij}$ contained both x and y, then it would also have to contain all the labels z with $\phi _i(z)$ between $\phi _i(x)$ and $\phi _i(y)$: but this is impossible since each $T_{ij}$ has less than a leaves. Thus by Lemma 4.4,

$$\begin{aligned} M_u(\mathcal {A}) \ge n - k(t-1). \end{aligned}$$

For constant k, one can easily observe that $t \sim 2 n^{1/k}$ as $n \rightarrow \infty $. For an explicit bound (using $n \ge (3k)^k$), we can show that $t < 3n^{1/k} + 1$, in the following manner. Observe that $b = \left\lfloor n^{1/k} \right\rfloor $ satisfies $b \ge 3k$ and

$$\begin{aligned} b > n^{1/k} \big ( 1 - n^{-1/k} \big ) \ge n^{1/k} \left( 1 - \frac{1}{3k} \right) . \end{aligned}$$

Thus,

$$\begin{aligned} b^{k-1} - 2b^{k-3}&= b^{k-1}\left( 1- \frac{2}{b^2}\right) \\&> n^{(k-1)/k} \left( 1-\frac{1}{3k}\right) ^{k-1} \left( 1-\frac{2}{9k^2}\right) > \frac{2}{3} n^{(k-1)/k}, \end{aligned}$$

where the last inequality holds since

$$\begin{aligned} \left( 1-\frac{1}{3k}\right) ^{k-1} \left( 1-\frac{2}{9k^2}\right)> 1- \frac{k-1}{3k} -\frac{2}{9k^2} > \frac{2}{3}. \end{aligned}$$

Hence, by the definition of t,

$$\begin{aligned} t \le \left\lceil \frac{2n}{\frac{2}{3} n^{(k-1)/k}} \right\rceil < 3n^{1/k}+1. \end{aligned}$$

Finally, we have

$$\begin{aligned} M_u(\mathcal {A}) \ge n - k(t-1) > n - 3k \, n^{1/k}, \end{aligned}$$

as required. $\square $
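The parameter choices in this proof can be evaluated exactly for concrete n and k. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and the inequality $t - 1 < 3n^{1/k}$ is tested in the equivalent integer form $(t-1)^k < 3^k n$ to avoid floating-point roots.

```python
from fractions import Fraction
from math import ceil


def integer_root(n, k):
    """Largest integer b with b**k <= n."""
    b = int(round(n ** (1.0 / k)))
    while b ** k > n:
        b -= 1
    while (b + 1) ** k <= n:
        b += 1
    return b


def lemma_4_5_parameters(n, k):
    """Return (b, t) as chosen at the start of the proof of Lemma 4.5."""
    b = integer_root(n, k)
    # b^(k-1) - 2 b^(k-3), exact even when k = 2 makes it fractional.
    spread = Fraction(b) ** (k - 1) - 2 * Fraction(b) ** (k - 3)
    t = ceil(Fraction(2 * n) / spread)
    return b, t


def bound_holds(n, k):
    """Check t - 1 < 3 n^(1/k) in the exact form (t - 1)^k < 3^k n."""
    _, t = lemma_4_5_parameters(n, k)
    return (t - 1) ** k < 3 ** k * n


print(lemma_4_5_parameters(36, 2))  # (6, 13)
print(bound_holds(36, 2))           # True
```

For $n=36$ and $k=2$ this gives $b=6$ and $t=13$, so $k(t-1)=24 < 36 = 3k\,n^{1/k}$.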

If $\mathcal {A}\,^\prime = \left\{ A_1^\prime , \ldots ,\,A_k^\prime \right\} $ is obtained from $\mathcal {A}$ by a single permutation $\pi $ of the labels (i.e. the leaf labelled x in $A_i$ is labelled $\pi (x)$ in $A_i^\prime $ for all $x \in [n]_0$ and all $i \in [k]$), then $M_u(\mathcal {A}\,^\prime ) = M_u(\mathcal {A})$. Therefore, in the above lemma, we could have constructed $\mathcal {A}$ so that one of the trees has a particular labelling, say $A_1$. Now if we set $k=2$ and apply the unrooted agreement forest lemma (Lemma 4.2), then we find

$$\begin{aligned} R_{TBR}(n) > n - 6 \sqrt{n} - 1 \end{aligned}$$

(4.3)

for all integers $n \ge 36$, and the result is trivially true for $n<36$ since the right hand side is negative. This is a suitable lower bound for Theorem 1.9.

When $k\ge 2$ is fixed and $n \rightarrow \infty $, Lemmas 2.6 and 4.5 show that the maximum value of $M_u(\mathcal {A})$ over any set $\mathcal {A}$ of k trees in $\mathcal {B}(n)$ is $n - \Theta (n^{1/k})$, and the same holds for $M_{\text {r}}(\mathcal {A})$.

We close this section by looking briefly at the general tool Lemma 4.3, again with $k\ge 2$ fixed and $n \rightarrow \infty $. The lemma shows that with k permutations, we can ensure a ‘spread’ of at least $(1+o(1))\, n^{(k-1)/k}$. (Here “spread” refers to the value $\max _i |\phi _i (x)- \phi _i(y)|$.) Let us show that this bound cannot be much improved. Consider first $k=2$. We may assume, without loss of generality, that $\phi _1$ is the identity. Suppose that $\max _i |\phi _i(x)-\phi _i(y)| \ge a$ for each pair $x,\,y$ of distinct labels. By a classical result of Erdős and Szekeres [7] in 1935, the n-sequence corresponding to $\phi _2$ has a monotonic subsequence of length $t=\lceil \sqrt{n}\, \rceil $. By considering the $t-1$ gaps between successive terms in such a subsequence, we may see that $\,(t\!-\!1)/2 \cdot (a\!+\!1) \le n\!-\!1$. This shows that $a \le 2 \sqrt{n} +O(1)$. Thus, we cannot improve the bound in (4.1) by a factor greater than 2.
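The Erdős–Szekeres result quoted above can be verified exhaustively for small n. This is a minimal sketch with hypothetical function names; it uses a simple quadratic dynamic program and is meant only as an illustration of the statement, not of the argument bounding a.

```python
from itertools import permutations
from math import isqrt


def longest_monotone(seq):
    """Length of the longest strictly increasing or strictly decreasing
    subsequence of seq (quadratic-time dynamic programming)."""
    n = len(seq)
    inc = [1] * n  # inc[j]: longest increasing subsequence ending at j
    dec = [1] * n  # dec[j]: longest decreasing subsequence ending at j
    for j in range(n):
        for i in range(j):
            if seq[i] < seq[j]:
                inc[j] = max(inc[j], inc[i] + 1)
            elif seq[i] > seq[j]:
                dec[j] = max(dec[j], dec[i] + 1)
    return max(inc + dec, default=0)


def ceil_sqrt(n):
    """Smallest integer t with t * t >= n, for n >= 0."""
    return isqrt(n - 1) + 1 if n > 0 else 0


def erdos_szekeres_holds(n):
    """Check every permutation of n points for a long monotone subsequence."""
    return all(longest_monotone(p) >= ceil_sqrt(n)
               for p in permutations(range(n)))


print(all(erdos_szekeres_holds(n) for n in range(1, 8)))  # True
```

The bound is tight: the sequence 2, 5, 8, 1, 4, 7, 0, 3, 6 has no monotone subsequence of length 4.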

For general $k \ge 2$, we cannot improve the bound by a factor $\alpha > 4k(k+1)^{(k+1)/k}$. For, if we could, then in the proof of Lemma 4.5 we could take t about $(2/\alpha ) n^{1/k}$, and that would yield

$$\begin{aligned} M_u(\mathcal {A}) \ge n - k(t-1) \ge n - (1+o(1)) \frac{2k}{\alpha } n^{1/k}, \end{aligned}$$

which would contradict Lemma 2.6 for large n.

5 Lower Bound for the Expectation

In Lemma 5.1 below, we give a lower bound on the expected size of an unrooted maximum agreement forest, which asymptotically matches the upper bound in Lemma 3.1 for the rooted case; see (5.3). This will yield the required lower bound on the expected TBR distance in statement (1.1) of Theorem 1.10, and the upper bound (5.2) on the expected size of a maximum agreement subtree.

Lemma 5.1

Let $k \ge 2$ and $n \ge 2$ be integers, and let $\mathcal {T}= \{ T_1,\,T_2, \ldots ,\,T_k \} \subseteq \mathcal {B}(n)$ be a set of k binary trees. For each $i=1,\,2, \ldots ,\,k$, let $A_i$ be the tree $T_i^{\pi _i}$ obtained from $T_i$ by a random relabelling $\pi _i$ of the non-root labels, where the $\pi _i$ are sampled independently and uniformly from $S_{n-1}$. If $\mathcal {A}= \{ A_1,\,A_2, \ldots ,\,A_k \}$ then

$$\begin{aligned} \mathbb {E}[M_u(\mathcal {A})] > n - (2k+\tfrac{1}{2})n^{2/(k+1)}. \end{aligned}$$

Proof

Set $a = n^{\frac{k-1}{k+1}}$ and $t = \left\lceil \frac{2n}{a} \right\rceil $, so $t-1 < \frac{2n}{a} = 2n^{\frac{2}{k+1}}$. Now, using Lemma 2.5, divide each tree $T_i$ into t (possibly empty) connected parts, so that each part has less than a leaves; and let $A_i = A_{i\,1} \sqcup \cdots \sqcup A_{i\,t}$ be the corresponding partition of $A_i$. For distinct $x,\,y \in [n]_0$, notice that

$$\begin{aligned} \mathbb {P}\big ( y \in L(A_{ij}) \,\big |\, x \in L(A_{ij}) \big ) = \frac{|L(A_{ij})|-1}{n-1}< \frac{a-1}{n-1} < \frac{a}{n} = n^{-\frac{2}{k+1}}. \end{aligned}$$

Now, let Z be the set of pairs $\{ x,\,y \}$ of distinct labels such that, for each i, x and y are in the same part $A_{ij}$. Since the relabellings $\pi _i$ are sampled independently, $\mathbb {P}( \{ x,\,y \} \in Z ) < (n^{-\frac{2}{k+1}})^k$ for all distinct $x,\,y \in [n]_0$. By the linearity of expectation,

$$\begin{aligned} \mathbb {E}[|Z|] = \left( {\begin{array}{c}n\\ 2\end{array}}\right) \mathbb {P}( \{ x,\,y \} \in Z)< \left( {\begin{array}{c}n\\ 2\end{array}}\right) n^{-\frac{2k}{k+1}} < \tfrac{1}{2} n^{\frac{2}{k+1}}. \end{aligned}$$

Since $t-1 < 2n^{\frac{2}{k+1}}$, we can apply Lemma 4.4 to get

$$\begin{aligned} \mathbb {E}[M_u(\mathcal {A})]&\ge n - k(t-1) - \mathbb {E}[|Z|] \\&> n - (2k+\tfrac{1}{2})n^{\frac{2}{k+1}}, \end{aligned}$$

as required. $\square $
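As a numerical illustration of how the two error terms in this proof combine, the sketch below evaluates $k(t-1)$, the upper bound on $\mathbb{E}[|Z|]$, and the stated bound $(2k+\tfrac12)n^{2/(k+1)}$ for given n and k. The function name is hypothetical and the computation uses floating point, so it is a rough check rather than an exact one.

```python
from math import ceil, comb


def lemma_5_1_terms(n, k):
    """Return (k(t-1), upper bound on E|Z|, (2k + 1/2) n^(2/(k+1)))."""
    a = n ** ((k - 1) / (k + 1))   # part-size threshold from the proof
    t = ceil(2 * n / a)            # number of parts per tree
    hot_edges = k * (t - 1)
    # Upper bound on E|Z|: binom(n, 2) times n^(-2k/(k+1)).
    expected_bad_pairs = comb(n, 2) * n ** (-2 * k / (k + 1))
    stated_bound = (2 * k + 0.5) * n ** (2 / (k + 1))
    return hot_edges, expected_bad_pairs, stated_bound


hot, bad, stated = lemma_5_1_terms(1000, 2)
print(hot + bad < stated)  # True
```

The first two terms sum to less than the third, which is exactly the final inequality in the proof.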

Setting $k=2$ in Lemma 5.1 gives $\mathbb {E}\big [M_u\big (T_1^{\pi _1},\, T_2^{\pi _2}\big )\big ] > n- \frac{9}{2} n^{2/3} \ge n-5n^{2/3} +1$ for $n \ge 3$. Applying the unrooted agreement forest lemma (Lemma 4.2), we get

$$\begin{aligned} \mathbb {E}\left[ d_{TBR}\left( T_1^{\pi _1},\, T_2^{\pi _2}\right) \right] >n-5n^{2/3} \end{aligned}$$

for all positive integers n. Now, we may complete the proof of the lower bound in statement (1.1) of Theorem 1.10, as in the proof of the upper bound. By (3.2), $d_{TBR}\big (T_1^{\pi _1},\,T_2^{\pi _2}\big ) = d_{TBR}(T_1,\,T_2^\pi ) $ where $\pi $ is uniformly distributed over $S_{n-1}$, and we have

$$\begin{aligned} \mathbb {E}[d_{TBR}(T_1,\, T_2^{\pi })]>n-5n^{2/3}, \end{aligned}$$

(5.1)

which is an explicit version of the required lower bound in (1.1).

Given a list $\mathcal {A}$ of k trees, let $\mathcal {MAST}(\mathcal {A})$ denote the maximum size of an (unrooted) agreement subtree, where we measure size by the number of leaves. Since $M_u(\mathcal {A}) \le n+1-\mathcal {MAST}(\mathcal {A})$, for $\mathcal {A}$ as in Lemma 5.1, we have

$$\begin{aligned} \mathbb {E}[\mathcal {MAST}(\mathcal {A})] \le n+1 - \mathbb {E}[M_u(\mathcal {A})] = \mathcal {O}\big (n^{2/(k+1)}\big ). \end{aligned}$$

(5.2)

We already noted that, for any set $\mathcal {A}$ of $k \ge 2$ trees in $\mathcal {B}(n)$, we have $M_u(\mathcal {A}) \le M_{\text {r}}(\mathcal {A})$. Putting together Lemmas 3.1 and 5.1, we see that, for $\mathcal {A}$ as in these lemmas, we have both

$$\begin{aligned} \mathbb {E}[M_u(\mathcal {A})] = n - \Theta \big (n^{2/(k+1)}\big )\quad \text{ and } \quad \mathbb {E}[M_{\text {r}}(\mathcal {A})] = n - \Theta \big (n^{2/(k+1)}\big ). \end{aligned}$$

(5.3)

6 Concentration Results

In this section, we shall prove statement (1.2) of Theorem 1.10 and deduce Corollary 1.11. Fix $\chi \in \{ TBR,\, SPR,\, rSPR \}$. Let $n \ge 2$ be an integer and let $T_1$ and $T_2$ be fixed trees in $\mathcal {B}(n)$. We investigate the distribution of the random variable $X = d_\chi (T_1,\,T_2^{\pi })$, where $\pi $ is uniformly distributed on $S_{n-1}$. Specifically, we shall show that X is highly concentrated about its mean, using the following result from [12] (Theorem 7.4 applied to Example 7.2).

Lemma 6.1

[12]. Let $S_{m}$ denote the group of all permutations of $\{ 1,\,2, \ldots ,\,m \}$, and let $k>0$ be a constant. Let $f :S_{m} \rightarrow \mathbb {R}$ be a function such that for any $\sigma \in S_{m}$ and any transposition $\tau $, we have

$$\begin{aligned} \left| f(\sigma \circ \tau ) - f(\sigma ) \right| \le k. \end{aligned}$$

Let $\pi $ be chosen uniformly at random from $S_{m}$, let X be the random variable $X = f(\pi )$, and let $\mu $ denote $\mathbb {E}[X]$. Then for any $\alpha >0$, we have

$$\begin{aligned} \mathbb {P}(|X-\mu | \ge \alpha ) \le 2 \exp \left( - \frac{2\alpha ^2}{(m-1)k^2} \right) . \end{aligned}$$

For any tree $T \in \mathcal {B}(n)$, if $T^\prime $ is the result of a transposition of the labels of two non-root leaves of T, then $d_{\chi }(T,\,T^\prime ) \le 2$ because we can obtain $T^\prime $ from T by swapping the two leaves with 2 rSPR moves. Therefore, we can apply Lemma 6.1 to the random variable $X = d_\chi (T_1,\,T_2^{\pi })$ with $k=2$, and $m=n-1$. By statement (1.1) of Theorem 1.10, there are constants $\lambda _1> \lambda _2 > 0$ such that

$$\begin{aligned} n - \lambda _1n^{2/3}< \mu < n - \lambda _2n^{2/3} \end{aligned}$$

for all integers $n \ge 2$. Let $\alpha = \tfrac{1}{2} \lambda _2n^{2/3}$. Then by Lemma 6.1,

$$\begin{aligned} \mathbb {P}\left( |X-\mu | \ge \alpha \right)&\le 2\exp \left( -\frac{2\left( \tfrac{1}{2}\lambda _2n^{2/3}\right) ^2}{4(n-2)} \right) \\&\le 2\exp \left( -\tfrac{1}{8}\lambda _2^2n^{1/3} \right) . \end{aligned}$$

Choose constants $c_1>c_2>0$ and $c_3>0$ by setting $c_1 = \lambda _1+\tfrac{1}{2}\lambda _2$ and $c_2 = \tfrac{1}{2}\lambda _2$ and $c_3 = \tfrac{1}{8}\lambda _2^2$. Then by the above,

$$\begin{aligned} \mathbb {P}\left( n-c_1n^{2/3}< X < n - c_2n^{2/3} \right) \ge 1 - 2\exp \left( -c_3 n^{1/3} \right) . \end{aligned}$$

Finally, recall that $0 \le X \le n-2$ for each $n \ge 2$. Hence, by replacing $c_3$ by $\tfrac{1}{2} c_3$, and choosing $c_1$ larger and $c_2$ smaller if necessary, we may drop the factor 2 in the last term. This completes the proof of statement (1.2) of Theorem 1.10.
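The tail bound used in this section is easy to evaluate numerically. The sketch below is an illustration only: the function names are hypothetical, and the value passed for $\lambda_2$ is a placeholder, since the text does not give an explicit constant.

```python
from math import exp


def transposition_tail_bound(m, k, alpha):
    """Right-hand side of Lemma 6.1: 2 exp(-2 alpha^2 / ((m - 1) k^2))."""
    return 2 * exp(-2 * alpha ** 2 / ((m - 1) * k ** 2))


def distance_tail_bound(n, lam2):
    """Lemma 6.1 with m = n - 1, k = 2 and alpha = lam2 * n^(2/3) / 2.
    Requires n >= 3 so that m - 1 > 0; lam2 is a placeholder constant."""
    alpha = 0.5 * lam2 * n ** (2 / 3)
    return transposition_tail_bound(n - 1, 2, alpha)


def simplified_bound(n, lam2):
    """The simplified form 2 exp(-lam2^2 n^(1/3) / 8) from the text."""
    return 2 * exp(-lam2 ** 2 * n ** (1 / 3) / 8)


n, lam2 = 10_000, 1.0  # lam2 = 1.0 is an assumed value for illustration
print(distance_tail_bound(n, lam2) <= simplified_bound(n, lam2))  # True
```

The first bound is never larger than the simplified one because $n^{4/3}/(n-2) \ge n^{1/3}$.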

It remains to deduce Corollary 1.11. Observe first that, if the random tree T in $\mathcal {B}(n)$ has an exchangeable distribution, and $T_0$ is a fixed tree with $\mathbb {P}(T=T_0)>0$, then conditioned on $T \sim T_0$, T is uniformly distributed on the equivalence class of $T_0$ under $\sim $, and so T has the same distribution as $T_0^{\pi }$ where $\pi $ is uniformly distributed on $S_{n-1}$.

Let A and B be sampled independently in $\mathcal {B}(n)$, where B has an exchangeable distribution. Let $T_1$ be any tree in $\mathcal {B}(n)$ with $\mathbb {P}(A=T_1)>0$, and let $T_2$ be any tree in $\mathcal {B}(n)$ with $\mathbb {P}(B=T_2)>0$. Conditioned on $A=T_1$ and $B \sim T_2$, $d_\chi (A,\,B)$ has the same distribution as $d_{\chi }(T_1,\,T_2^{\pi })$ where $\pi $ is uniformly distributed on $S_{n-1}$. Let I be the interval $\big (n-c_1n^{2/3},\, n-c_2n^{2/3}\big )$ in statement (1.1) of Theorem 1.10. Then by that result,

$$\begin{aligned} \mathbb {E}[d_\chi (A,\,B)\, | \, (A=T_1) \wedge (B \sim T_2)] \; = \; \mathbb {E}[d_{\chi }(T_1,\,T_2^{\pi }) ] \; \in I. \end{aligned}$$

But $ \mathbb {E}[d_\chi (A,\,B)]$ is a weighted average of the terms on the left, so $\mathbb {E}[d_\chi (A,\,B)] \in I$. Similarly, let J be the interval $\big (n-c_1n^{2/3},\, n-c_2n^{2/3}\big )$ in statement (1.2) of Theorem 1.10. Then by that result,

$$\begin{aligned} \mathbb {P}(d_\chi (A,\,B) \in J\, | \, (A=T_1) \wedge (B \sim T_2)) \; = \; \mathbb {P}(d_{\chi }(T_1,\,T_2^{\pi }) \in J) \; \ge \; 1-e^{-c_3n^{1/3}}. \end{aligned}$$

But $ \mathbb {P}(d_\chi (A,\,B) \in J)$ is a weighted average of the terms on the far left, so

$$\begin{aligned} \mathbb {P}(d_\chi (A,\,B) \in J) \ge \; 1-e^{-c_3n^{1/3}}. \end{aligned}$$

This completes the proof of Corollary 1.11.

7 Concluding Remarks

For each move $\chi \in \{ TBR,\, SPR,\, rSPR \}$ we have estimated the diameter $D_{\chi }(n)$ and the radius $R_{\chi }(n)$ of the class $\mathcal {B}(n)$ quite precisely: we saw in Theorem 1.9 that both $D_{\chi }(n)$ and $R_{\chi }(n)$ are $n - \Theta (\sqrt{n})$. It would be interesting to learn about finer behaviour.

Can we separate diameter and radius? Are there constants $0<c_{\chi }<c\,'_{\chi }$ such that
$$\begin{aligned} D_{\chi }(n) > n-(c_{\chi }+o(1)) \sqrt{n} \quad \text{ and } \quad R_{\chi }(n) < n-(c\,'_{\chi }+o(1)) \sqrt{n}? \end{aligned}$$
Can we separate, for example, rSPR diameter and (unrooted) SPR diameter? Are there constants $0<c_{\text {r}}<c_{u}$ such that
$$\begin{aligned} D_{{\text {r}}SPR}(n) > n-(c_{\text {r}}+o(1)) \sqrt{n} \; \text{ and } \; D_{SPR}(n) < n-(c_{u}+o(1)) \sqrt{n}? \end{aligned}$$
What about even finer behaviour? Are there constants $c_{\chi } > 0$ and $c\,^\prime _{\chi } > 0$ such that
$$\begin{aligned} D_{\chi }(n) = n - \big ( c_{\chi } +o(1) \big ) \sqrt{n} \qquad \text{ and } \qquad R_{\chi }(n) = n - \big ( c\,^\prime _{\chi } +o(1) \big ) \sqrt{n} \; ? \end{aligned}$$
If so, echoing the first two questions, is the inequality $c_{\chi } \le c\,^\prime _{\chi }$ strict, and does $c_{\chi }$ or $c\,^\prime _{\chi }$ depend on $\chi $?

There are similar natural questions arising from Theorem 1.10 and Corollary 1.11 on the behaviour of random distances.

Notes

Sometimes, the root of a rooted binary tree is defined to have degree 2. This is equivalent to our definition (see Definition 1.1). Simply add a new leaf, adjacent to the degree-2 root, label it 0 and make it the new root.

References

Aldous, D.: Probability distributions on cladograms. In: Aldous, D., Pemantle, R. (eds.) Random Discrete Structures, IMA Vol. Math. Appl., Vol. 76, 1–18. Springer, New York (1996)
Allen, B., Steel, M.: Subtree transfer operations and their induced metrics on evolutionary trees. Ann. Combin. 5(1), 1–15 (2001)
Bernstein, D.I., Ho, L.S.T., Long, C., Steel, M., St. John, K., Sullivant, S.: Bounds on the expected size of the maximum agreement subtree. SIAM J. Discrete Math. 29(4), 2065–2074 (2015)
Bordewich, M., Semple, C.: On the computational complexity of the rooted subtree prune and regraft distance. Ann. Combin. 8(4), 409–423 (2005)
Bryant, D., McKenzie, A., Steel, M.: The size of a maximum agreement subtree for random binary trees. In: Janowitz, M.F., Lapointe, F.-J., McMorris, F.R., Mirkin, B., Roberts, F.S. (eds.) BioConsensus, Discrete Math. Theoret. Comput. Sci., Vol. 61, 55–65. American Mathematical Society, Providence, RI (2003)
Ding, Y., Grünewald, S., Humphries, P.J.: On agreement forests. J. Combin. Theory Ser. A 118(7), 2059–2065 (2011)
Erdős, P., Szekeres, G.: A combinatorial problem in geometry. Compositio Math. 2, 463–470 (1935)
Hein, J.: Reconstructing evolution of sequences subject to recombination using parsimony. Math. Biosci. 98(2), 185–200 (1990)
Hein, J., Jiang, T., Wang, L., Zhang, K.: On the complexity of comparing evolutionary trees. Discrete Appl. Math. 71(1-3), 153–169 (1996)
Humphries, P.J., Semple, C.: Note on the hybridization number and subtree distance in phylogenetics. Appl. Math. Lett. 22(4), 611–615 (2009)
Martin, D.M., Thatte, B.D.: The maximum agreement subtree problem. Discrete Appl. Math. 161(13-14), 1805–1817 (2013)
McDiarmid, C.: On the method of bounded differences. In: Siemons, J. (ed.) Surveys in Combinatorics, London Mathematical Society Lecture Note Series, Vol. 141, pp. 148–188. Cambridge University Press, Cambridge (1989)
Penny, D., Steel, M.: Distributions of tree comparison metrics—some new results. Systematic Biol. 42(2), 126–141 (1993)
Semple, C., Steel, M.: Phylogenetics. Oxford University Press, Oxford (2003)
Zhu, S., Steel, M.: Does random tree puzzle produce Yule-Harding trees in the many-taxon limit? Math. Biosci. 243(1), 109–116 (2013)

Acknowledgements

We would like to thank Charles Semple and the two anonymous referees for several helpful comments, and Mike Steel for drawing Ref. [6] to our attention.

Author information

Authors and Affiliations

Department of Statistics, University of Oxford, 24-29 St Giles’, Oxford, OX1 3LB, UK
Ross Atkins & Colin McDiarmid

Corresponding author

Correspondence to Colin McDiarmid.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Atkins, R., McDiarmid, C. Extremal Distances for Subtree Transfer Operations in Binary Trees. Ann. Comb. 23, 1–26 (2019). https://doi.org/10.1007/s00026-018-0410-4

Received: 29 January 2015
Published: 13 December 2018
Issue Date: 07 March 2019
DOI: https://doi.org/10.1007/s00026-018-0410-4
