Cluster property
A tree space has the cluster property, if all trees on every shortest path between two trees sharing a cluster C also contain C. This is a desirable property in evolutionary biology applications as trees sharing a cluster or subtree are expected to be closer to each other than to a tree not sharing a cluster with them. This property is also desirable in centroid-based summary methods, where a summary tree minimises a function on distances to trees in the given tree set. For a given sample of trees containing a common subtree, it is expected that their summary tree also contains this subtree. It is therefore desirable to have a tree space that has the cluster property. Related to the cluster property is the idea to split the computation of distances into computing the distance between the subtrees induced by a shared cluster and the remaining tree (Bordewich and Semple 2005).
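The definition above can be made concrete with a small check. The following is a minimal sketch, assuming that a ranked tree is stored as a list of clusters ordered by increasing rank and that a path is a list of such trees; the names `clusters_of` and `path_preserves_cluster` are ours and purely illustrative.

```python
def clusters_of(tree):
    """Return the set of clusters of a ranked tree given as a list of leaf sets."""
    return {frozenset(cluster) for cluster in tree}


def path_preserves_cluster(path, cluster):
    """Return True if every tree on the path contains the given cluster."""
    target = frozenset(cluster)
    return all(target in clusters_of(tree) for tree in path)


# A path of three ranked trees on four leaves; consecutive trees differ by one NNI move.
example_path = [
    [{"a1", "a2"}, {"a1", "a2", "a3"}, {"a1", "a2", "a3", "a4"}],
    [{"a1", "a2"}, {"a1", "a2", "a4"}, {"a1", "a2", "a3", "a4"}],
    [{"a1", "a2"}, {"a3", "a4"}, {"a1", "a2", "a3", "a4"}],
]

print(path_preserves_cluster(example_path, {"a1", "a2"}))        # cluster kept on the whole path
print(path_preserves_cluster(example_path, {"a1", "a2", "a3"}))  # cluster lost after the first move
```

A tree space has the cluster property exactly when this check succeeds for every shortest path between two trees and every cluster shared by its two end points.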
A mathematical motivation for investigating the cluster property in \({\mathrm {RNNI}}\) is its importance in a similar tree space, the nearest neighbour interchange graph (\({\mathrm {NNI}}\)). In the \({\mathrm {NNI}}\) graph, trees have no times and \({\mathrm {NNI}}\) moves are allowed on every edge, while rank moves and length moves are not possible as no times are assigned to internal nodes. Computing the \({\mathrm {NNI}}\) distance between two trees is \({\mathcal {NP}}\)-hard (Dasgupta et al. 2000), and the proof relies on the fact that this tree space does not have the cluster property (Li et al. 1996). In the \({\mathrm {RNNI}}\) graph, however, distances can be computed in polynomial time using \({\textsc {FindPath}}\) (Collienne and Gavryushkin 2021), which preserves common clusters (Lemma 1). The question whether \({\mathrm {RNNI}}\) has the cluster property is hence natural, and will be settled by Theorem 2.
Theorem 2
The \({\mathrm {RNNI}}\) graph has the cluster property.
Proof
We assume to the contrary that there are two ranked trees T and R sharing a cluster C and a shortest path p between these trees where C is not present in every tree. We furthermore assume that there is no pair of trees \(T',R'\) with \(d_{{\mathrm {RNNI}}}(T',R') < d_{{\mathrm {RNNI}}}(T,R)\) that shares a cluster \(C'\) and is connected by a shortest path \(p'\) that does not preserve \(C'\). We hence say that T and R give a minimum counterexample. Because of this minimality assumption on the length of p, the first tree \(T'\) following T on p does not contain C. Since \({\mathrm {NNI}}\) moves only change one cluster, C is the only cluster changed in \(T'\) compared to T, and all nodes with rank below \((C)_T\) induce the same clusters in T and \(T'\) (Fig. 5). We now compare distances \(d_{{\mathrm {RNNI}}}(T,R)\) and \(d_{{\mathrm {RNNI}}}(T',R)\) by using properties of \({\textsc {FindPath}}\).
To this end, let \(i = {\mathrm {rank}}((C)_T)\). We first show that all nodes with rank less than i induce the same clusters in T and R. If this was not the case, then all trees on \({\mathrm {FP}}(R,T)\) and \({\mathrm {FP}}(R,T')\) coincide up to iteration i, in which the cluster considered on \({\mathrm {FP}}(R,T)\) is C. Let \(R'\) denote the tree at this point of the path, meaning that \({\mathrm {FP}}(R,T)\) and \({\mathrm {FP}}(R,T')\) coincide up to this tree \(R'\). Since \({\textsc {FindPath}}\) preserves clusters (Lemma 1), \(R'\) has the cluster C. Furthermore, the tree \(T'\), which does not have the cluster C, is on a shortest path from T to R. This is a contradiction to the minimality assumption on T and R, so we can assume that all clusters induced by nodes with rank less than i coincide in R and T.
We now show that by the minimality assumption on T and R, C is induced by the node of rank i in R. We therefore assume to the contrary that C is induced by a node of rank greater than i in R. Then the first cluster considered on \({\mathrm {FP}}(T,R)\) and \({\mathrm {FP}}(T',R)\) is a cluster D induced by \((R)_i\) that does not intersect C. By the definition of most recent common ancestors, both subtrees rooted in the children of the most recent common ancestor of D must contain elements of D. Therefore, the most recent common ancestor of D has rank greater than \(i+1\) in both T and \(T'\), as D would intersect C if \((D)_T\) or \((D)_{T'}\) had rank i or \(i+1\) (see Fig. 5). If \({\mathrm {rank}}(D)_T > i+2\), then the first move on \({\mathrm {FP}}(T,R)\) and \({\mathrm {FP}}(T',R)\) applies the same changes to clusters in T and \(T'\), resulting in trees \(T_1\) and \(T'_1\), respectively. Since \(T_1\) has the cluster C, but \(T'_1\) does not, this contradicts our assumption that T and R give a minimum counterexample. If \({\mathrm {rank}}(D)_T = i+2\), then the rank of the most recent common ancestor of D decreases from \(i+2\) to i in the first two steps of \({\mathrm {FP}}(T,R)\) and \({\mathrm {FP}}(T',R)\), which result in trees \(T_2\) and \(T'_2\) that are \({\mathrm {NNI}}\) neighbours connected by the same move as T and \(T'\). This again contradicts the minimality assumption on T and R. Hence there can be no such cluster D in R and we can conclude that C is induced by the node of rank i in R.
The first iteration of \({\textsc {FindPath}}\) applied to the pair of trees \((T',R)\) hence considers the cluster C. To construct the cluster C in \(T'\), there is just one \({\mathrm {NNI}}\) move needed, which results in the tree T, as T and \(T'\) are \({\mathrm {NNI}}\) neighbours such that T contains C and \(T'\) does not (Fig. 5). Therefore, T is the first tree following \(T'\) on \({\mathrm {FP}}(T',R)\), resulting in \(|{\mathrm {FP}}(T',R)| = |{\mathrm {FP}}(T',T)| + |{\mathrm {FP}}(T,R)|\) and hence \(d_{{\mathrm {RNNI}}}(T',R) = 1 + d_{{\mathrm {RNNI}}}(T,R)\). From the assumption that \(T'\) is the first tree on a shortest path from T to R we can however infer \(d_{{\mathrm {RNNI}}}(T',R) = d_{{\mathrm {RNNI}}}(T,R) - 1\), which leads to a contradiction. There is hence no shortest path between T and R that does not preserve C, which proves the cluster property for \({\mathrm {RNNI}}\). \(\square \)
The fact that \({\textsc {FindPath}}^+\) computes shortest paths in \({\mathrm {DCT}}_m\) already suggests that shortest paths in \({\mathrm {RNNI}}\) and \({\mathrm {DCT}}_m\) have similar properties. Indeed, the cluster property in \({\mathrm {DCT}}_m\) follows from Theorem 2.
Corollary 1
The graph \({\mathrm {DCT}}_m\) has the cluster property.
Proof
Assume that there is a shortest path between two trees T and R in \({\mathrm {DCT}}_m\) that does not preserve a common cluster. This path corresponds to a path between \(T_r\) and \(R_r\), the extended ranked versions of T and R in \({\mathrm {RNNI}}\), as already discussed in Theorem 1. Since this path in \({\mathrm {RNNI}}\) has the same length as the one between T and R in \({\mathrm {DCT}}_m\), it is a shortest path in \({\mathrm {RNNI}}\) as well, which leads to a contradiction to Theorem 2. \(\square \)
Caterpillar trees
In this subsection we focus on the set of caterpillar trees and establish some properties of shortest paths between those trees in both \({\mathrm {RNNI}}\) and \({\mathrm {DCT}}_m\). In Theorem 3 we will see that, in both \({\mathrm {DCT}}_m\) and \({\mathrm {RNNI}}\), any two caterpillar trees are connected by a shortest path consisting only of caterpillar trees. We say that a set of trees is convex in a tree space, if there is a shortest path between any two trees in this set that stays within the set. The set of caterpillar trees is hence convex in \({\mathrm {RNNI}}\) and \({\mathrm {DCT}}_m\). The \({\mathrm {NNI}}\) space of unranked trees however does not have this property (Gavryushkin et al. 2018). Based on the convexity of the set of caterpillar trees in \({\mathrm {RNNI}}\) we introduce a way to compute distances between caterpillar trees in this space in time \({\mathcal {O}}(n \sqrt{\log n})\) in Corollary 2, and hence with better worst-case time complexity than \({\textsc {FindPath}}\). Whether this complexity can be achieved in \({\mathrm {DCT}}_m\) for pairs of caterpillar trees is an open question.
Theorem 3
The set of caterpillar trees is convex in \({\mathrm {DCT}}_m\).
Proof
Let T and R be two caterpillar trees in \({\mathrm {DCT}}_m\). We prove the theorem by showing that there is a caterpillar tree \(T'\) that is a neighbour of T and closer to R than T, that is, \(d_{{\mathrm {DCT}}_m}(T', R) = d_{{\mathrm {DCT}}_m}(T, R) - 1\). The existence of a shortest path consisting only of caterpillar trees between T and R follows from this property inductively. In the proof of Theorem 1 we have seen that all paths in \({\mathrm {DCT}}_m\) can be transformed to paths in \({\mathrm {RNNI}}\) between the extended ranked versions of trees, and vice versa, such that these two paths are of equal length. It is therefore sufficient to show that for trees \(T_r\) and \(R_r\), the extended ranked versions of T and R, there is a tree \(T'\) that is a neighbour of T with extended ranked version \(T'_r\) such that \({T'_r}^d\) is a caterpillar tree.
Let \(a_k\) be the leaf with parent of highest rank in \(R_r\) that is not at the same position in \(R_r\) as in \(T_r\): \(a_k := {{\,\mathrm{argmax}\,}}_{a_1, \ldots , a_n}\{{\mathrm {rank}}(a_i)_{R_r} \ |\ {\mathrm {rank}}(a_i)_{R_r} \ne {\mathrm {rank}}(a_i)_{T_r}\}\). One can also think of comparing the trees \(T_r\) and \(R_r\) in a top-down approach, starting at the root, and finding the first node that does not induce the same cluster in these two trees. Since all subtrees \({T_r}^d, {T_r}^c, {R_r}^d\) and \({R_r}^c\) are caterpillar trees, this node has a child that is a leaf, which is \(a_k\). Let furthermore \(a_j \in \{a_1, \ldots , a_{m+2}\}\) be the leaf directly ‘above’ \(a_k\) in \(T_r\), i.e. \({\mathrm {rank}}(a_j)_{T_r} = {\mathrm {rank}}(a_k)_{T_r} + 1\). Note that with the definition of \(a_k\) it immediately follows that \(a_j\) is ‘below’ \(a_k\) in \(R_r\) (\({\mathrm {rank}}(a_j)_{R_r} < {\mathrm {rank}}(a_k)_{R_r}\)). If otherwise it was \({\mathrm {rank}}(a_j)_{R_r} > {\mathrm {rank}}(a_k)_{R_r}\), the parent of \(a_j\) would have the same rank in \(T_r\) as in \(R_r\) and \({\mathrm {rank}}(a_j)_{T_r} > {\mathrm {rank}}(a_k)_{T_r}\) would follow, which contradicts our choice of \(a_j\).
Let \(T'_r\) be the tree resulting from \(T_r\) by an \({\mathrm {NNI}}\) move or rank move exchanging the ranks of \((a_k)_{T_r}\) and \((a_j)_{T_r}\). An \({\mathrm {NNI}}\) move is necessary if these two nodes are connected by an edge, otherwise a rank move corresponding to a length move is performed on \(T_r\) to obtain \(T'_r\) (Fig. 6). \({T'_r}^d\) is a caterpillar tree in both cases. We will use properties of shortest paths computed by \({\textsc {FindPath}}\) to show that \(|{\mathrm {FP}}(R_r,T'_r)| = |{\mathrm {FP}}(R_r,T_r)| - 1\).
Since all clusters of \(T_r\) and \(T'_r\) induced by nodes of rank less than \({\mathrm {rank}}(a_k)_{T_r}\) coincide, the paths \({\mathrm {FP}}(R_r,T_r)\) and \({\mathrm {FP}}(R_r,T'_r)\) coincide up to a ranked tree \(R'_r\), which contains all these clusters. We therefore compare only the lengths of \({\mathrm {FP}}(R'_r,T_r)\) and \({\mathrm {FP}}(R'_r,T'_r)\). From \({\mathrm {rank}}(a_j)_{R_r} < {\mathrm {rank}}(a_k)_{R_r}\) we can conclude \({\mathrm {rank}}(a_j)_{R'_r} < {\mathrm {rank}}(a_k)_{R'_r}\), as \(a_j\) and \(a_k\) are not in any of the clusters considered by \({\textsc {FindPath}}\) before \(R'_r\), which means that their parents do not exchange ranks before \(R'_r\). We now consider the move on \({\mathrm {FP}}(R_r,T_r)\) on the tree \(R'_r\), which corresponds to some iteration l in \({\textsc {FindPath}}\). Note that by the choice of \(R'_r\), all clusters with rank less than \({\mathrm {rank}}(a_k)_{T_r}\) coincide between \(R'_r\) and \(T_r\), from which we can conclude \(l = {\mathrm {rank}}(a_k)_{T_r}\).
By our assumptions on \(T_r\) consisting of two caterpillar trees joined at the root, the cluster considered in iteration l is \(S \cup \{a_k\}\), where S is a cluster that is present in all three trees \(T_r, T'_r,\) and \(R_r\). In the following iteration \(l+1 = {\mathrm {rank}}(a_j)_{T_r}\), \(S' \cup \{a_j\}\) is considered for a cluster \(S'\). \(S'\) either equals \(S \cup \{a_k\}\), if \(T_r\) and \(T'_r\) are connected by an \({\mathrm {NNI}}\) move (bottom of Fig. 6), or \(S'\) is a cluster present in \({T_r}^c\), \({T'_r}^c\), and \({R'_r}^c\), if \(T_r\) and \(T'_r\) are connected by a rank move (top of Fig. 6). Decreasing the rank of \((S \cup \{a_k\})_{R'_r}\) takes \({\mathrm {rank}}(S \cup \{a_k\})_{R'_r} - l\) \({\mathrm {RNNI}}\) moves in both cases. Because the rank of \((S' \cup \{a_j\})_{R'_r}\) increases by one when the parents of \(a_k\) and \(a_j\) swap ranks in this iteration, the following iteration for \(S' \cup \{a_j\}\) needs \({\mathrm {rank}}(S' \cup \{a_j\})_{R'_r} + 1 - (l+1)\) \({\mathrm {RNNI}}\) moves. On \({\mathrm {FP}}(R_r,T'_r)\) however, first \({\mathrm {rank}}(S' \cup \{a_j\})_{R'_r} - l\) \({\mathrm {RNNI}}\) moves decrease the rank of \((S' \cup \{a_j\})_{R'_r}\) in \(R'_r\), and then \({\mathrm {rank}}(S \cup \{a_k\})_{R'_r} - (l+1)\) are needed for \(S \cup \{a_k\}\). In total, these two iterations combined result in exactly one extra move on \({\mathrm {FP}}(R_r, T_r)\) compared to \({\mathrm {FP}}(R_r, T'_r)\).
The only difference in the trees after iteration \(l+1\) on the two different paths is the order of ranks of the parents of \(a_j\) and \(a_k\). Since the rest of \(T_r\) and \(T'_r\) coincide, the remaining parts of \({\mathrm {FP}}(R_r, T_r)\) and \({\mathrm {FP}}(R_r, T'_r)\) consist of the same moves. With our previous observation we can conclude \(d_{{\mathrm {RNNI}}}(R_r,T_r) = d_{{\mathrm {RNNI}}}(R_r,T'_r) + 1\), and hence \(T'_r\) is on a shortest path from \(T_r\) to \(R_r\). \(\square \)
Note that with \({\mathrm {RNNI}}= {\mathrm {DCT}}_{n-1}\) it follows that the set of caterpillar trees is convex in \({\mathrm {RNNI}}\). This convexity property implies that the distance between caterpillar trees can be computed more efficiently than by \({\textsc {FindPath}}\). We prove this in the rest of this section. To do so, we first establish that the problem of computing a shortest path consisting only of caterpillar trees can be interpreted in a few different ways.
A problem related to the shortest path problem for caterpillar trees in \({\mathrm {RNNI}}\) is the Token Swapping Problem (Kawahara et al. 2017) on a special class of graphs, so-called lollipop graphs. We will show that a pair of caterpillar trees in \({\mathrm {RNNI}}\) can be translated to an instance of the Token Swapping Problem, such that the \({\mathrm {RNNI}}\) distance between the trees is equal to the number of swaps, as explained in the following. An instance of the token swapping problem is a simple graph where every vertex is assigned a token. Two tokens are allowed to swap positions if they are on vertices that are connected by an edge. Each token is assigned a unique goal vertex, and the aim is to find the minimum number of token swaps for all tokens to reach their goal vertex.
The problem of computing distances between caterpillar trees can be seen as an instance of the token swapping problem on lollipop graphs. A lollipop graph is a graph consisting of a complete graph that is connected to a path by exactly one edge. An instance of the token swapping problem that corresponds to the distance problem for caterpillar trees is described in the following. An example is illustrated in Fig. 7. Let T and R be caterpillar trees with
$$\begin{aligned} {\mathrm {rank}}(a_1)_R= & {} {\mathrm {rank}}(a_2)_R< {\mathrm {rank}}(a_3)_R< \ldots< {\mathrm {rank}}(a_n)_R \text { and}\\ {\mathrm {rank}}(b_1)_T= & {} {\mathrm {rank}}(b_2)_T< {\mathrm {rank}}(b_3)_T< \ldots < {\mathrm {rank}}(b_n)_T \end{aligned}$$
such that \([b_1, \ldots , b_n]\) is a permutation of \([a_1, \ldots , a_n]\). The corresponding instance of the token swapping problem consists of a lollipop graph consisting of a complete graph on three vertices, connected to a path of length \(n-3\) by an edge. The vertex in the complete graph incident to the edge connecting the complete graph with the path is labelled by \(a_3\), the other ones in the complete graph are labelled by \(a_1\) and \(a_2\). The vertices on the path are then labelled inductively, starting at the neighbour of \(a_3\), such that the unique unlabelled neighbour of the last already labelled node with label \(a_{i-1}\) is labelled by \(a_i\). We place the token on vertex \(a_i\) that has \(b_i\) as goal vertex for all \(i \ge 3\). On \(a_1\) and \(a_2\), which represent the cherry of the caterpillar tree, we place the tokens with goal vertices \(b_1\) and \(b_2\) so that if \(a_i = b_j\) for some \(i,j \in \{1,2\}\), the token with goal vertex \(b_j=a_i\) is placed on the node labelled \(a_i=b_j\). Since the only moves between two caterpillar trees in \({\mathrm {RNNI}}\) are \({\mathrm {NNI}}\) moves, which simply swap two leaves, they correspond to swapping two tokens in the above described instance of the token swapping problem.
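The construction of the token swapping instance can be sketched in a few lines. This is only an illustration under our own assumptions: caterpillar trees are given as lists of leaf labels ordered by increasing rank of their parents (the first two entries form the cherry), and the function name `lollipop_instance` as well as the returned data layout are hypothetical.

```python
def lollipop_instance(R, T):
    """Build the lollipop graph and token placement for caterpillar trees R and T.

    R and T are lists of leaf labels ordered by increasing parent rank.
    Returns (edges, goal), where goal[v] is the goal vertex of the token on vertex v.
    """
    n = len(R)
    if n < 3 or sorted(R) != sorted(T):
        raise ValueError("R and T must be caterpillar trees on the same >= 3 leaves")

    # Complete graph on the vertices labelled a_1, a_2, a_3.
    edges = [(R[0], R[1]), (R[0], R[2]), (R[1], R[2])]
    # Path a_3 - a_4 - ... - a_n attached to a_3.
    edges += [(R[k], R[k + 1]) for k in range(2, n - 1)]

    # Tokens of the cherry of T: if one of them already belongs on a_1 or a_2,
    # place it directly on its goal vertex.
    first, second = T[0], T[1]
    if first == R[1] or second == R[0]:
        first, second = second, first
    goal = {R[0]: first, R[1]: second}
    # For i >= 3 the token on a_i has goal vertex b_i.
    goal.update({R[k]: T[k] for k in range(2, n)})
    return edges, goal


R = ["a1", "a2", "a3", "a4", "a5"]
T = ["a4", "a5", "a3", "a2", "a1"]
edges, goal = lollipop_instance(R, T)
print(edges)
print(goal)
```

The sketch only builds the instance; solving it requires a token swapping algorithm such as the one by Kawahara et al. (2017), which is not reproduced here.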
Therefore, the algorithm described by Kawahara et al. (2017) to solve the token swapping problem on lollipop graphs can be used for computing distances between caterpillar trees. It however has worst-case time complexity \({\mathcal {O}}(n^2)\), the same as \({\textsc {FindPath}}\).
In the following we present an algorithm for computing distances between caterpillar trees with better worst-case time complexity, \({\mathcal {O}}(n \sqrt{\log n})\), for \({\mathrm {RNNI}}\) (Corollary 2). To do so, we first establish a formula to express distances between two caterpillar trees in \({\mathrm {RNNI}}\) (Theorem 4). This algorithm can also be used to solve the token swapping problem on lollipop graphs, improving the worst-case running time of the known algorithm (Kawahara et al. 2017).
For improving on the time-complexity of computing distances between caterpillar trees, we use a representation of caterpillar trees as a list of leaves, ordered according to increasing rank of their parents. The caterpillar tree on the left of Fig. 7 for example can be represented as
$$\begin{aligned} {}[a_4,a_5,a_3,a_2,a_1] \text { or }[a_5,a_4,a_3,a_2,a_1].\end{aligned}$$
There are two possible list representations of a caterpillar tree because the first two leaves (\(a_4\) and \(a_5\) in this example) share their parent of rank one. For two given caterpillar trees T and R we call a pair of leaves \((a_i,a_j)\) a transposition in T with respect to R, if the rank of the parent of \(a_i\) is lower than the rank of the parent of \(a_j\) in T, and the opposite is true for R: \({\mathrm {rank}}(a_i)_T < {\mathrm {rank}}(a_j)_T\) and \({\mathrm {rank}}(a_i)_R > {\mathrm {rank}}(a_j)_R\). For two leaves \(a_i\) and \(a_j\) in a caterpillar tree T we say that \(a_i\) is below \(a_j\) and \(a_j\) is above \(a_i\) in T if \({\mathrm {rank}}(a_i)_T < {\mathrm {rank}}(a_j)_T\).
Theorem 4
Let T and R be caterpillar trees in \({\mathrm {RNNI}}\) such that
$$\begin{aligned} 1 = {\mathrm {rank}}(a_1)_R = {\mathrm {rank}}(a_2)_R< {\mathrm {rank}}(a_3)_R< \ldots < {\mathrm {rank}}(a_n)_R = n-1. \end{aligned}$$
Define
$$\begin{aligned} P(T,R)= & {} \{(a_i,a_j)\ |\ {\mathrm {rank}}(a_i)_T< {\mathrm {rank}}(a_j)_T \text { and } {\mathrm {rank}}(a_i)_R> {\mathrm {rank}}(a_j)_R\}, \\ M(T,R)= & {} \{a_i\ |\ \text {for all } l \text { with } {\mathrm {rank}}(a_l)_T \le {\mathrm {rank}}(a_i)_T \text { it is } {\mathrm {rank}}(a_l)_R > {\mathrm {rank}}(a_i)_R\} \\&\cap \{a_i \ |\ {\mathrm {rank}}(a_i)_T < \min \{{\mathrm {rank}}(a_1)_T, {\mathrm {rank}}(a_2)_T\}\}. \end{aligned}$$
Then for \({m(T,R) = |M(T,R)|}\) and \({p(T,R) = |P(T,R)|}\):
$$\begin{aligned} d_{{\mathrm {RNNI}}}(T,R) = p(T,R) - m(T,R). \end{aligned}$$
The set P(T, R) in Theorem 4 is the set of transpositions for the caterpillar tree T with respect to R. M(T, R) contains the leaves \(a_i\) in T for which in the representation of T as a list (i) every leaf that is below \(a_i\) in T (if \(a_i\) is in the cherry, this includes the other cherry leaf) is strictly above \(a_i\) in R and (ii) no cherry leaf of R is below \(a_i\) in T. The caterpillar trees T and R in Fig. 7 for example have \(P(T,R) = \{(a_1,a_3),(a_1,a_4),(a_1,a_5),(a_2,a_3),(a_2,a_4),(a_2,a_5),(a_3,a_4),(a_3,a_5)\}\) and \(M(T,R) = \{a_3,a_4\}\).
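To illustrate the formula of Theorem 4, the following sketch evaluates P(T, R) and M(T, R) directly from their definitions and compares the result with a breadth-first search over caterpillar trees. Several points are our assumptions rather than part of the text: trees are lists of leaves ordered by increasing parent rank with at least three leaves, the index l in the definition of M ranges over leaves other than \(a_i\), and all function names are hypothetical. The search only visits caterpillar trees, which is justified by the convexity of this set in \({\mathrm {RNNI}}\). The direct evaluation takes quadratic time and is not the algorithm of Corollary 2.

```python
from collections import deque
from itertools import permutations


def parent_ranks(tree):
    """Map each leaf to the rank of its parent; the first two leaves form the cherry."""
    return {leaf: max(1, pos) for pos, leaf in enumerate(tree)}


def transpositions(T, R):
    """P(T, R): pairs (x, y) with x below y in T and x above y in R."""
    rT, rR = parent_ranks(T), parent_ranks(R)
    return {(x, y) for x in T for y in T if rT[x] < rT[y] and rR[x] > rR[y]}


def m_set(T, R):
    """M(T, R) as in Theorem 4; R[0] and R[1] are the cherry leaves of R."""
    rT, rR = parent_ranks(T), parent_ranks(R)
    bound = min(rT[R[0]], rT[R[1]])
    return {x for x in T
            if rT[x] < bound
            and all(rR[y] > rR[x] for y in T if y != x and rT[y] <= rT[x])}


def formula_distance(T, R):
    """p(T, R) - m(T, R)."""
    return len(transpositions(T, R)) - len(m_set(T, R))


def bfs_distance(T, R):
    """Length of a shortest path from T to R that uses caterpillar trees only."""
    def canon(tree):
        # The two cherry leaves are unordered.
        return tuple(sorted(tree[:2])) + tuple(tree[2:])

    n = len(T)
    # NNI moves between caterpillar trees swap two leaves with adjacent parents.
    swaps = [(0, 2), (1, 2)] + [(k, k + 1) for k in range(2, n - 1)]
    start, goal = canon(T), canon(R)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            return dist[cur]
        for i, j in swaps:
            nxt = list(cur)
            nxt[i], nxt[j] = nxt[j], nxt[i]
            nxt = canon(nxt)
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    raise ValueError("trees must have the same leaf set")


def formula_matches_bfs(n):
    """Compare both distances for all caterpillar trees on n leaves against a fixed R."""
    R = list(range(1, n + 1))
    return all(formula_distance(list(T), R) == bfs_distance(list(T), R)
               for T in permutations(R))


R = ["a1", "a2", "a3", "a4", "a5"]
T = ["a4", "a5", "a3", "a2", "a1"]  # our reading of the Fig. 7 example
print(sorted(m_set(T, R)))          # ['a3', 'a4']
print(formula_distance(T, R))       # 6
```

For this pair the sketch yields eight transpositions and two leaves in M(T, R), hence distance six. Calling `formula_matches_bfs(4)` compares the formula with the search for every caterpillar tree on four leaves.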
Proof
Let T and R be caterpillar trees in \({\mathrm {RNNI}}\) as described above and let \({\widehat{d}}(T,R) := p(T,R) - m(T,R)\). For proving \({\widehat{d}}(T,R) = d_{{\mathrm {RNNI}}}(T,R)\) it is sufficient to show that there is a caterpillar tree \(T'\) that is a neighbour of T such that \({\widehat{d}}(T',R) = {\widehat{d}}(T,R) - 1\). Then it follows inductively that \({\widehat{d}}(T,R) = d_{{\mathrm {RNNI}}}(T,R)\), because the sets P(T, T) and M(T, T) are empty.
We assume, similar to the proof of Theorem 3, that T and R are caterpillar trees such that \(a_k\) is the leaf with parent of highest rank in R that is not at the same position in T: \(a_k := {{\,\mathrm{argmax}\,}}_{a_1, \ldots , a_n}\{{\mathrm {rank}}(a_i)_{R} \ |\ {\mathrm {rank}}(a_i)_{R} \ne {\mathrm {rank}}(a_i)_{T}\}\). Let \(T'\) be the tree that results from an \({\mathrm {NNI}}\) move on T swapping the leaves \(a_k\) and \(a_i\) with \({\mathrm {rank}}(a_i)_T = {\mathrm {rank}}(a_k)_T + 1\). In the proof of Theorem 3 we saw that \(d_{{\mathrm {RNNI}}}(T',R) = d_{{\mathrm {RNNI}}}(T,R) - 1\). We now prove \({\widehat{d}}(T',R) = {\widehat{d}}(T,R) - 1\).
To this end, we distinguish two cases: (i) \({\mathrm {rank}}(a_k)_T>1\) and (ii) \({\mathrm {rank}}(a_k)_T = 1\), meaning that \(a_k\) is in the cherry of T.
Case (i) By the definition of \(a_k\), \((a_k,a_i)\) is a transposition in the set P(T, R). As \(a_k\) and \(a_i\) are the only leaves whose order changed between T and \(T'\), they form the only transposition that is in T but not in \(T'\) with respect to R. Hence \(p(T',R) = p(T,R) - 1\). Because the definition of \(a_k\) requires all leaves that are above \(a_k\) in R to be at the same position in T, there is no leaf that is below \(a_k\) in T and above it in R. Therefore, it is \(a_k \notin M(T,R)\) and \(a_k \notin M(T',R)\) for the same reason. If \(a_i \in M(T,R)\), it follows \(a_i \in M(T',R)\), as only the relationship between \(a_i\) and \(a_k\) changes and the inequalities required for \(a_i\) to be in M(T, R) are true for \(a_k\). For the same reason, if \(a_i \notin M(T,R)\), it is \(a_i \notin M(T',R)\). We can conclude \(M(T',R) = M(T,R)\) and hence:
$$\begin{aligned} {\widehat{d}}(T',R) = p(T',R) - m(T',R) = p(T,R) - 1 - m(T,R) = {\widehat{d}}(T,R) - 1 \end{aligned}$$
Case (ii) As in the previous case, \((a_k,a_i)\) is a transposition in P(T, R), but not in \(P(T',R)\). There is however another transposition that could be in P(T, R), but not in \(P(T',R)\), that is the pair \((x,a_i)\), where x is the second cherry leaf of T (see Fig. 8). We now distinguish the case (a) that \((x,a_i)\) is not a transposition in T from the case (b) that \((x,a_i)\) is a transposition in T with respect to R.
If \((x,a_i)\) is not a transposition in T, it follows \(p(T',R) = p(T,R) - 1\), as \((a_k,a_i)\) is the only transposition that is in P(T, R), but not in \(P(T',R)\). As in the previous case (i) it also follows \(m(T,R) = m(T',R)\) and we conclude that it is
$$\begin{aligned} {\widehat{d}}(T',R) = p(T',R) - m(T',R) = p(T,R) - 1 - m(T,R) = {\widehat{d}}(T,R) - 1 \end{aligned}$$
If \((x,a_i)\) is a transposition, it is \(p(T',R) = p(T,R) - 2\). To compare m(T, R) with \(m(T',R)\), it is sufficient to consider the membership of \(x,a_i,\) and \(a_k\) in M(T, R) and \(M(T',R)\). The relationships between all other leaves are the same in M(T, R) and \(M(T',R)\), resulting in \(a_j \in M(T,R)\) if and only if \(a_j \in M(T',R)\) for all leaves \(a_j \notin \{a_i, a_k, x\}\). We again know by the definition of \(a_k\) that \(a_k \notin M(T,R)\) and \(a_k \notin M(T',R)\). Since both x and \(a_k\) are below \(a_i\) in T, but above it in R, and neither of x and \(a_k\) is in the cherry of R, it follows \(a_i \in M(T,R)\), and similarly \(a_i \in M(T',R)\). The leaf x is in M(T, R), as there is only one leaf \(a_k\) that fulfils \({\mathrm {rank}}(a_k)_T \le {\mathrm {rank}}(x)_T\), and it also is \({\mathrm {rank}}(a_k)_R > {\mathrm {rank}}(x)_R\). Since \((x,a_i)\) is a transposition in T, it follows \({\mathrm {rank}}(x)_R > {\mathrm {rank}}(a_i)_R\). Together with \({\mathrm {rank}}(a_i)_{T'} = {\mathrm {rank}}(x)_{T'}\) it follows that \(x \notin M(T',R)\). Therefore, it is \(m(T',R) = m(T,R) - 1\) and we can conclude that in total
$$\begin{aligned} {\widehat{d}}(T',R) \!= \!p(T',R) - m(T',R)\! = \!p(T,R) - 2 - (m(T,R)-1) \!=\! {\widehat{d}}(T,R) - 1. \end{aligned}$$
\(\square \)
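The inductive step used in the proofs of Theorems 3 and 4 suggests a simple greedy procedure for caterpillar trees in \({\mathrm {RNNI}}\): repeatedly pick the leaf \(a_k\) of highest rank in R that is not yet at its position, and swap it with the leaf directly above it in T. The following sketch implements this idea under our own assumptions (trees as lists of leaves ordered by increasing parent rank, at least three leaves, hypothetical function names); it is meant as an illustration of the argument, not as an efficient algorithm.

```python
def parent_ranks(tree):
    """Map each leaf to the rank of its parent; the first two leaves form the cherry."""
    return {leaf: max(1, pos) for pos, leaf in enumerate(tree)}


def greedy_caterpillar_path(T, R):
    """Return a list of caterpillar trees from T to R, one NNI move per step."""
    current = list(T)
    rank_R = parent_ranks(R)
    path = [list(current)]
    while True:
        rank_T = parent_ranks(current)
        misplaced = [leaf for leaf in current if rank_T[leaf] != rank_R[leaf]]
        if not misplaced:
            # All parent ranks agree, so the trees are equal
            # (the order of the two cherry leaves is irrelevant).
            return path
        # a_k: the misplaced leaf whose parent has the highest rank in R.
        a_k = max(misplaced, key=lambda leaf: rank_R[leaf])
        i = current.index(a_k)
        # Index of the leaf directly above a_k; for a cherry leaf this is index 2.
        j = max(i, 1) + 1
        current[i], current[j] = current[j], current[i]
        path.append(list(current))


R = ["a1", "a2", "a3", "a4", "a5"]
T = ["a4", "a5", "a3", "a2", "a1"]
path = greedy_caterpillar_path(T, R)
for tree in path:
    print(tree)
print(len(path) - 1)  # number of moves: 6
```

For the two trees above the procedure needs six moves, which agrees with \(\frac{(n-1)(n-2)}{2}\) for \(n = 5\).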
Corollary 2
The distance between two caterpillar trees can be computed in \({\mathcal {O}}(n \sqrt{\log n})\) in \({\mathrm {RNNI}}\).
Proof
By Theorem 4 the distance between two caterpillar trees in \({\mathrm {RNNI}}\) is the number of transpositions between two sequences of length n minus m(T, R) as defined in Theorem 4. The value m(T, R) can be computed in time linear in n for any caterpillar tree T by considering the leaves of the tree ordered according to increasing rank of their parents. The number of transpositions of a sequence of length n (Kendall-tau distance) can be computed in time \({\mathcal {O}}(n \sqrt{\log n})\) (Chan and Pătraşcu 2010). This number is equal to p(T, R), as defined in Theorem 4, when ignoring transpositions for the pairs of leaves sharing a parent in T and R, respectively. The worst-case running time for computing the \({\mathrm {RNNI}}\) distance between caterpillar trees is therefore \({\mathcal {O}}(n \sqrt{\log n})\). \(\square \)
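As an illustration of how p(T, R) reduces to counting inversions, the sketch below uses a standard merge-sort inversion count. Note that this runs in \({\mathcal {O}}(n \log n)\) and is a simpler substitute for, not an implementation of, the \({\mathcal {O}}(n \sqrt{\log n})\) algorithm of Chan and Pătraşcu (2010). The list representation of trees (at least two leaves, cherry first) and the function names are our assumptions.

```python
def count_inversions(seq):
    """Count pairs i < j with seq[i] > seq[j] using merge sort."""
    def sort_count(values):
        if len(values) <= 1:
            return values, 0
        mid = len(values) // 2
        left, left_count = sort_count(values[:mid])
        right, right_count = sort_count(values[mid:])
        merged, i, j, cross = [], 0, 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # right[j] is strictly smaller than all remaining elements of left.
                merged.append(right[j])
                j += 1
                cross += len(left) - i
        merged += left[i:] + right[j:]
        return merged, left_count + right_count + cross

    return sort_count(list(seq))[1]


def transposition_count(T, R):
    """p(T, R) for caterpillar trees given as leaf lists ordered by parent rank."""
    rank_R = {leaf: max(1, pos) for pos, leaf in enumerate(R)}
    seq = [rank_R[leaf] for leaf in T]
    # The two cherry leaves of T share a parent, so they never form a transposition;
    # the cherry leaves of R have equal rank and are not counted as an inversion.
    cherry_correction = 1 if seq[0] > seq[1] else 0
    return count_inversions(seq) - cherry_correction


R = ["a1", "a2", "a3", "a4", "a5"]
print(transposition_count(["a4", "a5", "a3", "a2", "a1"], R))  # 8
print(transposition_count(["a5", "a4", "a3", "a2", "a1"], R))  # 8
```

Both list representations of the same caterpillar tree give the same value, because the pair of cherry leaves of T is excluded explicitly.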
Diameter and radius
In this section we investigate the diameter of \({\mathrm {RNNI}}\) and \({\mathrm {DCT}}_m\), which is the greatest distance between any pair of trees in each of these graphs, respectively, i.e. \(\max \limits _{\text {trees }T,R}d(T,R)\). We first establish the exact diameter of \({\mathrm {RNNI}}\), improving the upper bound \(n^2 - 3n - \frac{5}{8}\) given by Gavryushkin et al. (2018). Afterwards, we generalise this result to \({\mathrm {DCT}}_m\).
Theorem 5
The diameter of \({\mathrm {RNNI}}\) is \(\frac{(n-1)(n-2)}{2}\).
Proof
For proving this theorem we use the fact that \({\textsc {FindPath}}\) computes shortest paths in \({\mathrm {RNNI}}\). Each iteration i of \({\textsc {FindPath}}\), applied to two ranked trees T and R, decreases the rank of the most recent common ancestor of a cluster C, induced by the node of rank i in R, in the currently last tree \(T'\) on the already computed path (starting with \(T' = T\)). The maximum rank of \((C)_{T'}\) at the beginning of iteration i is \(n-1\), the rank of the root. As every move decreases the rank of \((C)_{T'}\) by one, there are at most \(n-1-i\) moves in iteration i. The maximum length of a shortest path in \({\mathrm {RNNI}}\) is hence \(\sum \limits _{i = 1}^{n-1} (n-1-i) = \frac{(n-1)(n-2)}{2}\). Note that the caterpillar trees \([\{a_1, a_2\}, \{a_1, a_2, a_3\}, \ldots , \{a_1, \ldots , a_n\}]\) and \([\{a_n, a_{n-1}\}, \{a_n, a_{n-1}, a_{n-2}\}, \ldots , \{a_n, \ldots , a_1\}]\) provide an example of trees that have distance \(\frac{(n-1)(n-2)}{2}\), as already pointed out in Collienne and Gavryushkin (2021, Corollary 1), proving that this upper bound for the length of a shortest path is tight. \(\square \)
Theorem 6
The diameter of \({\mathrm {DCT}}_m\) is \(\frac{(n-1)(n-2)}{2} + (m-n+1)(n-1)\).
Proof
In order to determine the diameter of \({\mathrm {DCT}}_m\), we consider the maximum number of moves that \({\textsc {FindPath}}\) can perform on the extended ranked versions \(T_r\) and \(R_r\) of any two trees T and R. By Theorem 1, this is indeed the diameter of \({\mathrm {DCT}}_m\). To this end, we distinguish \({\mathrm {RNNI}}\) moves in the subtrees on the leaf set \(\{a_1, \ldots , a_n\}\) from the rank moves corresponding to length moves, i.e. rank moves between one node of each of the subtrees on leaf subsets \(\{a_1, \ldots , a_n\}\) and \(\{a_{n+1}, \ldots , a_{m+2}\}\).
The maximum number of \({\mathrm {RNNI}}\) moves (excluding rank moves corresponding to length moves) on \({\mathrm {FP}}(T_r,R_r)\) follows from Theorem 5 and is \(\frac{(n-1)(n-2)}{2}\). The maximum number of rank moves corresponding to length moves on a shortest path between \(T_r\) and \(R_r\) is reached when every internal node of the subtree \(T_r^c\) of \(T_r\) swaps rank with every internal node of the subtree \(T_r^d\). The maximum number of such rank swaps corresponding to length moves is hence \((m-n+1)(n-1)\).
The sum of the maximum number for \({\mathrm {RNNI}}\) and length moves is therefore \(\frac{(n-1)(n-2)}{2} + (m-n+1)(n-1)\). To show that this upper bound is actually the diameter of \({\mathrm {DCT}}_m\) we give an example of trees T and R (Fig. 9) for which the path computed by \({\textsc {FindPath}}^+\) has length \(\frac{(n-1)(n-2)}{2} + (m-n+1)(n-1)\). Both T and R are caterpillar trees defined as follows.
$$\begin{aligned}m-n+2 = {\mathrm {rank}}(a_1)_T = {\mathrm {rank}}(a_2)_T< {\mathrm {rank}}(a_3)_T< \ldots < {\mathrm {rank}}(a_n)_T = m\end{aligned}$$
and
$$\begin{aligned}1 = {\mathrm {rank}}(a_n)_R = {\mathrm {rank}}(a_{n-1})_R< {\mathrm {rank}}(a_{n-2})_R< \ldots < {\mathrm {rank}}(a_1)_R = n-1.\end{aligned}$$
\(\square \)
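The two diameter formulas are easy to evaluate for concrete parameters. The sketch below is a small numerical check with hypothetical function names; the requirement \(m \ge n-1\) is our assumption, based on the identity \({\mathrm {RNNI}}= {\mathrm {DCT}}_{n-1}\).

```python
def rnni_diameter(n):
    """Diameter of RNNI on n leaves (Theorem 5)."""
    return (n - 1) * (n - 2) // 2


def dct_diameter(n, m):
    """Diameter of DCT_m on n leaves (Theorem 6), assuming m >= n - 1."""
    if m < n - 1:
        raise ValueError("m must be at least n - 1")
    return rnni_diameter(n) + (m - n + 1) * (n - 1)


def findpath_move_bound(n):
    """Sum over all iterations i of the bound n - 1 - i on the number of moves."""
    return sum(n - 1 - i for i in range(1, n))


for n in range(3, 8):
    # The iteration bounds of FindPath add up to the RNNI diameter.
    assert findpath_move_bound(n) == rnni_diameter(n)
    # For m = n - 1 the two formulas coincide.
    assert dct_diameter(n, n - 1) == rnni_diameter(n)

print(rnni_diameter(5))    # 6
print(dct_diameter(3, 4))  # 5
```

The second term of the formula counts the rank swaps corresponding to length moves: \(m-n+1\) internal nodes of one subtree, each swapping once with each of the \(n-1\) internal nodes of the other.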
Note that the worst-case running time of \({\textsc {FindPath}}\) in \({\mathrm {RNNI}}\) is \({\mathcal {O}}(n^2)\) and the running time of \({\textsc {FindPath}}^+\) in \({\mathrm {DCT}}_m\) is \({\mathcal {O}}(nm)\), as it depends on the diameter of the corresponding tree spaces. For computing a shortest path, there is no algorithm with better worst-case running time than these, as the running time for algorithms computing shortest paths is bounded from below by the diameter of the corresponding space. There can however be more efficient algorithms for computing distances, if this is not done by computing the shortest path as \({\textsc {FindPath}}\) and \({\textsc {FindPath}}^+\) do, but by finding an invariant that determines the distance without needing to compute every tree on a shortest path.
The radius of a graph is defined as the minimum distance of any vertex in the graph to the vertex with maximum distance from it, that is, \(\min \limits _{T}\max \limits _{R} d(T,R)\), where d is the distance measure in the corresponding graph. In the following we see that the radius of \({\mathrm {RNNI}}\) equals its diameter, which is not true for \({\mathrm {DCT}}_m\), as we will see afterwards.
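For small graphs, radius and diameter can be computed directly from these definitions by a breadth-first search from every vertex. The sketch below assumes a connected graph given as an adjacency dictionary; the function names and the vertex labels are hypothetical. As an example we use \({\mathrm {RNNI}}\) on three leaves, which consists of three trees that are pairwise neighbours.

```python
from collections import deque


def eccentricity(adj, source):
    """Greatest distance from source to any vertex of a connected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return max(dist.values())


def radius_and_diameter(adj):
    """Return (min, max) over the eccentricities of all vertices."""
    ecc = [eccentricity(adj, v) for v in adj]
    return min(ecc), max(ecc)


# RNNI on three leaves: three caterpillar trees at distance one from each other.
rnni_three_leaves = {"T1": ["T2", "T3"], "T2": ["T1", "T3"], "T3": ["T1", "T2"]}
print(radius_and_diameter(rnni_three_leaves))  # (1, 1)

# A path on three vertices, where radius and diameter differ.
path_graph = {0: [1], 1: [0, 2], 2: [1]}
print(radius_and_diameter(path_graph))  # (1, 2)
```

This brute-force approach is only feasible for very small numbers of leaves, since the number of trees grows rapidly with n.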
Theorem 7
The radius of \({\mathrm {RNNI}}\) equals its diameter \(\frac{(n-1)(n-2)}{2}\).
Proof
We prove this theorem by showing that every ranked tree T in \({\mathrm {RNNI}}\) has a caterpillar tree R with distance \(\frac{(n-1)(n-2)}{2}\) to T, using induction on the number of leaves n.
The base case \(n=3\) is trivial, as all three trees in this space are caterpillar trees with distance one from each other. For the induction step we consider an arbitrary tree T with \(n+1\) leaves. Let x and y be the leaves of T that share the internal node of rank one as parent in T, and let \(T_n\) be the tree on n leaves resulting from deleting one of these leaves, say x, from T, and suppressing the resulting degree-2 vertex. By the induction hypothesis there is a caterpillar tree \(R_n\) with distance \(\frac{(n-1)(n-2)}{2}\) to \(T_n\). Now consider the tree R resulting from adding x at the top of \(R_n\) such that the root of R has x and \(R_n\) as children.
We now consider \({\mathrm {FP}}(R,T)\). In the first iteration of \({\textsc {FindPath}}\), \((\{x,y\})_R\) moves down until it reaches rank one. Therefore, first \((x)_R\) moves down by \({\mathrm {NNI}}\) moves until it reaches rank \({\mathrm {rank}}(y)_R + 1\). Then a further \({\mathrm {NNI}}\) move creates an internal node with children x and y, before this node is moved down by rank swaps to reach rank one as depicted in Fig. 10. Altogether, there are \(n-1\) \({\mathrm {RNNI}}\) moves needed in the first iteration, as the rank of the parent of x decreases by one within every move, starting at the root with rank n and ending at the internal node of rank one. The tree at the end of this first iteration on \({\mathrm {FP}}(R,T)\) is identical to \(R_n\) when removing the leaf x and suppressing its parent (the node of rank one). Since the cluster \(\{x,y\}\) is not considered again in \({\textsc {FindPath}}\), the remaining part of \({\mathrm {FP}}(R,T)\) contains the same moves as \({\mathrm {FP}}(R_n,T_n)\), and hence \(|{\mathrm {FP}}(R,T)| = |{\mathrm {FP}}(R_n,T_n)| + n-1\). Therefore it is \(d_{{\mathrm {RNNI}}}(T,R) = \frac{(n-1)(n-2)}{2} + n-1 = \frac{n(n-1)}{2}\), which proves the theorem. \(\square \)
Unlike in \({\mathrm {RNNI}}\), the radius of \({\mathrm {DCT}}_m\) does not equal its diameter. A counterexample is given by the tree depicted in Fig. 11 on three leaves in \({\mathrm {DCT}}_4\). The diameter of \({\mathrm {DCT}}_4\) on three leaves is \(\frac{(n-1)(n-2)}{2} + (m-n+1)(n-1) = 5\), but there is no tree with this distance from the tree provided in Fig. 11. The maximum distance between any tree in \({\mathrm {DCT}}_4\) and the tree in Fig. 11 is 4.