Distribution and asymptotic behavior of the phylogenetic transfer distance
Abstract
The transfer distance (TD) was introduced in the classification framework and studied in the context of phylogenetic tree matching. Recently, Lemoine et al. (Nature 556(7702):452–456, 2018. https://doi.org/10.1038/s41586-018-0043-0) showed that TD can be a powerful tool to assess the branch support on large phylogenies, thus providing a relevant alternative to Felsenstein’s bootstrap. This distance allows a reference branch \(\beta \) in a reference tree \({\mathcal {T}}\) to be compared to a branch b from another tree T (typically a bootstrap tree), both on the same set of n taxa. The TD between these branches is the number of taxa that must be transferred from one side of b to the other in order to obtain \(\beta \). By taking the minimum TD from \(\beta \) to all branches in T we define the transfer index, denoted by \(\phi (\beta ,T)\), measuring the degree of agreement of T with \(\beta \). Let us consider a reference branch \(\beta \) having p tips on its light side and define the transfer support (TS) as \(1 - \phi (\beta ,T)/(p-1)\). Lemoine et al. (2018) used computer simulations to show that the TS defined in this manner is close to 0 for random “bootstrap” trees. In this paper, we demonstrate that result mathematically: when T is randomly drawn, TS converges in probability to 0 when n tends to \(\infty \). Moreover, we fully characterize the distribution of \(\phi (\beta ,T)\) on caterpillar trees, indicating that the convergence is fast, and that even when n is small, moderate levels of branch support cannot appear by chance.
Keywords
Phylogenetic trees · Distances between bipartitions and phylogenies · R-distance · Random phylogenies · Concentration inequalities · Lattice paths
Mathematics Subject Classification
Primary 92-08 62P10 62F40 62F05 62F35 Secondary 05A05 05C05 60E15
1 Introduction
The transfer distance or R-distance was introduced in the classification framework by Day (1981) and Régnier (1965), as a measure of (dis)similarity between partitions of a set. It is defined as the minimum number of elements that need to be transferred from one class to another (or removed), in order to transform one partition into the other. This distance possesses some desirable properties, for example its low computational cost in comparison with other metrics, as established by Day (1981). Charon et al. (2006) studied other characteristics of this distance such as the maximum transfer distance that can be obtained when comparing two partitions with a fixed, but possibly different, number of classes. As highlighted by Denœud (2008), it proves challenging to study the theoretical properties of the transfer distance, so the author used simulations to approximate its distribution and mean on random partitions.
The interest in using the transfer distance to compare phylogenetic trees started with a seminal paper by Day (1985). In the field of computational biology, problems involving tree comparison have remained a major challenge for many years. A common concern is to define biologically meaningful metrics on trees. The transfer distance is a measure to compare bipartitions, and a phylogenetic tree is unambiguously defined by the set of bipartitions induced by its branches. A logical question, then, is whether we can define a metric on trees based on this transfer distance on bipartitions. However, there are multiple ways to define such a metric. Day (1985) proposes several algorithms and methods to solve related tree problems, in particular the construction of tree consensus. As discussed by Day (1985), this task requires the optimization of a consensus index, which can be defined using the transfer distance or other metrics, such as the well-known Robinson–Foulds (RF) metric (Robinson and Foulds 1981). The latter is probably the most widely used distance between trees and is defined as the number of bipartitions belonging to one tree but not to the other. However, the RF metric is known to have several drawbacks, including its lack of robustness, since it is highly sensitive to small tree changes, as pointed out by Lin et al. (2012) and Bogdanowicz and Giaro (2012).
Boc et al. (2010) used RF and a transfer-based dissimilarity to compare gene trees to species trees and detect horizontal gene transfers. They showed that the transfer approach provides better results than RF. However, their transfer-based dissimilarity is not a metric, since it violates the triangle inequality in some cases (Boc et al. 2010, p. 197, Proposition 1). Lin et al. (2012) addressed this problem using the minimum-cost matching between the two sets of bipartitions induced by both trees. For that metric, also relying on the transfer distance, the triangle inequality holds. Moreover, Lin et al. (2012) proposed a low-polynomial time algorithm to compute this new tree metric and demonstrated its robustness compared to RF.
Recently, we proposed a new bootstrap method for large phylogenetic trees that relies on branch comparisons based on the transfer distance (Lemoine et al. 2018). The aim was to use a more fine-grained measure for the presence of a branch in a tree, rather than the binary values used in Felsenstein’s classical phylogenetic bootstrap technique (Felsenstein 1985). In this approach, we compare a reference branch \(\beta \) in a reference tree \({\mathcal {T}}\) to another tree T, typically a bootstrap tree, by taking the minimum of the transfer distance from \(\beta \) to any branch b in T, which is called the transfer index and denoted by \(\phi (\beta ,T)\). Next, the average of \(\phi \) over a set of bootstrap trees is used to define, after appropriate normalization, the so-called transfer bootstrap expectation (TBE). We explored the behavior of TBE as a measure of support for the branches of a phylogenetic tree, compared to that of Felsenstein’s support (FS). In a number of experiments using both real and simulated data, we found that TBE outperformed FS. This was particularly noticeable for deep branches and large values of n, where FS often failed to detect the phylogenetic signal in the trees. Unstable taxa are inevitable (sequence and reconstruction errors, recombination, etc.) under these conditions, and nearly correct branches are seen as absent by the standard bootstrap, thus yielding low support values, while TBE is able to detect the few misplaced taxa and provides high support values for those branches. In view of those results, TBE shows promise as a useful tool in phylogenetic analysis. Lemoine et al. (2018) studied and discussed several of its properties, but there was still a need for further mathematical work to fully understand the transfer index and TBE.
The main motivation for the present work is therefore to study the properties of the transfer index and support, and characterize their asymptotic behavior when the reference branch is compared to a tree T drawn randomly according to some null model, reflecting in a bootstrap context the absence of phylogenetic signal in the analyzed data set.
To be more specific about the results obtained here, let us fix \(n\ge 4\) and consider phylogenetic trees on a set X of n taxa. To distinguish the two sides of the reference bipartition \(\beta \), we say that its light side contains \(p \ge 2\) taxa while its heavy side has \(n-p \ge p\) taxa. The TBE proposed by Lemoine et al. (2018) is the average, over all the bootstrap trees, of the transfer support function (TS), which is defined as \(1 - \phi (\beta ,T)/(p-1)\). Equivalently, we can compute the average of \(\phi (\beta , T)\) over all bootstrap trees and apply the same linear normalization to obtain TBE. It is not hard to see that \(0\le \phi (\beta ,T)\le p-1\) and thus \(\text {TS}\in [0,1]\).
Our first result consists in the characterization of the asymptotic behavior of the transfer index \(\phi (\beta ,T)\) for a random bipartition \(\beta \) of the set X and any given tree T. We prove that the transfer index converges in probability to \(p-1\) when p is fixed (or grows slower than \(\sqrt{n}\)) and n tends to \(\infty \). The proof relies on the comparison of the transfer index with the parsimony score of a binary character, and a result from Steel and Penny (2005). We then use concentration inequalities to characterize the asymptotic behavior of the transfer index when p grows faster, depending on n (e.g. when \(p = {\lfloor }{n/2}{\rfloor }\)). Lastly, when T is a caterpillar tree, we fully characterize the probability distribution of the transfer index based on a one-to-one correspondence between these trees and North-East (NE) lattice paths, a common technique for counting combinatorial objects (Mohanty 1979). All of these results show that \(p-1\) is the appropriate normalization constant for the TS and TBE, as proposed by Lemoine et al. (2018).
The paper is organized as follows. In Sect. 2, we give the main definitions and properties of the concepts described earlier. Section 3 is devoted to the results concerning the parsimony score, and Sect. 4 presents the asymptotic results using concentration inequalities. Details on the specific case of the caterpillar tree are given in Sect. 5. In Sect. 6, we discuss the impact of our findings on the phylogenetic bootstrap, and propose several conjectures and directions for further work.
2 Preliminaries
In this section, we give the main definitions and general properties on phylogenetic trees that are needed for the rest of the paper. We refer to Semple and Steel (2003) for an extensive mathematical treatment of this subject.
Let us fix \(n\ge 4\) and X, a set of n taxa. We consider phylogenetic trees on X, that is, trees whose leaves are mapped one-to-one to X. These trees are called phylogenetic X-trees or simply phylogenies. For simplicity of notation, we shall always take \(X = \{1,2,\ldots ,n\}\). Denote by \(\text {UB}(n)\) the set of all unrooted binary phylogenetic trees (every interior vertex has degree 3) on n leaves. For a phylogenetic tree T, we use \({\mathcal {E}}(T)\), \({\mathcal {V}}(T)\) to denote respectively the set of edges (or branches) and the set of vertices (or nodes) of the tree.
For any X-tree T, a branch \(b\in {\mathcal {E}}(T)\) can be encoded in several equivalent ways, which we will use interchangeably depending on the context. First, any branch b defines a bipartition (or split), and we can associate b to a vector v(b) in \(\{0,1\}^n\) by assigning the same number (e.g. 0) to all the elements on the same side of the split induced by this branch. Notice that b is also encoded by \({{\overline{v}}}(b)\), the negation of v(b) (i.e. the 0 values are turned into 1 and vice versa). Likewise, we can identify a bipartition with a bicoloration of the leaves, that is, a function that assigns one of two colors (black \(={\mathbf{B }}\) or white \(={\mathbf{W }}\)) to each leaf label. Notice however, that we can consider a bipartition or a bicoloration on a tree that does not correspond to any branch in this tree. To make the distinction, we say \(b\in {\mathcal {E}}(T)\) for the bipartitions induced by branches on the tree T, and we use \({\mathcal {X}}:=\{\chi :X\rightarrow \{{\mathbf{B }},{\mathbf{W }}\} \}\) to denote the set of all possible bicolorations of the tips of T. Then, an element of \({\mathcal {X}}\) does not necessarily correspond to a branch in T, but to a bicoloration of its tips.
As described in the Introduction, the transfer distance is used to compare a branch \(\beta \) in the reference phylogeny \({\mathcal {T}}\), to a second branch b in another phylogeny T, both on the same taxa set X. This distance can easily be defined using the Hamming distance H between two vectors of equal size.
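Definition 1
Let \(\beta \) and b be two branches of phylogenies on the same taxa set X. The transfer distance between \(\beta \) and b is \(\delta (\beta ,b) = \min \{H(v(\beta ),v(b)),\, H(v(\beta ),{{\overline{v}}}(b))\}\).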
Based on this definition, notice that \(\delta (\beta ,b)=0\) if and only if \(\beta \) and b define the same bipartition. To measure the degree of presence of \(\beta \) in T, we define the transfer index, denoted by \(\phi (\beta ,T)\), which is the minimum of the transfer distance over all branches in T (Lemoine et al. 2018).
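Definition 2
The transfer index of \(\beta \) in T is \(\phi (\beta ,T) = \min _{b\in {\mathcal {E}}(T)} \delta (\beta ,b)\).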
As mentioned before, we are interested in the case where the reference tree \({\mathcal {T}}\) and a branch \(\beta \) on this tree are fixed. The core idea proposed by Lemoine et al. (2018) is to measure the presence of this reference branch in a set of bootstrap trees by using the following transfer support function.
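Definition 3
Let \(\beta \) be a reference branch with \(p\ge 2\) tips on its light side. The transfer support of \(\beta \) in T is \(\text {TS}(\beta ,T) = 1 - \phi (\beta ,T)/(p-1)\). The transfer index and the transfer support satisfy:
 (i) \(\phi \left( \beta ,T \right) = 0 \Longleftrightarrow \beta \in T\),
 (ii) \(\phi \left( \beta ,T \right) \in [0,p-1]\) or equivalently \(\text {TS}\left( \beta ,T \right) \in [0,1]\).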
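The three definitions can be illustrated with a minimal Python sketch. It assumes that a tree is given simply as the list of 0/1 vectors v(b) of its branches; the helper `caterpillar_splits` and all function names are hypothetical, introduced only for this illustration.

```python
from typing import List, Sequence, Tuple

Split = Tuple[int, ...]  # 0/1 vector v(b): equal entries lie on the same side of b


def hamming(u: Sequence[int], v: Sequence[int]) -> int:
    return sum(a != b for a, b in zip(u, v))


def transfer_distance(beta: Split, b: Split) -> int:
    # Comparing with the negated vector costs n - h, so no explicit negation is needed.
    h = hamming(beta, b)
    return min(h, len(beta) - h)


def transfer_index(beta: Split, splits: Sequence[Split]) -> int:
    # Minimum transfer distance over all branches of the tree.
    return min(transfer_distance(beta, b) for b in splits)


def transfer_support(beta: Split, splits: Sequence[Split]) -> float:
    ones = sum(beta)
    p = min(ones, len(beta) - ones)  # size of the light side, assumed >= 2
    return 1 - transfer_index(beta, splits) / (p - 1)


def caterpillar_splits(n: int) -> List[Split]:
    # Illustrative tree: caterpillar with leaves in the order 1..n.
    pendant = [tuple(int(j == i) for j in range(n)) for i in range(n)]
    internal = [tuple(int(j < i) for j in range(n)) for i in range(2, n - 1)]
    return pendant + internal


splits = caterpillar_splits(8)
print(transfer_support((1, 1, 1, 0, 0, 0, 0, 0), splits))  # beta is a branch of T
print(transfer_support((1, 0, 1, 0, 1, 0, 0, 0), splits))  # light side is dispersed
```

The first call illustrates property (i): the reference bipartition is a branch of the tree, so the index is 0 and the support is 1. In the second call the index reaches its upper bound \(p-1\), so the support is 0.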
2.1 Null models
The aim of this study is to characterize the distribution and the asymptotic behavior of the transfer index and transfer support when the reference bipartition \(\beta \) is compared to a binary phylogenetic X-tree T that follows a certain null model. We are interested mainly in unrooted trees, but it should be noted that the existence of a root has no influence on transfer distance values: both branches adjacent to the root define the same bipartition.
There are two ways to define the probabilistic models we are considering. First, we can suppose that we have a fixed bicoloration \(\chi _p\in {\mathcal {X}}_p\), where \({\mathcal {X}}_p\) denotes the set of bicolorations with p black tips, and that we draw a tree T from \(\text {UB}(n)\), following some specific probabilistic model. Another way is to consider that the tree is fixed and a bicoloration of its tips is uniformly chosen from \({\mathcal {X}}_p\). In the first case, an interesting question is to consider the probabilistic models that are most commonly used in the field of phylogenetics, such as the Yule–Harding or PDA models. For a fixed tree, a natural question is to look at the two extreme cases for the topology regarding balance. The most imbalanced tree is called the caterpillar tree, which is defined as a binary phylogenetic tree for which the induced subtree on the interior vertices forms a path graph (if the tree is rooted, then the root is at one end of the path). At the other extreme, we have perfectly balanced trees, that is, rooted binary phylogenetic trees with \(n = 2^h\) leaves (for some \(h\in {\mathbb {N}}\)), each of which is at a distance of exactly h edges from the root. We refer to Semple and Steel (2003) and Steel (2016) for further details on these tree models.
As explained in the Introduction, we performed computer simulations for these four models to exhibit their asymptotic properties. In Fig. 1, we observe that the asymptotic behavior of the TS seems to be independent of the model considered. This is explained in the following sections. Then, a full theoretical treatment is carried out for the caterpillar tree topology.
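A simulation of this kind can be sketched in a few lines of Python for one of the null models only: a fixed caterpillar tree with a bicoloration drawn uniformly from \({\mathcal {X}}_p\). The function names, the seed, and the number of replicates are illustrative assumptions, and the Yule–Harding, PDA, and balanced models are not covered by this sketch.

```python
import random


def caterpillar_transfer_index(colors):
    """Transfer index of a 0/1 leaf coloring (1 = black) on the caterpillar
    whose leaves appear in list order; assumes 2 <= p <= n/2 black tips."""
    n, p = len(colors), sum(colors)
    best = p - 1  # value reached on the pendant branch of any black leaf
    black_prefix = 0
    for i in range(1, n):
        black_prefix += colors[i - 1]
        white_prefix = i - black_prefix
        # Hamming distance between the coloring and the split {1..i} | {i+1..n}
        h = white_prefix + (p - black_prefix)
        best = min(best, h, n - h)
    return best


def mean_ts(n, p, reps, rng):
    """Monte Carlo estimate of E[TS] under uniformly random bicolorations."""
    leaves = [1] * p + [0] * (n - p)
    total = 0.0
    for _ in range(reps):
        rng.shuffle(leaves)
        total += 1 - caterpillar_transfer_index(leaves) / (p - 1)
    return total / reps


rng = random.Random(0)
for n in (32, 128, 512):
    print(n, round(mean_ts(n, n // 2, 500, rng), 3))
```

With \(p = {\lfloor }{n/2}{\rfloor }\), the estimated mean TS should shrink as n grows, which is the trend described for Fig. 1.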
3 Comparing the transfer index to the parsimony score
We are now interested in comparing the transfer index to the widely used parsimony score introduced by Farris (1970), Fitch (1971), and Hartigan (1973). We show that the transfer index is lower-bounded by the parsimony score minus one, and use this result to obtain our first characterization of the asymptotic behavior of the transfer index.
Definition 4
By using a simple argument, one can prove the following result from Lemoine et al. (2018), given here for the sake of completeness.
Lemma 1
Proof
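The bound announced at the start of this section (transfer index at least the parsimony score minus one) can be checked exhaustively on small trees. The sketch below is only an illustration: trees are encoded as nested tuples, the parsimony score is computed with the standard Fitch procedure, and both example trees and all function names are assumptions made here.

```python
from itertools import combinations


def fitch(tree, black):
    """Fitch pass on a rooted binary tree given as nested tuples of leaf labels.
    Returns (state set at the node, number of color changes below it)."""
    if isinstance(tree, int):
        return {tree in black}, 0
    (ls, lc), (rs, rc) = fitch(tree[0], black), fitch(tree[1], black)
    common = ls & rs
    if common:
        return common, lc + rc
    return ls | rs, lc + rc + 1


def clades(tree, out):
    """Collect the leaf set below every node; each one defines a bipartition."""
    if isinstance(tree, int):
        leaves = frozenset([tree])
    else:
        leaves = clades(tree[0], out) | clades(tree[1], out)
    out.append(leaves)
    return leaves


def transfer_index(tree, black, n):
    out = []
    clades(tree, out)
    best = n
    for c in out:
        if len(c) == n:  # the root clade is not a branch
            continue
        h = len(c ^ black)  # symmetric difference = Hamming distance
        best = min(best, h, n - h)
    return best


caterpillar = (((((((0, 1), 2), 3), 4), 5), 6), 7)
balanced = (((0, 1), (2, 3)), ((4, 5), (6, 7)))


def check_lower_bound(tree, n):
    """True if transfer index >= parsimony score - 1 for every bicoloration."""
    for p in range(2, n // 2 + 1):
        for blacks in combinations(range(n), p):
            black = frozenset(blacks)
            if transfer_index(tree, black, n) < fitch(tree, black)[1] - 1:
                return False
    return True


print(check_lower_bound(caterpillar, 8), check_lower_bound(balanced, 8))
```

Rooting the trees is only a convenience: the parsimony score and the set of bipartitions do not depend on where the root is placed.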
3.1 Asymptotic results for fixed p
In this subsection, we use inequality (1) between the parsimony score and the transfer index to establish that the transfer index converges to \(p-1\) when p is fixed and n grows to infinity. Let us consider a random bicoloration \(\chi _p\) from \({\mathcal {X}}_p\). Let T be a given binary phylogenetic tree with n tips colored by \(\chi _p\). The larger n, the more dispersed the black tips in T, and the higher the probability that the parsimony score is equal to p and the transfer index to \(p-1\). This is formalized as follows.
Proposition 1
Proof
This result is valid for any given binary phylogenetic tree T as long as the bicoloration \(\chi _p\) is uniformly distributed in the set \({\mathcal {X}}_p\). It has the following immediate consequences.
Corollary 1
• when p is fixed, the transfer index \(\phi (\chi _p,T)\) converges in probability to \(p-1\) when \(n\rightarrow \infty \);
• when \(p=o(\sqrt{n})\), we have that \(\phi (\chi _p,T) - (p-1)\) converges in probability to 0 when \(n\rightarrow \infty \).
4 Behavior of the transfer distance when p grows with n
In the previous section, we showed that, when n tends to infinity, the transfer index converges in probability to \(p-1\) for fixed p, and TS converges to 0 when p grows slowly as \(o(\sqrt{n})\). However, simulations in Fig. 1 suggest that the transfer index also behaves in a similar manner for larger values of p relative to n. For example, for all null models when \(p = {\lfloor }{n/2}{\rfloor }\), the expected value of TS is larger than 0.1 with \(n = 128\), but lower than 0.05 with \(n = 1024\). In this section, we will show that for “all values of p”, the distribution of the transfer index is concentrated around \(p-1\), meaning that the probability of the transfer index being “far away” from \(p-1\) vanishes as n grows. This explains what we observe in our simulations and motivates the use of \(p-1\) as the normalization term in the definition of TS.
The results we obtain in this section are based on concentration inequalities. More precisely, we make use of the Chernoff–Hoeffding bounds for sums of independent random variables, as stated by Dubhashi and Panconesi (2009). In his original paper, Hoeffding (1963) proved that these inequalities also hold for sums of variables obtained by sampling without replacement, which is the case of interest here. The following lemma is a direct consequence of the results in Hoeffding (1963) and Dubhashi and Panconesi (2009).
Lemma 2
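As a numerical illustration of the kind of bound involved, the sketch below compares an exact tail probability with the classical one-sided Hoeffding inequality \(P(S - kp/n \ge t) \le \exp (-2t^2/k)\), where S is the number of black tips among k fixed leaves under a uniform bicoloration with p black tips (sampling without replacement). This is the textbook form of the inequality and may differ in constants from the statement of the lemma; the function names and the values of n, p, k are illustrative assumptions.

```python
from math import ceil, comb, exp


def hypergeom_upper_tail(n, p, k, s_min):
    """Exact P(S >= s_min), where S counts the black leaves among k fixed
    leaves when p of the n leaves are colored black uniformly at random."""
    total = comb(n, p)
    favorable = sum(
        comb(k, s) * comb(n - k, p - s)
        for s in range(max(s_min, 0), min(k, p) + 1)
    )
    return favorable / total


def hoeffding_bound(k, t):
    """One-sided Hoeffding bound for a sum of k draws taking values in [0, 1]."""
    return exp(-2 * t * t / k)


n, p, k = 128, 64, 40
mean = k * p / n  # expected number of black leaves among the k chosen ones
for t in (4, 8, 12):
    exact = hypergeom_upper_tail(n, p, k, ceil(mean + t))
    print(t, exact, hoeffding_bound(k, t))
```

In each row the exact tail should lie below the bound, and both decay quickly once t exceeds a few multiples of \(\sqrt{k}\).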
We can now state the main theorem of this section.
Theorem 1
 1. if \(p=O(n^{\alpha })\) for some \(0< \alpha <1\), then \(\phi (\chi _p,T) \ge p-C\) for some constant C;
 2. if \(p=cn + o(n)\) for some \(0<c<1/2\), then \(\phi (\chi _p,T) \ge p-C\log n\) for some constant C;
 3. if \(p=\frac{1}{2}n-o(n)\), then \(\phi (\chi _p,T) \ge p-C\sqrt{n\log n}\) for some constant C.
Corollary 2
For any given tree T, any \(\chi _p\) uniformly distributed in the set \({\mathcal {X}}_p\), and any p that grows with n as in cases 1, 2, and 3, the transfer support \(\text {TS}(\chi _p,T)\) converges in probability to 0 when \(n\rightarrow \infty \).
Remark 1
The convergence established by the previous corollary also holds for TBE, which is obtained by averaging TS over N “bootstrap trees”, when all those trees follow the same null model.
Proof of Theorem 1
We now must consider three cases depending on the growth rate of p with respect to n.
Case 1: \(p=O(n^\alpha )\) for some \(\alpha <1\).
Case 2: \(p=cn+o(n)\) for some \(0<c<\frac{1}{2}\).
Case 3: \(p=\frac{1}{2}n-o(n)\).
5 Exact distribution of the transfer index on caterpillar trees
In this section, we provide exact formulae for the transfer index distribution on caterpillar trees. We shall see in the discussion section that these formulae can be used to compute p-values for the general case, under suitable assumptions (conjectures). Moreover, the combinatorial techniques used here could potentially help obtain similar results with other kinds of trees (e.g. fully balanced).
5.1 Correspondence between bicolored caterpillar trees and NE lattice paths
A NE lattice path is a path in \({\mathbb {Z}}^2\) where the only steps allowed are (0, 1) (a step towards the north) and (1, 0) (a step towards the east). From now on, we call them lattice paths for short. Let \({\mathcal {P}}(p,n-p)\) denote the \(p \times (n-p)\) NE lattice, that is the set of all lattice paths from the origin (0, 0) to the destination \((p, n-p)\). A path in \({\mathcal {P}}(p,n-p)\) can be encoded in a single vector of length n indicating the sequence of steps of the path, which is an element of \(\{N,E\}^n\) with a total number of east steps equal to p and a total number of north steps equal to \(n-p\). On the other hand, the set \({\mathcal {X}}_p\) of bicolorations with p black tips is a subset of \(\{{\mathbf{B }},{\mathbf{W }}\}^n\). We define the function \(F: {\mathcal {X}}_p \rightarrow {\mathcal {P}}(p,n-p)\) that associates a lattice path to a bicolored tree by scanning the tips on the tree from 1 to n as follows: whenever we read a white tip, we move towards the north; and whenever we read a black tip, we move towards the east. Consequently, a bicolored oriented caterpillar tree on n tips with p black tips corresponds to a unique path in \({\mathcal {P}}(p,n-p)\) and vice versa, as represented in Fig. 3. This result can be summarized as follows.
Lemma 3
The function F is a bijection from \({\mathcal {X}}_p\) to \({\mathcal {P}} (p,n-p)\).
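The map F and its inverse are simple enough to write down directly. The following Python sketch encodes a bicoloration as a string over the letters B and W and a lattice path as its list of lattice points; these encodings and the function names are choices made for this illustration only.

```python
from itertools import combinations
from math import comb


def F(coloring):
    """Map a bicoloration, read from tip 1 to tip n, to the NE lattice path
    it defines: a black tip is an east step, a white tip is a north step."""
    x = y = 0
    path = [(0, 0)]
    for c in coloring:
        if c == "B":
            x += 1
        else:
            y += 1
        path.append((x, y))
    return path


def F_inverse(path):
    """Recover the bicoloration from the sequence of steps of a lattice path."""
    return "".join(
        "B" if nxt[0] > cur[0] else "W" for cur, nxt in zip(path, path[1:])
    )


def bicolorations(n, p):
    """Enumerate all bicolorations of n tips with exactly p black tips."""
    for blacks in combinations(range(n), p):
        chosen = set(blacks)
        yield "".join("B" if i in chosen else "W" for i in range(n))


n, p = 7, 3
images = {tuple(F(c)) for c in bicolorations(n, p)}
print(len(images), comb(n, p))  # distinct images versus number of bicolorations
print(F("BWWBWBW")[-1])         # every path ends at (p, n - p)
```

The two counts printed on the first line agree, which is what the bijection requires, since both sets have \(\binom{n}{p}\) elements.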
Let us denote the lower right corner in \({\mathcal {P}} (p,n-p)\) by \(Q=(p,0)\) and the upper left corner by \(Q'= (0,n-p)\). Observe that the two extreme paths going through Q and \(Q'\) correspond to the only bicolored oriented caterpillar trees T with transfer index \(\phi (\chi _p,T) = 0\): all black leaves cluster on one side and all white leaves on the other side (green path in Fig. 3). Moreover, we are able to retrieve the transfer index for any bicolored oriented caterpillar tree from the associated lattice path, as we demonstrate in the following proposition. We use M(A, B) to denote the Manhattan distance between any two lattice points \(A,B\in {\mathbb {Z}}^2\), and \(M(\gamma ,B) = \min _{A\in \gamma } M(A,B)\) to denote the Manhattan distance between a lattice path \(\gamma \) and a lattice point B.
Proposition 2
Proof
Consider an oriented caterpillar tree T, a bicoloration \(\chi _p\), and the corresponding lattice path \(\gamma \) from (0, 0) to \((p,n-p)\). Let us denote the \(n-1\) consecutive internal lattice points in \(\gamma \) by \(P_1 = (x_1,y_1),\ldots , P_{n-1}=(x_{n-1},y_{n-1})\). Also, use \(b_{i}\) to denote the internal branch in T between tips i and \(i+1\), for \(2\le i\le n-2\). Lastly, let \(b_{1}\) and \(b_{n-1}\) be the pendant branches of tips 1 and n respectively (Fig. 3, left).
Finally, all branches on the caterpillar tree that are not on the path from leaf 1 to leaf n are pendant branches. The minimum over all the pendant branches is equal to \(p-1\), obtained on any black leaf, so the transfer index is at most \(p-1\), as stated in the proposition. Also notice that, in the case of a bicolored cherry, the choice of the labels (1 versus 2 and \(n-1\) versus n) has no influence on the result since the distance from any of these pendant branches to the reference bipartition is at least \(p-1\). Since we have covered the distance obtained on any branch of the tree, we achieve the desired result. \(\square \)
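The argument of the proof can be checked numerically. The sketch below assumes that the relation being proved reads \(\phi (\chi _p,T) = \min \{p-1, M(\gamma ,Q), M(\gamma ,Q')\}\), which is what the steps above yield, and compares it with a direct computation over all branches of the caterpillar. Colorings are 0/1 tuples (1 = black) with \(2\le p\le n/2\); all names are illustrative.

```python
from itertools import combinations


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def index_from_branches(colors):
    """Transfer index computed directly from every branch of the caterpillar."""
    n = len(colors)
    splits = [[int(j == i) for j in range(n)] for i in range(n)]         # pendant
    splits += [[int(j < i) for j in range(n)] for i in range(2, n - 1)]  # internal
    return min(min(hamming(colors, s), n - hamming(colors, s)) for s in splits)


def index_from_path(colors):
    """Same quantity read off the lattice path: min{p-1, M(path,Q), M(path,Q')}."""
    n, p = len(colors), sum(colors)
    q, q_prime = (p, 0), (0, n - p)
    x = y = 0
    best = p - 1
    for c in colors[:-1]:  # internal lattice points P_1, ..., P_{n-1}
        x, y = x + c, y + (1 - c)
        to_q = abs(x - q[0]) + abs(y - q[1])
        to_q_prime = abs(x - q_prime[0]) + abs(y - q_prime[1])
        best = min(best, to_q, to_q_prime)
    return best


def colorings(n):
    """All 0/1 colorings of n tips with 2 <= p <= n/2 black tips."""
    for p in range(2, n // 2 + 1):
        for blacks in combinations(range(n), p):
            chosen = set(blacks)
            yield tuple(int(i in chosen) for i in range(n))


print(all(index_from_branches(c) == index_from_path(c) for c in colorings(9)))
```

The endpoints of the path can be ignored in the minimum because they lie at Manhattan distance p and \(n-p\) from Q, and \(n-p\) and p from \(Q'\), all of which exceed \(p-1\).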
5.2 Counting bicolorations through lattice paths: the transfer index distribution
Lattice paths under certain restrictions appear in various problems in probability and statistics, such as the classical ballot problem (for instance, see Feller (1968)), which leads to counting lattice paths that do not touch the diagonal \(y=x\). Here, we are interested in a slightly different problem, in the sense that we count NE lattice paths that are not allowed to touch certain boundaries.
More precisely, for fixed n and p, consider \(2\le l\le p+1\) and let \({\mathscr {L}}(n,p,l)\) denote the subset of paths in \({\mathcal {P}} (p,n-p)\) that do not touch \(y = x - l\) or \(y = x + (n-2p+l)\). Set \(L(n,p,l):=|{\mathscr {L}}(n,p,l)|\). The following result is from Mohanty (1979). Here, we will give a sketch of the proof that is slightly different from the one given by Mohanty (1979), since it will be useful for understanding the upcoming results. We use \({\lfloor }{\cdot }{\rfloor }\) and \({\lceil }{\cdot }{\rceil }\) to denote respectively floor and ceiling functions, and \(\mathbb {1}(\cdot )\) for the indicator function.
Lemma 4
Proof
We can now establish the main theorem in this section.
Theorem 2
Proof
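The counts \(L(n,p,l)\) can also be obtained numerically, without any closed formula, by dynamic programming over the lattice. The sketch below does this and then derives the distribution of the transfer index from the observation that a path avoids both boundaries exactly when its Manhattan distances to Q and \(Q'\) are at least \(p-l+1\), so that the number of bicolorations with \(\phi \ge k\) is \(L(n,p,p-k+1)\) for \(0\le k\le p-1\). That identity is derived here from the lattice-path correspondence and is checked against brute force; it is not a transcription of the theorem, and all function names are illustrative.

```python
from itertools import combinations
from math import comb


def count_paths(n, p, l):
    """L(n,p,l): NE paths from (0,0) to (p,n-p) that never touch the lines
    y = x - l or y = x + (n - 2p + l); assumes 2 <= l <= p + 1."""
    upper = n - 2 * p + l
    ways = [[0] * (n - p + 1) for _ in range(p + 1)]
    ways[0][0] = 1
    for x in range(p + 1):
        for y in range(n - p + 1):
            if (x, y) == (0, 0) or x - y == l or y - x == upper:
                continue  # start point is already set; boundary points stay at 0
            ways[x][y] = (ways[x - 1][y] if x else 0) + (ways[x][y - 1] if y else 0)
    return ways[p][n - p]


def index_distribution(n, p):
    """P(phi = k) for k = 0..p-1 under a uniform bicoloration with p black tips."""
    total = comb(n, p)
    at_least = [count_paths(n, p, p - k + 1) for k in range(p)] + [0]
    return [(at_least[k] - at_least[k + 1]) / total for k in range(p)]


def brute_force_distribution(n, p):
    """Same distribution by enumerating every bicoloration (small n only)."""
    counts = [0] * p
    for blacks in combinations(range(n), p):
        chosen = set(blacks)
        x = y = 0
        phi = p - 1
        for i in range(n - 1):  # internal lattice points of the path
            if i in chosen:
                x += 1
            else:
                y += 1
            phi = min(phi, p - x + y, x + n - p - y)
        counts[phi] += 1
    return [c / comb(n, p) for c in counts]


print(index_distribution(9, 4))
print(brute_force_distribution(9, 4))
```

The two printed lists should coincide. The first entry equals \(2/\binom{n}{p}\), matching the two extreme paths through Q and \(Q'\).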
6 Discussion
The results we obtained in Sects. 3 and 4 allow us to characterize the asymptotic behavior of the transfer support when n tends towards \(\infty \), for various growth rates of p, up to \(p = {\lfloor }{n/2}{\rfloor }\), and for any given binary phylogenetic tree. These results demonstrate that the normalization by \(p-1\) proposed by Lemoine et al. (2018) is fully justified for large n and irrespective of the shape of the inferred tree: in the absence of phylogenetic signal, TBE (i.e. the average of TS over all bootstrap trees) is close to 0. The same holds for the standard Felsenstein’s phylogenetic bootstrap, as with “random bootstrap trees” the chance of exactly recovering the reference branch is close to zero, even for small values of p. On the other hand, Lemoine et al. (2018) used real and simulated data to show that, in the presence of strong phylogenetic signal, Felsenstein’s supports can be close to zero, whereas TBE reveals the signal and provides high support to branches that are (nearly) correct. Moreover, the TBE values can be interpreted in terms of the proportion of stable versus unstable taxa, and the unstable taxa (e.g. recombinant sequences with virus data) can be identified in further analysis, and studied to explain their phylogenetic instability.
However, the bounds we obtained in Sect. 4 are not sufficiently tight to justify what we observe in simulations in Fig. 1. If we think about the applications, these bounds might not be sufficient to give good estimates for the p-values of the TS and TBE distributions in the absence of phylogenetic signal. We propose two conjectures that would allow us to use the exact results obtained for the caterpillar tree as a proxy for the statistical significance of TS and TBE.
The first conjecture concerns the extreme case \(p={\lfloor }{n/2}{\rfloor }\). Based on simulation results (Fig. 1 and Lemoine et al. (2018)) and the proofs in Sect. 4, we believe that for any given phylogeny, the expected value of TS attains its maximum over p at \({\lfloor }{n/2}{\rfloor }\). The second conjecture (based on Fig. 1 and other experiments, not shown) is that with \(p={\lfloor }{n/2}{\rfloor }\), the expected value of TS is highest for caterpillar trees, among all possible tree topologies.
These two conjectures, combined with the fact that TBE is obtained by averaging TS, and thus necessarily has a small variance compared to its mean (see also Fig. 1), form the basis of a simple test to reject the hypothesis that a branch support could be obtained by chance.
For instance, consider a tree of 128 tips, a bootstrap study with 100 replicates, and a reference branch with \(p = n/2 = 64\). Using the exact results of Sect. 5, we can compute the mean and variance of the TS distribution in the caterpillar null model, and apply the Central Limit Theorem to approximate the distribution of TBE; the results show that when TBE is larger than 0.147, the p-value of the null hypothesis/model is less than \(10^{-3}\). With 256, 512, and 1024 tips, the corresponding \(0.1\%\) confidence levels (for \(p=n/2\)) are equal to 0.108, 0.078, and 0.057, respectively (see Appendix B for more details). When the number of replicates is large (e.g. 1000), the standard deviation of TBE is nearly null and the \(0.1\%\) confidence level is roughly equal to the expected value of TS, as shown in Table 1 in Appendix B. Any branch with TBE support less than or equal to these bounds could have been obtained by chance and might not reflect any phylogenetic signal. Conversely, standard levels of branch support using TBE (say \(>70\%\), following Hillis and Bull (1993)) cannot be observed by chance, and reveal a strong phylogenetic signal in the data, even with small trees. Such high support does not tell us that the branch is “true”, but estimates the number of stable versus unstable taxa, given the reconstruction method used to infer the original and bootstrap trees. As explained by Felsenstein (1985), an inconsistent method (e.g. subject to long branch attraction) can produce erroneous trees with high branch supports (TBE as well as Felsenstein’s bootstrap).
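A calculation of this type can be sketched as follows. The code assumes the caterpillar null model with a uniformly random bicoloration, independent bootstrap replicates, and a one-sided normal quantile; it obtains the exact TS distribution by dynamic programming over lattice paths rather than from a closed formula. Whether this matches the procedure of Appendix B in every detail is an assumption, and the function names are illustrative, so the output should be read as comparable to the thresholds quoted above rather than as a reproduction of them.

```python
from math import comb, sqrt
from statistics import NormalDist


def count_paths(n, p, l):
    """NE paths from (0,0) to (p,n-p) that never touch the lines
    y = x - l or y = x + (n - 2p + l); assumes 2 <= l <= p + 1."""
    upper = n - 2 * p + l
    ways = [[0] * (n - p + 1) for _ in range(p + 1)]
    ways[0][0] = 1
    for x in range(p + 1):
        for y in range(n - p + 1):
            if (x, y) == (0, 0) or x - y == l or y - x == upper:
                continue  # boundary points keep a count of 0
            ways[x][y] = (ways[x - 1][y] if x else 0) + (ways[x][y - 1] if y else 0)
    return ways[p][n - p]


def ts_moments(n, p):
    """Mean and variance of TS = 1 - phi/(p-1) in the caterpillar null model."""
    total = comb(n, p)
    at_least = [count_paths(n, p, p - k + 1) for k in range(p)] + [0]
    probs = [(at_least[k] - at_least[k + 1]) / total for k in range(p)]
    values = [1 - k / (p - 1) for k in range(p)]
    mean = sum(q * v for q, v in zip(probs, values))
    var = sum(q * (v - mean) ** 2 for q, v in zip(probs, values))
    return mean, var


def tbe_threshold(n, p, replicates, alpha=1e-3):
    """Normal (CLT) approximation of the TBE value exceeded with probability
    alpha when TBE averages `replicates` independent TS values."""
    mean, var = ts_moments(n, p)
    z = NormalDist().inv_cdf(1 - alpha)
    return mean + z * sqrt(var / replicates)


for tips in (128, 256):
    print(tips, round(tbe_threshold(tips, tips // 2, 100), 3))
```

Increasing `replicates` shrinks the second term, so the threshold approaches the expected value of TS, in line with the remark about 1000 replicates.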
For trees that are not caterpillars, deriving the distribution of the transfer index under random bicolorations appears to be challenging. It would be relevant for both theoretical and applicative reasons to characterize this distribution for a random model such as Yule or PDA, which are the most commonly used in phylogenetics.
Notes
Acknowledgements
This work was supported by the EU-H2020 Virogenesis project (Grant No. 634650 – JT, OG), and by the INCEPTION project (PIA/ANR-16-CONV-0005 – MDF, FL, OG).
References
 André D (1887) Solution directe du problème résolu par M. Bertrand. CR Acad Sci Paris 105:436–437
 Boc A, Philippe H, Makarenkov V (2010) Inferring and validating horizontal gene transfer events using bipartition dissimilarity. Syst Biol 59(2):195–211. https://doi.org/10.1093/sysbio/syp103
 Bogdanowicz D, Giaro K (2012) Matching split distance for unrooted binary phylogenetic trees. IEEE/ACM Trans Comput Biol Bioinf 9(1):150–160. https://doi.org/10.1109/tcbb.2011.48
 Charon I, Denœud L, Guénoche A, Hudry O (2006) Maximum transfer distance between partitions. J Classif 23(1):103–121. https://doi.org/10.1007/s00357-006-0006-2
 Day WHE (1981) The complexity of computing metric distances between partitions. Math Soc Sci 1(3):269–287. https://doi.org/10.1016/0165-4896(81)90042-1
 Day WHE (1985) Optimal algorithms for comparing trees with labeled leaves. J Classif 2(1):7–28. https://doi.org/10.1007/BF01908061
 Denœud L (2008) Transfer distance between partitions. Adv Data Anal Classif 2(3):279–294. https://doi.org/10.1007/s11634-008-0029-0
 Dubhashi DP, Panconesi A (2009) Concentration of measure for the analysis of randomized algorithms. Cambridge University Press, Cambridge
 Farris JS (1970) Methods for computing Wagner trees. Syst Zool 19(1):83–92
 Feller W (1968) An introduction to probability theory and its applications, vol 1. Wiley, New York
 Felsenstein J (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791
 Fitch WM (1971) Toward defining the course of evolution: minimum change for a specific tree topology. Syst Zool 20(4):406–416
 Fréchet M (1940) Les probabilités associées à un système d’événements compatibles et dépendants. Actualités scientifiques et industrielles, Hermann & Cie
 Hartigan JA (1973) Minimum mutation fits to a given tree. Biometrics 29(1):53–65
 Hillis DM, Bull JJ (1993) An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst Biol 42(2):182–192. https://doi.org/10.1093/sysbio/42.2.182
 Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am Stat Assoc 58(301):13–30
 Lemoine F, Domelevo Entfellner JB, Wilkinson E, Correia D, Dávila Felipe M, De Oliveira T, Gascuel O (2018) Renewing Felsenstein’s phylogenetic bootstrap in the era of big data. Nature 556(7702):452–456. https://doi.org/10.1038/s41586-018-0043-0
 Lin Y, Rajan V, Moret BME (2012) A metric for phylogenetic trees based on matching. IEEE/ACM Trans Comput Biol Bioinf 9(4):1014–1022. https://doi.org/10.1109/TCBB.2011.157
 Mohanty G (1979) Lattice path counting and applications. In: Probability and mathematical statistics. Academic Press, Cambridge
 Régnier S (1965) Sur quelques aspects mathématiques des problèmes de classification automatique. ICC Bull 4:175–191
 Robinson D, Foulds L (1981) Comparison of phylogenetic trees. Math Biosci 53(1):131–147. https://doi.org/10.1016/0025-5564(81)90043-2
 Semple C, Steel M (2003) Phylogenetics. Oxford University Press, Oxford
 Steel M (2016) Phylogeny: discrete and random processes in evolution. CBMS-NSF Regional Conference Series on Mathematics, Society for Industrial and Applied Mathematics
 Steel M, Penny D (2005) Maximum parsimony and the phylogenetic information in multistate characters. In: Albert V (ed) Parsimony, phylogeny and genomics, chap 9. Oxford University Press, Oxford, pp 163–178
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.