## Abstract

The neighbor-joining algorithm for phylogenetic inference (NJ) has been seen to have three specific properties when applied to distance matrices that contain an admixed taxon: (1) antecedence of clustering, in which the admixed taxon agglomerates with one of its source taxa before the two source taxa agglomerate with each other; (2) intermediacy of distances, in which the distance on an inferred NJ tree between an admixed taxon and either of its source taxa is smaller than the distance between the two source taxa; and (3) intermediacy of path lengths, in which the number of edges separating the admixed taxon and either of its source taxa is less than or equal to the number of edges between the source taxa. We examine the behavior of neighbor-joining on distance matrices containing an admixed group, investigating the occurrence of antecedence of clustering, intermediacy of distances, and intermediacy of path lengths. We first mathematically predict the frequency with which the properties are satisfied for a labeled unrooted binary tree selected uniformly at random in the absence of admixture. We then introduce a taxon constructed by a linear admixture of distances from two source taxa, examining three admixture scenarios by simulation: a model in which distance matrices are chosen at random, a model in which an admixed taxon is added to a set of taxa that reflect treelike evolution, and a model that introduces a perturbation of the treelike scenario. In contrast to previous conjectures, we observe that the three properties are sometimes violated by distance matrices that include an admixed taxon. However, we also find that they are satisfied more often than is expected by chance when the distance matrix contains an admixed taxon, especially when evolution among the non-admixed taxa is treelike. The results contribute to a deeper understanding of the nature of evolutionary trees constructed from data that do not necessarily reflect a treelike evolutionary process.

## References

Aldous DJ (2001) Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Statist Sci 16:23–34

Atteson K (1999) The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica 25:251–278

Blum MGB, François O (2006) Which random processes describe the tree of life? A large-scale study of phylogenetic tree imbalance. Syst Biol 55:685–691

Boca SM, Rosenberg NA (2011) Mathematical properties of \({F}_{st}\) between admixed populations and their parental source populations. Theor Popul Biol 80:208–216

Bowcock AM, Kidd JR, Mountain JL, Hebert JM, Carotenuto L, Kidd KK, Cavalli-Sforza LL (1991) Drift, admixture, and selection in human evolution: a study with DNA polymorphisms. Proc Nat Acad Sci USA 88:839–843

Bryant D (2005) On the uniqueness of the selection criterion in neighbor-joining. J Classif 22:3–15

Buneman P (1974) A note on the metric properties of trees. J Combin Theory Ser B 17:48–50

Cavalli-Sforza LL, Menozzi P, Piazza A (1994) The history and geography of human genes. Princeton University Press, Princeton

Cueto MA, Matsen FA (2011) Polyhedral geometry of phylogenetic rogue taxa. Bull Math Biol 73:1202–1226

Dantzig GB, Eaves BC (1973) Fourier-Motzkin elimination and its dual. J Comb Theory Ser A 14:288–297

Davidson R, Rusinko J, Vernon Z, Xi J (2017) Modeling the distribution of distance data in Euclidean space. In: Harrington HA, Omar M, Wright M (eds) Algebraic and geometric methods in discrete mathematics. American Mathematical Society, Providence, pp 117–136

Eickmeyer K, Huggins P, Pachter L, Yoshida R (2008) On the optimality of the neighbor-joining algorithm. Algorithms Mol Biol 3:5

Eickmeyer K, Yoshida R (2008) The geometry of the neighbor-joining algorithm for small trees. In: Horimoto K, Regensburger G, Rosenkranz M, Yoshida H (eds) Algebraic Biology: AB 2008. Lecture Notes in Computer Science, vol 5147. Springer, Berlin, pp 81–95

Felsenstein J (1984) Distance methods for inferring phylogenies: a justification. Evolution 38:16–24

Felsenstein J (2004) Inferring phylogenies. Sinauer, Sunderland

Gascuel O, Steel M (2006) Neighbor-joining revealed. Mol Biol Evol 23:1997–2000

Kopelman NM, Stone L, Gascuel O, Rosenberg NA (2013) The behavior of admixed populations in neighbor-joining inference of population trees. Pacific Symp Biocomput 18:273–284

Mountain JL, Cavalli-Sforza LL (1994) Inference of human evolution through cladistic analysis of nuclear DNA restriction polymorphisms. Proc Nat Acad Sci USA 91:6515–6519

Nee S (2006) Birth-death models in macroevolution. Annu Rev Ecol Evol Syst 37:1–17

Robinson DF, Foulds LR (1981) Comparison of phylogenetic trees. Math Biosci 53:131–147

Ruiz-Linares A, Minch E, Meyer D, Cavalli-Sforza LL (1995) Analysis of classical and DNA markers for reconstructing human population history. In: Brenner S, Hanihara K (eds) The origin and past of modern humans as viewed from DNA. World Scientific, Singapore, pp 123–148

Saitou N, Nei M (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425

Sanderson MJ, Shaffer HB (2002) Troubleshooting molecular phylogenetic analyses. Annu Rev Ecol Syst 33:49–72

Schrijver A (1986) Theory of linear and integer programming. Wiley, Chichester

Steel M (2016) Phylogeny: discrete and random processes in evolution. Society for Industrial and Applied Mathematics, Philadelphia

Studier JA, Keppler KJ (1988) A note on the neighbor-joining algorithm of Saitou and Nei. Mol Biol Evol 5:729–731

Thomson RC, Shaffer HB (2010) Sparse supermatrices for phylogenetic inference: taxonomy, alignment, rogue taxa, and the phylogeny of living turtles. Syst Biol 59:42–58

Westover KM, Rusinko JP, Hoin J, Neal M (2013) Rogue taxa phenomenon: a biological companion to simulation analysis. Mol Phylogenet Evol 69:1–3

Yule GU (1925) A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character 213:21–87

Ziegler GM (1995) Lectures on Polytopes. Springer-Verlag, New York, NY

## Acknowledgements

This work has been supported by National Institutes of Health grant R01 GM117590, by National Science Foundation grant BCS-1515127, and by a 2014 Rita Levi Montalcini grant from the Ministero dell’Istruzione, dell’Università e della Ricerca.

## Appendix

### A The Q-Criterion

For each agglomerative step of the NJ algorithm, the algorithm evaluates the *Q-criterion* (Bryant 2005; Gascuel and Steel 2006) and picks a pair of taxa *i* and *j* with the minimal *q* value, \(q_{ij} = (n-2)x_{ij} - \sum _{k=1}^n x_{ik} - \sum _{k=1}^n x_{kj}\). Here, \(i \ne j\) and *i*, *j* range from 1 to *n*. The Q-criterion gives a linear transformation of the original input distance matrix \(\mathbf {D}^{(n)}\).
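The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `q_criterion` and `first_pair` are hypothetical, taxa are indexed from 0, the input is assumed to be a symmetric matrix with a zero diagonal, and ties are broken by taking the smallest index pair, which is an arbitrary choice.

```python
from itertools import combinations


def q_criterion(dist):
    """Return {(i, j): q_ij} for all i < j, using 0-based taxon indices.

    q_ij = (n - 2) * x_ij - sum_k x_ik - sum_k x_kj, as in the text.
    """
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    return {
        (i, j): (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
        for i, j in combinations(range(n), 2)
    }


def first_pair(dist):
    """Pair with the minimal q value (ties: smallest index pair, an assumption)."""
    q = q_criterion(dist)
    return min(sorted(q), key=q.get)


# Additive distances on the 4-taxon tree ((0,1),(2,3)) with all edges of length 1.
D = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
print(first_pair(D))  # the cherry (0, 1); the pair (2, 3) has the same q value
```

In this example the two cherries of the generating tree share the minimal value, which is the 4-taxon symmetry between complementary pairs.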

Capitalizing on the linearity to make use of matrix algebra (Eickmeyer and Yoshida 2008), we write \(\mathbf {Q} = \mathbf {A}^{(n)} \mathbf {x}^{(n)}\). In the matrix \(\mathbf {A}^{(n)}\), *a* and *b* represent taxon pairs, ranging from 1 to \({n \atopwithdelims ()2}\), and *r*, \(\ell \), *s*, and *t* represent taxa.
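The entries of \(\mathbf {A}^{(n)}\) can be read off by expanding the sums in \(q_{ij}\). The derivation below is a sketch under the assumptions that the distance matrix is symmetric with zero diagonal (\(x_{ii}=0\), \(x_{ij}=x_{ji}\)); the case notation for \(A^{(n)}_{ab}\) is ours and may differ from the original display.

```latex
\begin{aligned}
q_{ij} &= (n-2)x_{ij} - \sum_{k=1}^n x_{ik} - \sum_{k=1}^n x_{kj} \\
       &= (n-2)x_{ij} - 2x_{ij} - \sum_{k \ne i,j} (x_{ik} + x_{kj}) \\
       &= (n-4)x_{ij} - \sum_{k \ne i,j} (x_{ik} + x_{kj}),
\end{aligned}
\qquad
A^{(n)}_{ab} =
\begin{cases}
n-4 & \text{if } a = b, \\
-1  & \text{if pairs } a \text{ and } b \text{ share exactly one taxon}, \\
0   & \text{if pairs } a \text{ and } b \text{ are disjoint}.
\end{cases}
```

For \(n=4\) the diagonal vanishes, so each \(q_{ij}\) is minus the sum of the four distances between \(\{i,j\}\) and the complementary pair, and complementary pairs therefore have equal \(q\) values.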

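For example, for \(n=4\), annotating rows and columns of \(\mathbf {A}^{(4)}\) by the order of the entries in \(\mathbf {x}^{(4)} = [x_{21}, x_{31}, x_{32}, x_{41}, x_{42}, x_{43}]^\intercal \),

\[
\mathbf {A}^{(4)} =
\begin{array}{c|cccccc}
 & x_{21} & x_{31} & x_{32} & x_{41} & x_{42} & x_{43} \\ \hline
q_{21} & 0 & -1 & -1 & -1 & -1 & 0 \\
q_{31} & -1 & 0 & -1 & -1 & 0 & -1 \\
q_{32} & -1 & -1 & 0 & 0 & -1 & -1 \\
q_{41} & -1 & -1 & 0 & 0 & -1 & -1 \\
q_{42} & -1 & 0 & -1 & -1 & 0 & -1 \\
q_{43} & 0 & -1 & -1 & -1 & -1 & 0
\end{array}
\]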

The process of cherry picking continues until only three nodes are left, at which point the three remaining taxon groups are joined to a shared node. NJ produces a particular tree topology given an input distance matrix \(\mathbf {D}^{(n)}\) if and only if the original distances \(x_{ij}\) satisfy an associated set of linear inequalities defined by the Q-criterion.
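The cherry-picking loop can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the paper: the function name `nj_topology` is hypothetical, only the topology is tracked (no branch lengths), ties in \(q\) are broken by taking the first minimal pair in index order, and distances to a newly formed node use the standard NJ reduction \((x_{ik} + x_{jk} - x_{ij})/2\) (Saitou and Nei 1987; Studier and Keppler 1988).

```python
from itertools import combinations


def nj_topology(dist, labels):
    """Return the unrooted NJ topology as a tuple of three nested label groups.

    Sketch only: branch lengths are not computed, and ties in q are broken
    by taking the first minimal pair in index order (an assumption).
    """
    d = [[float(v) for v in row] for row in dist]
    nodes = list(labels)
    while len(nodes) > 3:
        n = len(nodes)
        sums = [sum(row) for row in d]
        # Pick the pair (i, j) with the minimal Q-criterion value.
        i, j = min(
            combinations(range(n), 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - sums[p[0]] - sums[p[1]],
        )
        keep = [k for k in range(n) if k not in (i, j)]
        # Standard reduction: distance from the new node to each remaining node.
        new_row = [(d[i][k] + d[j][k] - d[i][j]) / 2 for k in keep]
        d = [[d[a][b] for b in keep] + [new_row[idx]] for idx, a in enumerate(keep)]
        d.append(new_row + [0.0])
        nodes = [nodes[k] for k in keep] + [(nodes[i], nodes[j])]
    # The three remaining groups are joined at a shared node.
    return tuple(nodes)


# Additive distances on the tree ((t1,t2),t3,(t4,t5)) with all edges of length 1.
D5 = [
    [0, 2, 3, 4, 4],
    [2, 0, 3, 4, 4],
    [3, 3, 0, 3, 3],
    [4, 4, 3, 0, 2],
    [4, 4, 3, 2, 0],
]
print(nj_topology(D5, ["t1", "t2", "t3", "t4", "t5"]))
```

The returned tuple lists the three groups attached to the final shared node; for the example it is `('t4', 't5', ('t3', ('t1', 't2')))`, which is the unrooted topology \(((t_1,t_2),t_3,(t_4,t_5))\) written from that node.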

### B Correction to Section 4.1 of Kopelman et al. (2013)

In the case of an additive *n*-taxon matrix \(\mathbf {D}^{(n)}\) with taxon \(t_n\) admixed between source taxa \(t_1\) and \(t_2\), Section 4.1 of Kopelman et al. (2013) demonstrated that in the NJ tree of all *n* taxa, (1) taxa \(t_1\), \(t_2\), and \(t_n\) must be collinear, with \(t_n\) between \(t_1\) and \(t_2\). Moreover, they showed that (2) each taxon \(t_i\) for \(i=3,4,\ldots ,n-1\) must be collinear with \(t_1\), \(t_2\), and \(t_n\), with taxon \(t_i\) exterior to the path from \(t_1\) to \(t_n\) to \(t_2\). Finally, they claimed that (3) the NJ tree structure can be characterized by a line from \(t_1\) to \(t_n\) to \(t_2\), with \(t_1\) and \(t_2\) each placed at a multifurcating node to one of which each taxon \(t_3,t_4,\ldots ,t_{n-1}\) must be connected by a single edge (depicted in their Fig. 3b).

We comment here that although points (1) and (2) are correct, the claim (3) does not follow from (1) and (2) as assumed by Kopelman et al. (2013). It is possible to place the various taxa \(t_i\) for \(i=3,4,\ldots ,n-1\) in relation to the line segment \(t_1 - t_n - t_2\) in a manner in which each \(t_i\) is collinear with and exterior to the segment, but the nodes for \(t_1\) and \(t_2\) are not necessarily multifurcating, and the \(t_i\) are not necessarily connected to one of those nodes by only a single edge. It follows from (1) and (2) merely that exterior to the segment \(t_1 - t_n - t_2\) are two possibly but not necessarily multifurcating trees, one rooted at \(t_1\) and the other rooted at \(t_2\), and that each taxon \(t_i\) for \(i=3,4,\ldots ,n-1\) is placed in one of those trees (Fig. 12).

### C Proofs for the 5-Taxon Case

In this appendix, we provide the proofs of Propositions 1, 2, and 3 stated in Sect. 3.1 pertaining to distance matrices with admixture \(\mathbf {D}^{(5)}\) constructed from random source matrices \(\mathbf {S}^{(4)}\).

From the linear mixture model in Eq. 5, the distance matrix for \(n=5\) taxa, including one admixed taxon \(t_5\) and two source taxa \(t_1\) and \(t_2\), is:

with positive non-diagonal entries (\(x_{ij} > 0\) for all *i* and *j* with \(i \ne j\)).

Using the matrix representation in Eq. 8, the Q-criterion in the first step in NJ with 5 taxa can be written as \(\mathbf {Q} = \mathbf {A}^{(5)} \mathbf {x}^{(5)}\), where \(\mathbf {A}^{(5)} \in \mathbb {R}^{10 \times 10}\), \(\mathbf {x}^{(5)} \in \mathbb {R}^{10 \times 1}\). We have

denoting by \(\mathbf {M}\) the matrix in Eq. 11, with \(\mathbf {s}^{(4)}=[x_{21}, x_{31}, x_{32}, x_{41}, x_{42}, x_{43}]^\intercal \).

Only six independent variables appear in the source distance vector \(\mathbf {s}^{(4)}\), and we can rewrite the Q-criterion as

where

and \(\tilde{\mathbf {a}}_{\varvec{i}}^{(5)}\) \((i=1, \ldots , 10)\) is the *i*th row vector in \(\tilde{\mathbf {A}}^{(5)}\). The pair that is agglomerated is the pair corresponding to the row with the minimal value in \(\mathbf {Q}^{(5)}\).

### Proof of Proposition 1

We must show that \((t_1,t_2)\), \((t_3,t_5)\), and \((t_4,t_5)\) cannot be the minimum for the Q-criterion in the first step of NJ. We begin with \((t_1, t_2)\). By contradiction, suppose that \((t_1, t_2)\) has the minimum Q-criterion. This means that \(q_{21} = \tilde{\mathbf {a}}_{1}^{(5)} \cdot \mathbf {s}^{(4)}\) is less than or equal to \(q_{\ell m} = \tilde{\mathbf {a}}_{\varvec{i}}^{(5)} \cdot \mathbf {s}^{(4)}\) for all \(i=2,3,\ldots ,10\). In other words,

This linear system of 9 inequalities in the 6 entries of \(\mathbf {s}^{(4)}\) can be written in matrix form as \(\mathbf {B}^{(5)} \mathbf {s}^{(4)} \le \mathbf {0}\), where \(\mathbf {0} = [0,\ldots , 0]^\intercal \in \mathbb {R}^{9 \times 1}\) and the *i*th row of the \(9 \times 6\) matrix \(\mathbf {B}^{(5)}\) is \(\tilde{\mathbf {a}}_{1}^{(5)} - \tilde{\mathbf {a}}_{\varvec{i+1}}^{(5)}\):

We use Fourier–Motzkin elimination (FME, Dantzig and Eaves 1973; Schrijver 1986; Ziegler 1995) to prove that this system of linear inequalities \(\mathbf {B}^{(5)} \mathbf {s}^{(4)} \le \mathbf {0}\) with the constraint \(\mathbf {s}^{(4)} > \mathbf {0}\) has no solution, thereby showing that \((t_1, t_2)\) cannot have the lowest Q-criterion and hence cannot be the first pair to agglomerate in the NJ algorithm.

In FME, we sequentially eliminate variables from the system of linear inequalities, at each step transforming the system into a new system that has a solution if and only if the previous system does. Either we eliminate all variables and reach a set of trivially satisfied inequalities involving only constants, in which case the system has solutions, or we reach an inequality with no solution, in which case the system has no solution.

Briefly, consider a system of inequalities \(\mathbf {Bx} \le \mathbf {c}\), where \(\mathbf {B} \in \mathbb {R}^{m \times n}\), \(\mathbf {x} \in \mathbb {R}^{n \times 1}\), and \(\mathbf {c} \in \mathbb {R}^{m \times 1}\). Each row of \(\mathbf {B}\) represents an inequality, and the solution space for the system is \(\mathcal {P}=\{\mathbf {x} \mid \mathbf {Bx} \le \mathbf {c}\} \subset \mathbb {R}^n\). FME eliminates the variable \(x_k\) by performing row operations on the augmented matrix \((\mathbf {B} \mid \mathbf {c})\). In Step 1, we rearrange the inequalities in \(\mathbf {Bx} \le \mathbf {c}\) into three sets, based on the signs of the *k*th column: \(I_+ = \{i \in \{1,2,\ldots , m\} \mid b_{ik} > 0 \}\), \(I_- = \{i \in \{1,2,\ldots , m\} \mid b_{ik} < 0 \}\), and \(I_0 = \{i \in \{1,2,\ldots , m\} \mid b_{ik} = 0\}\).

In Step 2, for each \(i \in I_+ \cup I_-\), we divide row \((b_{i1}, \ldots , b_{in} \mid c_i)\) by \(|b_{ik}|\) so that the elements in column *k* of \((\mathbf {B} \mid \mathbf {c})\) are \(1\), \(-1\), or 0. In Step 3, we then define a new set of inequalities:

Because the coefficient of \(x_k\) is normalized to \(+1\) for row *r* and \(-1\) for row *s*, the combined new set of inequalities does not have \(x_k\). In other words, column *k* has zeroes as entries. We now have a total of \(|I_+||I_-| + |I_0|\) new inequalities in \(x_1, x_2, \ldots , x_{k-1}, x_{k+1}, \ldots , x_n\). The system of inequalities has solutions if and only if FME can successively eliminate all the variables without generating a contradiction. Otherwise, if the system becomes inconsistent at some point during the elimination, then it has no solutions.
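The three steps can be sketched in Python with exact rational arithmetic. This is a minimal illustration under stated assumptions: the names `fme_eliminate` and `is_contradiction` are hypothetical, each row is stored as \((b_{i1}, \ldots , b_{in}, c_i)\), and the toy system at the end is our own example, not the matrix \(\mathbf {B}^{(5)}\) of the proof.

```python
from fractions import Fraction


def fme_eliminate(rows, k):
    """One Fourier-Motzkin step on rows (b_1, ..., b_n, c), each meaning b.x <= c.

    Returns a system in which every coefficient of x_k is 0 and which has a
    solution if and only if the input system does.
    """
    rows = [[Fraction(v) for v in row] for row in rows]
    # Step 1: partition the rows by the sign of the coefficient of x_k.
    pos = [r for r in rows if r[k] > 0]
    neg = [r for r in rows if r[k] < 0]
    zero = [r for r in rows if r[k] == 0]
    # Step 2: divide by |b_ik| so the coefficient of x_k becomes +1 or -1.
    pos = [[v / r[k] for v in r] for r in pos]
    neg = [[v / -r[k] for v in r] for r in neg]
    # Step 3: add each (+1) row to each (-1) row so that x_k cancels.
    combined = [[a + b for a, b in zip(r, s)] for r in pos for s in neg]
    return zero + combined


def is_contradiction(row):
    """True for a constant row 0 <= c with c < 0."""
    return all(v == 0 for v in row[:-1]) and row[-1] < 0


# Toy system in (x, y): x + y <= 1, x >= 2, y >= 0.
system = [
    [1, 1, 1],    # x + y <= 1
    [-1, 0, -2],  # -x <= -2
    [0, -1, 0],   # -y <= 0
]
step1 = fme_eliminate(system, 0)  # eliminate x
step2 = fme_eliminate(step1, 1)   # eliminate y
print(any(is_contradiction(r) for r in step2))  # True: the toy system is infeasible
```

The output system has \(|I_+||I_-| + |I_0|\) rows, matching the count in the text. In the proofs the right-hand sides are all zero and the contradiction arises against the positivity constraint on the distances rather than from a constant row, so the final check there differs from `is_contradiction`.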

We first seek to eliminate the third variable, \(x_{32}\), represented in the third column:

Note that \(0< \alpha < 1\) and \(0< 1-\alpha < 1\). The rearrangement by the sign of \(x_{32}\) gives:

Once the rows have been rearranged, we normalize each row with a nonzero entry for \(x_{32}\) by the absolute value of the entry for \(x_{32}\):

Considering linear combinations of rows in \(I_{+}\) and \(I_{-}\), we then have a new system of inequalities:

Every row of the new system is a nonnegative combination of rows of the original system, so if \(\mathbf {B}^{(5)} \mathbf {s}^{(4)} \le \mathbf {0}\), then \(\mathbf {B}^{(5)}_{\text {step}3} \mathbf {s}^{(4)} \le \mathbf {0}\). The 13th row of the new system then gives \(4x_{21} \le 0\). However, this inequality contradicts our assumption that all off-diagonal entries of distance matrices are positive, by which we must have \(x_{21} > 0\).

Similarly, we can show that pairs \((t_3, t_5)\) and \((t_4, t_5)\) cannot have the minimum Q-value and thus do not form cherries in the first step of NJ. For \((t_3,t_5)\), we start with a system of inequalities \((\tilde{\mathbf {a}}_{9}^{(5)} - \tilde{\mathbf {a}}_{\varvec{i}}^{(5)}) \cdot \mathbf {s}^{(4)} \le 0\) (\(i \ne 9\)) and eliminate variable \(x_{41}\) to reach the contradiction \(x_{21} \le 0\). For \((t_4,t_5)\), we start with a system of inequalities \((\tilde{\mathbf {a}}_{10}^{(5)} - \tilde{\mathbf {a}}_{\varvec{i}}^{(5)}) \cdot \mathbf {s}^{(4)} \le 0\) (\(i \ne 10\)) and eliminate variable \(x_{32}\) to reach the contradiction \(x_{21} \le 0\).

\(\square \)

Note that the fact that \((t_1, t_2)\), \((t_3, t_5)\), and \((t_4, t_5)\) cannot form cherries in the first step does not imply that they cannot form cherries in the final NJ tree, as they can agglomerate in subsequent iterations of the NJ algorithm. In Proposition 2, however, we show that the final NJ tree of \(\mathbf {D}^{(5)}\) cannot have \((t_1, t_2)\) as a cherry at all.

### Proof of Proposition 2

In Sect. 3.1, our simulations found that the three unrooted labeled binary tree topologies containing the source taxa \(t_1\) and \(t_2\) as a cherry are unattainable by NJ inference when distance matrices of the form of Eq. 6 are used. Such topologies are \(((t_4,t_5),t_3,(t_1,t_2))\), \(((t_3,t_5),t_4,(t_1,t_2))\), and \(((t_3,t_4),t_5,(t_1,t_2))\).

A 5-taxon unrooted binary tree has exactly two cherries, and the first pair agglomerated by NJ is one of the cherries of the final tree. Because \((t_1, t_2)\), \((t_3, t_5)\), and \((t_4, t_5)\) cannot form a cherry in the first step of cherry picking (Proposition 1), neither cherry of \(((t_4,t_5),t_3,(t_1,t_2))\) or of \(((t_3,t_5),t_4,(t_1,t_2))\) can be agglomerated first, so we have so far shown that these two topologies cannot be the final NJ tree. To show that \((t_1,t_2)\) is not a cherry in the NJ tree and therefore to prove the proposition, it remains to show that \(((t_3,t_4),t_5,(t_1,t_2))\), the last remaining topology containing \((t_1, t_2)\) as a cherry, is not accessible.

Consider a case in which pair \((t_3,t_4)\) agglomerates in the first iteration and we are left with four unclustered nodes, \(\{t_1, t_2, t_5, t_6 \}\). Here, \(t_6\) represents a clade containing \((t_3,t_4)\). Distances between the new node \(t_6\) and the remaining taxa are:

These distances can be written in matrix form as \(\tilde{\mathbf {x}} = \mathbf {R}\mathbf {x}^{(5)}\), where

Using the matrix representation in Eq. 8, the Q-criterion for the four remaining nodes is

where

As before, \(\tilde{\mathbf {a}}_{\varvec{i}}^{(4)}\) \((i=1, \ldots , 6)\) is defined as the *i*th row vector in \(\tilde{\mathbf {A}}^{(4)}\). Because \(\tilde{\mathbf {a}}_{\varvec{i}}^{(4)} = \tilde{\mathbf {a}}_{\varvec{7-i}}^{(4)}\) (\(i=1,2,3\)), the pairs \(\{(t_2,t_1),(t_6, t_5)\}\) have the same Q-criterion value, as do the pairs \(\{(t_5,t_1),(t_6, t_2)\}\) and \(\{(t_5,t_2),(t_6, t_1)\}\). To prove that \(t_1\) and \(t_2\) cannot form a cherry, it suffices to show that the Q-criterion for \((t_1, t_2)\), \(q_{21} = \tilde{\mathbf {a}}_{\varvec{1}}^{(4)} \cdot \mathbf {s}^{(4)}\), is not the minimum, so that either \(\tilde{\mathbf {a}}_{2}^{(4)} \cdot \mathbf {s}^{(4)} < \tilde{\mathbf {a}}_{\varvec{1}}^{(4)} \cdot \mathbf {s}^{(4)}\) or \(\tilde{\mathbf {a}}_{3}^{(4)} \cdot \mathbf {s}^{(4)} < \tilde{\mathbf {a}}_{\varvec{1}}^{(4)} \cdot \mathbf {s}^{(4)}\) must hold. Equivalently, at least one of the following pair of inequalities must hold:

By the assumption that \((t_3,t_4)\) is the cherry from the first iteration of the NJ algorithm, the Q-criterion for \((t_3, t_4)\), \(q_{43} = \tilde{\mathbf {a}}_{\varvec{6}}^{(5)} \cdot \mathbf {s}^{(4)}\), from the first iteration of the NJ algorithm must be minimum:

for all *i* from 1 to 10 (\(i \ne 6\)) and \(\tilde{\mathbf {a}}_{\varvec{i}}^{(5)}\) as defined in Eq. 13.

To show that Eq. 15 follows from Eq. 16, suppose to the contrary that \((t_1, t_2)\) has the minimal Q-criterion in the second iteration of the NJ algorithm, so that both of the following hold:

We then have a system of 11 linear inequalities that can be represented in matrix form, \(\mathbf {B}^{(4)} \mathbf {s}^{(4)} \le \mathbf {0}\), where

We apply FME to \(\mathbf {B}^{(4)}\), noting again that \(0< \alpha < 1\). After eliminating variable \(x_{31}\), we reach the inequality \(x_{21} \le 0\), contradicting the assumption that \(x_{21} > 0\). This proves that the pair \((t_1,t_2)\) cannot have the minimal Q-criterion in the second NJ clustering step when \((t_3,t_4)\) forms a cherry in the first NJ clustering step, completing the proof of Proposition 2. \(\square \)

### Proof of Proposition 3

We must show that if a source distance matrix generates topology \(((t_1,t_2),(t_3,t_4))\), then its corresponding admixed neighbor-joining tree satisfies Properties 1 and 3 and contains \((t_1,t_5)\) or \((t_2,t_5)\) as a cherry.

The Q-criterion for four taxa \(\{t_1, t_2, t_3, t_4\}\) is \( \mathbf {Q}^{(4)} = \mathbf {A}^{(4)} \mathbf {s}^{(4)}\),

Because \(\mathbf {s}^{(4)} \in D_{((t_1,t_2),(t_3,t_4))}^{(4)}\), the Q-criterion \(q_{21}=\mathbf {a}_{\mathbf {1}}^{(4)} \cdot \mathbf {s}^{(4)}\) for \((t_1,t_2)\), or equivalently for \((t_3,t_4)\), \(q_{43}=\mathbf {a}_{\mathbf {6}}^{(4)} \cdot \mathbf {s}^{(4)}\), is minimal. That is, for \(i=2,3\),

Consider an admixed distance matrix \(\mathbf {D}^{(5)}\) constructed from source distance matrix \(\mathbf {S}^{(4)}\). The Q-criterion for pairs in \(\mathbf {D}^{(5)}\) follows Eq. 12. Proposition 1 excludes three pairs from being the first cherry for any input admixed distance matrix, so we are left with 7 potential pairs for the first cherry: \((t_1,t_3)\), \((t_2,t_3)\), \((t_1,t_4)\), \((t_2,t_4)\), \((t_3,t_4)\), \((t_1,t_5)\), and \((t_2,t_5)\). We claim that the first four pairs, \((t_1,t_3)\), \((t_2,t_3)\), \((t_1,t_4)\), and \((t_2,t_4)\), cannot cluster in the first iteration of the NJ algorithm when \(\mathbf {s}^{(4)} \in D_{((t_1,t_2),(t_3,t_4))}^{(4)}\).

Suppose for contradiction that \((t_1,t_3)\) is the first pair to agglomerate in constructing \(\mathcal {T}_{D}^{(5)}\), so that the Q-criterion for \((t_1,t_3)\), \(q_{31} = \tilde{\mathbf {a}}_{2}^{(5)} \cdot \mathbf {s}^{(4)}\), is minimal when \(\mathbf {s}^{(4)} \in D_{((t_1,t_2),(t_3,t_4))}^{(4)}\). Using Eqs. 12 and 13, for all *i* from 1 to 10 (\(i \ne 2\)), the following must hold:

Considering this inequality along with Eqs. 18 and 19, we have a linear system of 11 inequalities in the \(x_{ij}\), \(\mathbf {B}^{(5)} \mathbf {s}^{(4)} \le \mathbf {0}\), where

Application of the FME procedure to \(\mathbf {B}^{(5)}\) to eliminate \(x_{31}\) and \(x_{41}\) results in \(x_{21} \le 0\), a contradiction. Therefore, \((t_1, t_3)\) cannot be the first cherry of \(\mathcal {T}_{D}^{(5)}\).

It follows by symmetry that \((t_2, t_3)\), \((t_1, t_4)\) and \((t_2, t_4)\) cannot cluster in the first step of the NJ algorithm when an inferred source NJ tree of \(\mathbf {S}^{(4)}\) has topology \(((t_1,t_2),(t_3,t_4))\). In the 4-taxon source distance matrix \(\mathbf {S}^{(4)}\) with \(t_1\) and \(t_2\) being source populations from which the admixed taxon \(t_5\) is created, the roles of \(t_1\) and \(t_2\) are interchangeable, as are the roles of \(t_3\) and \(t_4\).

We have so far proven that \((t_3,t_4)\), \((t_1,t_5)\), and \((t_2,t_5)\) are the only possible clusters in the first step of the construction of \(\mathcal {T}_{D}^{(5)}\) when \(\mathbf {s}^{(4)} \in D_{((t_1,t_2),(t_3,t_4))}^{(4)}\). If the pair \((t_3,t_4)\) clusters first, then four nodes, \(\{t_1, t_2, t_5, t_{34}\}\), are left to join in the second iteration. Because Proposition 2 says \((t_1,t_2)\) cannot be a cherry in any \(\mathcal {T}_{D}^{(5)}\), the only possible NJ tree topologies are \(((t_1,t_5),(t_2,t_{34}))\) and \(((t_2,t_5),(t_1,t_{34}))\). The node \(t_{34}\) represents the cluster \((t_3, t_4)\), so the final topologies are \(((t_1,t_5),t_2,(t_3, t_4))\) and \(((t_2,t_5),t_1,(t_3, t_4))\).

If the pair \((t_1,t_5)\) clusters first, then four nodes, \(\{t_2, t_3, t_4, t_{15}\}\), are left to join in the second iteration. The possible NJ tree topologies are \(((t_{15},t_2),(t_3,t_4))\), \(((t_{15},t_3),(t_2, t_4))\) and \(((t_{15},t_4),(t_2,t_3))\). Because the node \(t_{15}\) represents the cluster \((t_1, t_5)\), the final topologies are \(((t_1,t_5),t_2,(t_3, t_4))\), \(((t_1,t_5),t_3,(t_2, t_4))\) and \(((t_1,t_5),t_4,(t_2, t_3))\). By the same argument, the possible final topologies when the pair \((t_2,t_5)\) is the first cherry are \(((t_2,t_5),t_1,(t_3, t_4))\), \(((t_2,t_5),t_3,(t_1, t_4))\), and \(((t_2,t_5),t_4,(t_1, t_3))\).

All 6 topologies of \(\mathcal {T}_{D}^{(5)}\) from three possible choices for the first cluster satisfy Property 3, and the procedures for their construction comply with Property 1. Also, they contain a cherry involving one of the source taxa and the admixed taxon. \(\square \)
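The closing claim about Property 3 can be checked by brute-force enumeration. The sketch below is our own illustration, with hypothetical function names; it encodes each unrooted binary 5-taxon topology as (cherry, middle leaf, cherry) and uses the fact that in such a tree two leaves are separated by 2 edges if they form a cherry, 3 edges if one of them is the middle leaf, and 4 edges otherwise.

```python
TAXA = (1, 2, 3, 4, 5)  # t_1 and t_2 are the sources; t_5 is the admixed taxon


def topologies():
    """All 15 unrooted binary 5-taxon topologies as (cherry, middle, cherry)."""
    result = []
    for middle in TAXA:
        rest = [t for t in TAXA if t != middle]
        first = rest[0]
        for mate in rest[1:]:
            other = tuple(t for t in rest[1:] if t != mate)
            result.append(((first, mate), middle, other))
    return result


def edges_between(topology, a, b):
    """Number of edges on the path between leaves a and b."""
    left, middle, right = topology
    if {a, b} == set(left) or {a, b} == set(right):
        return 2  # the two leaves of a cherry
    if middle in (a, b):
        return 3  # a cherry leaf and the middle leaf
    return 4      # leaves in different cherries


def satisfies_property_3(topology, s1=1, s2=2, admixed=5):
    """Path-length intermediacy: admixed-to-source <= source-to-source."""
    between_sources = edges_between(topology, s1, s2)
    return (edges_between(topology, admixed, s1) <= between_sources
            and edges_between(topology, admixed, s2) <= between_sources)


def has_cherry(topology, a, b):
    return {a, b} == set(topology[0]) or {a, b} == set(topology[2])


# Topologies in which the admixed taxon forms a cherry with a source taxon.
with_source_cherry = [t for t in topologies()
                      if has_cherry(t, 1, 5) or has_cherry(t, 2, 5)]
print(len(with_source_cherry), all(map(satisfies_property_3, with_source_cherry)))
```

The six topologies selected by `with_source_cherry` are exactly those listed in the proof, and each passes the path-length check, whereas the three topologies with \((t_1,t_2)\) as a cherry all fail it.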

## About this article

### Cite this article

Kim, J., Disanto, F., Kopelman, N.M. *et al.* Mathematical and Simulation-Based Analysis of the Behavior of Admixed Taxa in the Neighbor-Joining Algorithm. *Bull Math Biol* **81**, 452–493 (2019). https://doi.org/10.1007/s11538-018-0444-0

DOI: https://doi.org/10.1007/s11538-018-0444-0

### Keywords

- Admixture
- Neighbor-joining
- Phylogenetics
- Rogue taxon