Mathematical and Simulation-Based Analysis of the Behavior of Admixed Taxa in the Neighbor-Joining Algorithm

Abstract

The neighbor-joining algorithm for phylogenetic inference (NJ) has been seen to have three specific properties when applied to distance matrices that contain an admixed taxon: (1) antecedence of clustering, in which the admixed taxon agglomerates with one of its source taxa before the two source taxa agglomerate with each other; (2) intermediacy of distances, in which the distance on an inferred NJ tree between an admixed taxon and either of its source taxa is smaller than the distance between the two source taxa; and (3) intermediacy of path lengths, in which the number of edges separating the admixed taxon and either of its source taxa is less than or equal to the number of edges between the source taxa. We examine the behavior of neighbor-joining on distance matrices containing an admixed group, investigating the occurrence of antecedence of clustering, intermediacy of distances, and intermediacy of path lengths. We first mathematically predict the frequency with which the properties are satisfied for a labeled unrooted binary tree selected uniformly at random in the absence of admixture. We then introduce a taxon constructed by a linear admixture of distances from two source taxa, examining three admixture scenarios by simulation: a model in which distance matrices are chosen at random, a model in which an admixed taxon is added to a set of taxa that reflect treelike evolution, and a model that introduces a perturbation of the treelike scenario. In contrast to previous conjectures, we observe that the three properties are sometimes violated by distance matrices that include an admixed taxon. However, we also find that they are satisfied more often than is expected by chance when the distance matrix contains an admixed taxon, especially when evolution among the non-admixed taxa is treelike. The results contribute to a deeper understanding of the nature of evolutionary trees constructed from data that do not necessarily reflect a treelike evolutionary process.

References

  • Aldous DJ (2001) Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Statist Sci 16:23–34
  • Atteson K (1999) The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica 25:251–278
  • Blum MGB, François O (2006) Which random processes describe the tree of life? A large-scale study of phylogenetic tree imbalance. Syst Biol 55:685–691
  • Boca SM, Rosenberg NA (2011) Mathematical properties of \({F}_{st}\) between admixed populations and their parental source populations. Theor Popul Biol 80:208–216
  • Bowcock AM, Kidd JR, Mountain JL, Hebert JM, Carotenuto L, Kidd KK, Cavalli-Sforza LL (1991) Drift, admixture, and selection in human evolution: a study with DNA polymorphisms. Proc Nat Acad Sci USA 88:839–843
  • Bryant D (2005) On the uniqueness of the selection criterion in neighbor-joining. J Classif 22:3–15
  • Buneman P (1974) A note on the metric properties of trees. J Combin Theory Ser B 17:48–50
  • Cavalli-Sforza LL, Menozzi P, Piazza A (1994) The history and geography of human genes. Princeton University Press, Princeton
  • Cueto MA, Matsen FA (2011) Polyhedral geometry of phylogenetic rogue taxa. Bull Math Biol 73:1202–1226
  • Dantzig GB, Eaves BC (1973) Fourier-Motzkin elimination and its dual. J Comb Theory Ser A 14:288–297
  • Davidson R, Rusinko J, Vernon Z, Xi J (2017) Modeling the distribution of distance data in Euclidean space. In: Harrington HA, Omar M, Wright M (eds) Algebraic and geometric methods in discrete mathematics. American Mathematical Society, Providence, pp 117–136
  • Eickmeyer K, Huggins P, Pachter L, Yoshida R (2008) On the optimality of the neighbor-joining algorithm. Algorithms Mol Biol 3:5
  • Eickmeyer K, Yoshida R (2008) The geometry of the neighbor-joining algorithm for small trees. In: Horimoto K, Regensburger G, Rosenkranz M, Yoshida H (eds) Algebraic Biology: AB 2008. Lecture Notes in Computer Science, vol 5147. Springer, Berlin, pp 81–95
  • Felsenstein J (1984) Distance methods for inferring phylogenies: a justification. Evolution 38:16–24
  • Felsenstein J (2004) Inferring phylogenies. Sinauer, Sunderland
  • Gascuel O, Steel M (2006) Neighbor-joining revealed. Mol Biol Evol 23:1997–2000
  • Kopelman NM, Stone L, Gascuel O, Rosenberg NA (2013) The behavior of admixed populations in neighbor-joining inference of population trees. Pacific Symp Biocomput 18:273–284
  • Mountain JL, Cavalli-Sforza LL (1994) Inference of human evolution through cladistic analysis of nuclear DNA restriction polymorphisms. Proc Nat Acad Sci USA 91:6515–6519
  • Nee S (2006) Birth-death models in macroevolution. Annu Rev Ecol Evol Syst 37:1–17
  • Robinson DF, Foulds LR (1981) Comparison of phylogenetic trees. Math Biosci 53:131–147
  • Ruiz-Linares A, Minch E, Meyer D, Cavalli-Sforza LL (1995) Analysis of classical and DNA markers for reconstructing human population history. In: Brenner S, Hanihara K (eds) The origin and past of modern humans as viewed from DNA. World Scientific, Singapore, pp 123–148
  • Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425
  • Sanderson MJ, Shaffer HB (2002) Troubleshooting molecular phylogenetic analyses. Annu Rev Ecol Syst 33:49–72
  • Schrijver A (1986) Theory of linear and integer programming. Wiley, Chichester
  • Steel M (2016) Phylogeny: discrete and random processes in evolution. Society for Industrial and Applied Mathematics, Philadelphia
  • Studier JA, Keppler KJ (1988) A note on the neighbor-joining algorithm of Saitou and Nei. Mol Biol Evol 5:729–731
  • Thomson RC, Shaffer HB (2010) Sparse supermatrices for phylogenetic inference: taxonomy, alignment, rogue taxa, and the phylogeny of living turtles. Syst Biol 59:42–58
  • Westover KM, Rusinko JP, Hoin J, Neal M (2013) Rogue taxa phenomenon: a biological companion to simulation analysis. Mol Phylogenet Evol 69:1–3
  • Yule GU (1925) A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character 213:21–87
  • Ziegler GM (1995) Lectures on polytopes. Springer-Verlag, New York

Acknowledgements

This work has been supported by National Institutes of Health grant R01 GM117590, by National Science Foundation grant BCS-1515127, and by a 2014 Rita Levi Montalcini grant from the Ministero dell’Istruzione, dell’Università e della Ricerca.

Author information

Corresponding author

Correspondence to Jaehee Kim.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2131 KB)

Appendix

A The Q-Criterion

At each agglomerative step, the NJ algorithm evaluates the Q-criterion (Bryant 2005; Gascuel and Steel 2006) and picks a pair of taxa i and j with the minimal q value, \(q_{ij} = (n-2)x_{ij} - \sum _{k=1}^n x_{ik} - \sum _{k=1}^n x_{kj}\). Here, \(i \ne j\), and i and j range from 1 to n. The Q-criterion gives a linear transformation of the original input distance matrix \(\mathbf {D}^{(n)}\).

Capitalizing on the linearity to make use of matrix algebra (Eickmeyer and Yoshida 2008), we write \(\mathbf {Q} = \mathbf {A}^{(n)} \mathbf {x}^{(n)}\), where \(\mathbf {x}^{(n)}\) is the vector of the \(\binom{n}{2}\) pairwise distances. In the matrix \(\mathbf {A}^{(n)}\), a and b represent taxon pairs, ranging from 1 to \(\binom{n}{2}\), and r, \(\ell \), s, and t represent taxa.

$$\begin{aligned} A_{ab}^{(n)} = A_{r \ell , st}^{(n)}= {\left\{ \begin{array}{ll} n-4 &{} \text {if } a=b \\ -\,1 &{} \text {if } a \ne b \text { and } \{r, \ell \} \cap \{s,t\} \ne \emptyset \\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(8)

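For example, for \(n=4\), annotating rows and columns of \(\mathbf {A}^{(4)}\) by the order of the entries in \(\mathbf {x}^{(4)} = [x_{21}, x_{31}, x_{32}, x_{41}, x_{42}, x_{43}]^\intercal \),

$$\begin{aligned} \mathbf {A}^{(4)} = \begin{bmatrix} 0&-1&-1&-1&-1&0 \\ -1&0&-1&-1&0&-1 \\ -1&-1&0&0&-1&-1 \\ -1&-1&0&0&-1&-1 \\ -1&0&-1&-1&0&-1 \\ 0&-1&-1&-1&-1&0 \end{bmatrix}. \end{aligned}$$
(9)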

The process of cherry picking continues until only three nodes are left, at which point the three remaining taxon groups are joined to a shared node. NJ produces a particular tree topology given an input distance matrix \(\mathbf {D}^{(n)}\) if and only if the original distances \(x_{ij}\) satisfy an associated set of linear inequalities defined by the Q-criterion.
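The relationship between the Q-criterion and the matrix \(\mathbf {A}^{(n)}\) of Eq. 8 can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming the pair ordering \(x_{21}, x_{31}, x_{32}, x_{41}, \ldots \); the function names `pair_order`, `q_direct`, and `q_matrix` are hypothetical and introduced here only for illustration.

```python
import numpy as np

def pair_order(n):
    """Taxon pairs (j, i), j > i, ordered as x21, x31, x32, x41, ... (1-based labels)."""
    return [(j, i) for j in range(2, n + 1) for i in range(1, j)]

def q_direct(D):
    """q_ij = (n-2) x_ij - sum_k x_ik - sum_k x_kj for every pair, in pair_order."""
    n = D.shape[0]
    r = D.sum(axis=1)  # row sums of the distance matrix
    return np.array([(n - 2) * D[j - 1, i - 1] - r[i - 1] - r[j - 1]
                     for j, i in pair_order(n)])

def q_matrix(n):
    """A^(n) of Eq. 8: n-4 on the diagonal, -1 for pairs sharing a taxon, 0 otherwise."""
    pairs = pair_order(n)
    A = np.zeros((len(pairs), len(pairs)))
    for a, p in enumerate(pairs):
        for b, q in enumerate(pairs):
            if a == b:
                A[a, b] = n - 4
            elif set(p) & set(q):
                A[a, b] = -1
    return A

# Illustrative random symmetric matrix with positive off-diagonal entries.
rng = np.random.default_rng(0)
n = 5
U = np.triu(rng.uniform(0.1, 1.0, size=(n, n)), 1)
D = U + U.T
x = np.array([D[j - 1, i - 1] for j, i in pair_order(n)])

# The linear form Q = A x agrees with the entrywise definition.
assert np.allclose(q_matrix(n) @ x, q_direct(D))
best_pair = pair_order(n)[int(np.argmin(q_direct(D)))]  # pair joined first by NJ
```

The diagonal entry \(n-4\) arises because \(x_{ij}\) appears once with coefficient \(n-2\) and once in each of the two row sums.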

B Correction to Section 4.1 of Kopelman et al. (2013)

In the case of an additive n-taxon matrix \(\mathbf {D}^{(n)}\) with taxon \(t_n\) admixed between source taxa \(t_1\) and \(t_2\), Section 4.1 of Kopelman et al. (2013) demonstrates that in the NJ tree of all n taxa, (1) taxa \(t_1\), \(t_2\), and \(t_n\) must be collinear, with \(t_n\) between \(t_1\) and \(t_2\). Moreover, they showed that (2) each taxon \(t_i\) for \(i=3,4,\ldots ,n-1\) must be collinear with \(t_1\), \(t_2\), and \(t_n\), with taxon \(t_i\) exterior to the path from \(t_1\) to \(t_n\) to \(t_2\). Finally, they claimed that (3) the NJ tree structure can be characterized by a line from \(t_1\) to \(t_n\) to \(t_2\), with \(t_1\) and \(t_2\) each placed at a multifurcating node to one of which each taxon \(t_3,t_4,\ldots ,t_{n-1}\) must be connected by a single edge (depicted in their Fig. 3b).

We comment here that although points (1) and (2) are correct, the claim (3) does not follow from (1) and (2) as assumed by Kopelman et al. (2013). It is possible to place the various taxa \(t_i\) for \(i=3,4,\ldots ,n-1\) in relation to the line segment \(t_1 - t_n - t_2\) in a manner in which each \(t_i\) is collinear with and exterior to the segment, but the nodes for \(t_1\) and \(t_2\) are not necessarily multifurcating, and the \(t_i\) are not necessarily connected to one of those nodes by only a single edge. It follows from (1) and (2) merely that exterior to the segment \(t_1 - t_n - t_2\) are two possibly but not necessarily multifurcating trees, one rooted at \(t_1\) and the other rooted at \(t_2\), and that each taxon \(t_i\) for \(i=3,4,\ldots ,n-1\) is placed in one of those trees (Fig. 12).

Fig. 12

The structure required for a tree corresponding to an additive n-taxon admixed distance matrix whose distances satisfy Eq. 5. Taxa \(t_1\) and \(t_2\) are source taxa, and \(t_n\) is the admixed taxon. Triangles represent subtrees, one of which contains taxon \(t_3\) either at an internal node or as a leaf node. Admixed taxon \(t_n\) must lie on the path connecting the source taxa, with no other node placed on edges \(t_1 - t_n\) and \(t_2 - t_n\). The remaining taxa \(t_3, t_4, \ldots , t_{n-1}\) can be in any subtrees connected to either \(t_1\) or \(t_2\) by edges external to the path \(t_1 - t_2\)

C Proofs for the 5-Taxon Case

In this appendix, we provide the proofs of Propositions 1, 2, and 3 stated in Sect. 3.1 pertaining to distance matrices with admixture \(\mathbf {D}^{(5)}\) constructed from random source matrices \(\mathbf {S}^{(4)}\).

From the linear mixture model in Eq. 5, the distance matrix for \(n=5\) taxa, including one admixed taxon \(t_5\) and two source taxa \(t_1\) and \(t_2\), is:

$$\begin{aligned} \mathbf {D}^{(5)} = \begin{bmatrix} 0&x_{21}&x_{31}&x_{41}&(1-\alpha )x_{21} \\ x_{21}&0&x_{32}&x_{42}&\alpha x_{21} \\ x_{31}&x_{32}&0&x_{43}&\alpha x_{31} + (1-\alpha ) x_{32} \\ x_{41}&x_{42}&x_{43}&0&\alpha x_{41} + (1-\alpha ) x_{42} \\ (1-\alpha )x_{21}&\alpha x_{21}&\alpha x_{31} + (1-\alpha ) x_{32}&\alpha x_{41} + (1-\alpha ) x_{42}&0 \end{bmatrix}, \end{aligned}$$

with positive non-diagonal entries (\(x_{ij} > 0\) for all i and j with \(i \ne j\)).
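The construction of \(\mathbf {D}^{(5)}\) from a source matrix can be sketched in a few lines. This is an illustrative Python helper, assuming the linear mixture pattern visible in the matrix above (row of the admixed taxon equal to \(\alpha \) times the row of \(t_1\) plus \(1-\alpha \) times the row of \(t_2\)); the name `admix` and its arguments are hypothetical.

```python
import numpy as np

def admix(S, alpha, a=0, b=1):
    """Append an admixed taxon to the symmetric source matrix S.

    The new taxon's distance to each existing taxon is
    alpha * (distance from source a) + (1 - alpha) * (distance from source b).
    Indices a and b are 0-based, so the defaults correspond to t_1 and t_2.
    """
    n = S.shape[0]
    D = np.zeros((n + 1, n + 1))
    D[:n, :n] = S
    new = alpha * S[a] + (1 - alpha) * S[b]
    D[n, :n] = new
    D[:n, n] = new
    return D

# Example source matrix with x21=2, x31=3, x32=4, x41=5, x42=6, x43=7.
S = np.array([[0, 2, 3, 5],
              [2, 0, 4, 6],
              [3, 4, 0, 7],
              [5, 6, 7, 0]], dtype=float)
D5 = admix(S, alpha=0.25)
print(D5[4])  # distances from the admixed taxon t_5
```

Because the diagonal of the source matrix is zero, the same formula automatically yields \((1-\alpha )x_{21}\) and \(\alpha x_{21}\) for the distances from the admixed taxon to \(t_1\) and \(t_2\).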

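Using the matrix representation in Eq. 8, the Q-criterion in the first step in NJ with 5 taxa can be written as \(\mathbf {Q} = \mathbf {A}^{(5)} \mathbf {x}^{(5)}\), where \(\mathbf {A}^{(5)} \in \mathbb {R}^{10 \times 10}\), \(\mathbf {x}^{(5)} \in \mathbb {R}^{10 \times 1}\). We have

$$\begin{aligned} \mathbf {A}^{(5)} = \begin{bmatrix} 1&-1&-1&-1&-1&0&-1&-1&0&0 \\ -1&1&-1&-1&0&-1&-1&0&-1&0 \\ -1&-1&1&0&-1&-1&0&-1&-1&0 \\ -1&-1&0&1&-1&-1&-1&0&0&-1 \\ -1&0&-1&-1&1&-1&0&-1&0&-1 \\ 0&-1&-1&-1&-1&1&0&0&-1&-1 \\ -1&-1&0&-1&0&0&1&-1&-1&-1 \\ -1&0&-1&0&-1&0&-1&1&-1&-1 \\ 0&-1&-1&0&0&-1&-1&-1&1&-1 \\ 0&0&0&-1&-1&-1&-1&-1&-1&1 \end{bmatrix} \end{aligned}$$
(10)

$$\begin{aligned} \mathbf {x}^{(5)} = \begin{bmatrix} x_{21} \\ x_{31} \\ x_{32} \\ x_{41} \\ x_{42} \\ x_{43} \\ x_{51} \\ x_{52} \\ x_{53} \\ x_{54} \end{bmatrix} = \begin{bmatrix} 1&0&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \\ 1-\alpha &0&0&0&0&0 \\ \alpha &0&0&0&0&0 \\ 0&\alpha &1-\alpha &0&0&0 \\ 0&0&0&\alpha &1-\alpha &0 \end{bmatrix} \mathbf {s}^{(4)} = \mathbf {M}\mathbf {s}^{(4)}, \end{aligned}$$
(11)

denoting by \(\mathbf {M}\) the matrix in Eq. 11, with \(\mathbf {s}^{(4)}=[x_{21}, x_{31}, x_{32}, x_{41}, x_{42}, x_{43}]^\intercal \).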

Only six independent variables appear in the source distance vector \(\mathbf {s}^{(4)}\), and we can rewrite the Q-criterion as

$$\begin{aligned} \mathbf {Q}^{(5)}&= [q_{21},q_{31},q_{32},q_{41},q_{42},q_{43},q_{51},q_{52},q_{53},q_{54}]^\intercal \nonumber \\&= \mathbf {A}^{(5)} \mathbf {x}^{(5)} = \mathbf {A}^{(5)} \mathbf {M}\mathbf {s}^{(4)} = \tilde{\mathbf {A}}^{(5)} \mathbf {s}^{(4)}, \end{aligned}$$
(12)

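where

$$\begin{aligned} \tilde{\mathbf {A}}^{(5)} = \mathbf {A}^{(5)} \mathbf {M} = \begin{bmatrix} 0&-1&-1&-1&-1&0 \\ \alpha -2&1-\alpha &\alpha -2&-1&0&-1 \\ -1-\alpha &-1-\alpha &\alpha &0&-1&-1 \\ \alpha -2&-1&0&1-\alpha &\alpha -2&-1 \\ -1-\alpha &0&-1&-1-\alpha &\alpha &-1 \\ 0&-1-\alpha &\alpha -2&-1-\alpha &\alpha -2&1 \\ -2\alpha &-1-\alpha &\alpha -1&-1-\alpha &\alpha -1&0 \\ 2\alpha -2&-\alpha &\alpha -2&-\alpha &\alpha -2&0 \\ -1&\alpha -1&-\alpha &-\alpha &\alpha -1&-1 \\ -1&-\alpha &\alpha -1&\alpha -1&-\alpha &-1 \end{bmatrix} \end{aligned}$$
(13)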

and \(\tilde{\mathbf {a}}_{\varvec{i}}^{(5)}\)\((i=1, \ldots , 10)\) is the ith row vector in \(\tilde{\mathbf {A}}^{(5)}\). The pair that is agglomerated together is the pair corresponding to the row with the minimal value in \(\mathbf {Q}^{(5)}\).
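The product \(\tilde{\mathbf {A}}^{(5)} = \mathbf {A}^{(5)} \mathbf {M}\) can be formed numerically for any fixed \(\alpha \). The sketch below is illustrative only: it assumes the pair ordering of \(\mathbf {Q}^{(5)}\) in Eq. 12 and builds \(\mathbf {A}^{(5)}\) from Eq. 8 and \(\mathbf {M}\) from the entries of \(\mathbf {D}^{(5)}\); the names `q_matrix_5`, `mixing_matrix`, and `first_pair` are hypothetical.

```python
import numpy as np

# Pair labels in the order x21, x31, x32, x41, x42, x43, x51, x52, x53, x54.
PAIRS = [(j, i) for j in range(2, 6) for i in range(1, j)]

def q_matrix_5():
    """A^(5) from Eq. 8 with n = 5: diagonal n-4 = 1, -1 if pairs share a taxon."""
    A = np.zeros((10, 10))
    for a, p in enumerate(PAIRS):
        for b, q in enumerate(PAIRS):
            A[a, b] = 1 if a == b else (-1 if set(p) & set(q) else 0)
    return A

def mixing_matrix(alpha):
    """M such that x^(5) = M s^(4), with s^(4) = [x21, x31, x32, x41, x42, x43]."""
    M = np.zeros((10, 6))
    M[:6, :6] = np.eye(6)                  # source distances are copied unchanged
    M[6, 0] = 1 - alpha                    # x51 = (1 - alpha) x21
    M[7, 0] = alpha                        # x52 = alpha x21
    M[8, 1], M[8, 2] = alpha, 1 - alpha    # x53 = alpha x31 + (1 - alpha) x32
    M[9, 3], M[9, 4] = alpha, 1 - alpha    # x54 = alpha x41 + (1 - alpha) x42
    return M

def first_pair(s, alpha):
    """Pair with the minimal Q-value in the first NJ step, and the full Q vector."""
    q = q_matrix_5() @ mixing_matrix(alpha) @ np.asarray(s, dtype=float)
    return PAIRS[int(np.argmin(q))], q

A_tilde = q_matrix_5() @ mixing_matrix(0.3)
pair, q = first_pair([2, 3, 4, 5, 6, 7], alpha=0.3)
```

Working with the \(10 \times 6\) matrix rather than the full \(10 \times 10\) system makes explicit that every first-step comparison is a linear inequality in the six source distances.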

Proof of Proposition 1

We must show that \((t_1,t_2)\), \((t_3,t_5)\), and \((t_4,t_5)\) cannot be the minimum for the Q-criterion in the first step of NJ. We begin with \((t_1, t_2)\). By contradiction, suppose that \((t_1, t_2)\) has the minimum Q-criterion. This means that \(q_{21} = \tilde{\mathbf {a}}_{1}^{(5)} \cdot \mathbf {s}^{(4)}\) is less than or equal to \(q_{\ell m} = \tilde{\mathbf {a}}_{\varvec{i}}^{(5)} \cdot \mathbf {s}^{(4)}\) for all \(i=2,3,\ldots ,10\). In other words,

$$\begin{aligned} (\tilde{\mathbf {a}}_{1}^{(5)} - \tilde{\mathbf {a}}_{\varvec{i}}^{(5)}) \cdot \mathbf {s}^{(4)} \le 0. \end{aligned}$$
(14)

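This linear system of 9 inequalities in the 6 entries of \(\mathbf {s}^{(4)}\) can be represented in matrix form as \(\mathbf {B}^{(5)} \mathbf {s}^{(4)} \le \mathbf {0}\), where \(\mathbf {0} = [0,\ldots , 0]^\intercal \in \mathbb {R}^{9 \times 1}\) and the ith row of the \(9 \times 6\) matrix \(\mathbf {B}^{(5)}\) is \(\tilde{\mathbf {a}}_{1}^{(5)} - \tilde{\mathbf {a}}_{\varvec{i+1}}^{(5)}\):

$$\begin{aligned} \mathbf {B}^{(5)} = \begin{bmatrix} 2-\alpha &\alpha -2&1-\alpha &0&-1&1 \\ 1+\alpha &\alpha &-1-\alpha &-1&0&1 \\ 2-\alpha &0&-1&\alpha -2&1-\alpha &1 \\ 1+\alpha &-1&0&\alpha &-1-\alpha &1 \\ 0&\alpha &1-\alpha &\alpha &1-\alpha &-1 \\ 2\alpha &\alpha &-\alpha &\alpha &-\alpha &0 \\ 2-2\alpha &\alpha -1&1-\alpha &\alpha -1&1-\alpha &0 \\ 1&-\alpha &\alpha -1&\alpha -1&-\alpha &1 \\ 1&\alpha -1&-\alpha &-\alpha &\alpha -1&1 \end{bmatrix}. \end{aligned}$$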

We use Fourier–Motzkin elimination (FME, Dantzig and Eaves 1973; Schrijver 1986; Ziegler 1995) to prove that this system of linear inequalities \(\mathbf {B}^{(5)} \mathbf {s}^{(4)} \le \mathbf {0}\) with the constraint \(\mathbf {s}^{(4)} > \mathbf {0}\) has no solution, thereby showing that \((t_1, t_2)\) cannot have the lowest Q-criterion and hence cannot be the first pair to agglomerate in the NJ algorithm.

In FME, we sequentially eliminate variables from the system of linear inequalities, at each step transforming the system into a new system with an equivalent solution space. Eventually, either we eliminate all variables and reach a set of trivially satisfied inequalities involving only constants, in which case the system has solutions, or we reach an inequality with no solution, in which case the system has no solution.

Briefly, consider a system of inequalities \(\mathbf {Bx} \le \mathbf {c}\), where \(\mathbf {B} \in \mathbb {R}^{m \times n}\), \(\mathbf {x} \in \mathbb {R}^{n \times 1}\), and \(\mathbf {c} \in \mathbb {R}^{m \times 1}\). Each row of \(\mathbf {B}\) represents an inequality, and the solution space for the system is \(\mathcal {P}=\{\mathbf {x} \mid \mathbf {Bx} \le \mathbf {c}\} \subset \mathbb {R}^n\). FME eliminates the variable \(x_k\) by performing row operations on the augmented matrix \((\mathbf {B} \mid \mathbf {c})\). In Step 1, we rearrange the inequalities in \(\mathbf {Bx} \le \mathbf {c}\) into three sets, based on the signs of the kth column: \(I_+ = \{i \in \{1,2,\ldots , m\} \mid b_{ik} > 0 \}\), \(I_- = \{i \in \{1,2,\ldots , m\} \mid b_{ik} < 0 \}\), and \(I_0 = \{i \in \{1,2,\ldots , m\} \mid b_{ik} = 0\}\).

In Step 2, for each \(i \in I_+ \cup I_-\), we divide each row \((b_{i1}, \ldots , b_{in} \mid c_i)\) by \(|b_{ik}|\) so that elements in column k of \((\mathbf {B} \mid \mathbf {c})\) are either \(1, -1\), or 0. In Step 3, we then define a new set of inequalities:

$$\begin{aligned} \frac{1}{b_{rk}}\sum _{i=1}^{n}b_{ri}x_i + \frac{1}{|b_{sk}|}\sum _{i=1}^{n}b_{si}x_i&\le \frac{1}{b_{rk}}c_r + \frac{1}{|b_{sk}|}c_s \quad \quad&(r,s)&\in I_+ \times I_- \\ \sum _{i=1}^{n}b_{\ell i}x_i&\le c_\ell \quad \quad&\ell&\in I_0 . \end{aligned}$$

Because the coefficient of \(x_k\) is normalized to \(+1\) for row r and \(-1\) for row s, the combined new set of inequalities does not have \(x_k\). In other words, column k has zeroes as entries. We now have a total of \(|I_+||I_-| + |I_0|\) new inequalities in \(x_1, x_2, \ldots , x_{k-1}, x_{k+1}, \ldots , x_n\). The system of inequalities has solutions if and only if FME can successively eliminate all the variables without generating a contradiction. Otherwise, if the system is inconsistent at one point during the elimination, then it has no solutions.
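The three steps of a single elimination can be sketched compactly. The following Python function is a minimal illustration of one FME step under the conventions above (each row stores the coefficients followed by the right-hand side); the name `fme_eliminate` is hypothetical, exact arithmetic with `Fraction` is a choice made here for clarity, and no redundancy removal is attempted.

```python
from fractions import Fraction

def fme_eliminate(rows, k):
    """One Fourier-Motzkin step: eliminate variable k.

    Each row is [b_1, ..., b_n, c] and encodes sum_i b_i x_i <= c.
    Entries are expected to be ints or Fractions.
    """
    pos = [r for r in rows if r[k] > 0]    # I_+
    neg = [r for r in rows if r[k] < 0]    # I_-
    zero = [r for r in rows if r[k] == 0]  # I_0
    new_rows = [list(r) for r in zero]     # rows without x_k are kept as they are
    for r in pos:
        for s in neg:
            # Normalize the x_k coefficients to +1 and -1, then add the rows.
            combined = [Fraction(a) / r[k] + Fraction(b) / abs(s[k])
                        for a, b in zip(r, s)]
            new_rows.append(combined)
    return new_rows

# Example: x + y <= 4, -x + y <= 0, -y <= 0; eliminate x (column 0).
system = [[1, 1, 4], [-1, 1, 0], [0, -1, 0]]
reduced = fme_eliminate(system, 0)
print(reduced)  # rows in y only: -y <= 0 and 2y <= 4
```

In general the number of rows can grow quickly, since each step produces \(|I_+||I_-| + |I_0|\) inequalities; for the small systems considered here this growth is not an obstacle.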

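We first seek to eliminate the third variable, \(x_{32}\), represented in the third column of \(\mathbf {B}^{(5)}\). Note that \(0< \alpha < 1\) and \(0< 1-\alpha < 1\). The rearrangement of the rows by the sign of the coefficient of \(x_{32}\) gives \(I_+ = \{1,5,7\}\), \(I_- = \{2,3,6,8,9\}\), and \(I_0 = \{4\}\).

Once the rows have been rearranged, we normalize each row with a nonzero entry for \(x_{32}\) by the absolute value of the entry for \(x_{32}\). Considering linear combinations of rows in \(I_{+}\) and \(I_{-}\), we then have a new system of \(3 \cdot 5 + 1 = 16\) inequalities, \(\mathbf {B}^{(5)}_{\text {step}3} \mathbf {s}^{(4)} \le \mathbf {0}\). With the pairs \((r,s) \in I_+ \times I_-\) ordered by r and then by s, the 13th row combines rows 7 and 6 of \(\mathbf {B}^{(5)}\):

$$\begin{aligned} \frac{1}{1-\alpha }\big (\tilde{\mathbf {a}}_{1}^{(5)} - \tilde{\mathbf {a}}_{8}^{(5)}\big ) + \frac{1}{\alpha }\big (\tilde{\mathbf {a}}_{1}^{(5)} - \tilde{\mathbf {a}}_{7}^{(5)}\big ) = [2,-1,1,-1,1,0] + [2,1,-1,1,-1,0] = [4,0,0,0,0,0]. \end{aligned}$$

If \(\mathbf {B}^{(5)} \mathbf {s}^{(4)} \le \mathbf {0}\), then \(\mathbf {B}^{(5)}_{\text {step}3} \mathbf {s}^{(4)} \le \mathbf {0}\), and the 13th row then gives \(4x_{21} \le 0\). However, this inequality contradicts our assumption that all off-diagonal entries of distance matrices are positive, by which we must have \(x_{21} > 0\).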
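The source of this contradiction can also be read directly from the Q-values of the pairs \((t_1,t_5)\) and \((t_2,t_5)\). The following is a sketch computed from the entries of \(\mathbf {D}^{(5)}\) and the definition of \(q_{ij}\) with \(n=5\); it is meant as a hand check on the elimination, not as a replacement for it.

$$\begin{aligned} q_{21}&= -(x_{31}+x_{32}+x_{41}+x_{42}), \\ q_{51}&= -2\alpha x_{21} - (1+\alpha )(x_{31}+x_{41}) - (1-\alpha )(x_{32}+x_{42}), \\ q_{52}&= -2(1-\alpha ) x_{21} - \alpha (x_{31}+x_{41}) - (2-\alpha )(x_{32}+x_{42}). \end{aligned}$$

If \(q_{21}\) were minimal, then in particular

$$\begin{aligned} q_{21} - q_{51}&= \alpha \big [2x_{21} + (x_{31}+x_{41}) - (x_{32}+x_{42})\big ] \le 0, \\ q_{21} - q_{52}&= (1-\alpha )\big [2x_{21} - (x_{31}+x_{41}) + (x_{32}+x_{42})\big ] \le 0. \end{aligned}$$

Dividing the first inequality by \(\alpha > 0\) and the second by \(1-\alpha > 0\) and adding the results gives \(4x_{21} \le 0\), which is impossible for \(x_{21} > 0\). Informally, the admixed taxon is always a better partner for at least one of the two source taxa than the source taxa are for each other.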

Similarly, we can show that pairs \((t_3, t_5)\) and \((t_4, t_5)\) cannot have the minimum Q-value and thus do not form cherries in the first step of NJ. For \((t_3,t_5)\), we start with a system of inequalities \((\tilde{\mathbf {a}}_{9}^{(5)} - \tilde{\mathbf {a}}_{\varvec{i}}^{(5)}) \cdot \mathbf {s}^{(4)} \le 0\) (\(i \ne 9\)) and eliminate variable \(x_{41}\) to reach the contradiction \(x_{21} \le 0\). For \((t_4,t_5)\), we start with a system of inequalities \((\tilde{\mathbf {a}}_{10}^{(5)} - \tilde{\mathbf {a}}_{\varvec{i}}^{(5)}) \cdot \mathbf {s}^{(4)} \le 0\) (\(i \ne 10\)) and eliminate variable \(x_{32}\) to reach the contradiction \(x_{21} \le 0\).

\(\square \)

Note that the fact that \((t_1, t_2)\), \((t_3, t_5)\), and \((t_4, t_5)\) cannot form cherries in the first step does not imply that they cannot form cherries in the final NJ tree, as they can agglomerate in subsequent iterations of the NJ algorithm. In Proposition 2, however, we show that the final NJ tree of \(\mathbf {D}^{(5)}\) cannot have \((t_1, t_2)\) as a cherry at all.
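Proposition 1 lends itself to a quick numerical sanity check. The sketch below assumes source matrices with independent uniform off-diagonal entries and a uniform \(\alpha \), which is an illustrative sampling scheme chosen here and not necessarily the simulation design of Sect. 3.1; the function names are hypothetical.

```python
import numpy as np

def admixed_matrix(S, alpha):
    """5x5 matrix with t_5 a linear mixture of t_1 and t_2 (rows 0 and 1 of S)."""
    D = np.zeros((5, 5))
    D[:4, :4] = S
    D[4, :4] = D[:4, 4] = alpha * S[0] + (1 - alpha) * S[1]
    return D

def first_nj_pair(D):
    """1-based pair (i, j), i < j, minimizing the Q-criterion."""
    n = D.shape[0]
    r = D.sum(axis=1)
    Q = (n - 2) * D - r[:, None] - r[None, :]
    np.fill_diagonal(Q, np.inf)  # exclude i == j from the search
    i, j = np.unravel_index(np.argmin(Q), Q.shape)
    return int(min(i, j)) + 1, int(max(i, j)) + 1

def observed_first_pairs(trials=5000, seed=1):
    """Set of first-step pairs seen over random admixed matrices."""
    rng = np.random.default_rng(seed)
    seen = set()
    for _ in range(trials):
        U = np.triu(rng.uniform(0.01, 1.0, (4, 4)), 1)
        S = U + U.T  # random positive symmetric source matrix
        alpha = rng.uniform(0.01, 0.99)
        seen.add(first_nj_pair(admixed_matrix(S, alpha)))
    return seen

seen = observed_first_pairs()
excluded = {(1, 2), (3, 5), (4, 5)}
print(sorted(seen), "excluded pairs observed:", sorted(seen & excluded))
```

A check of this kind cannot replace the proof, but it is a cheap way to catch an indexing or sign error when reimplementing the Q-criterion.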

Proof of Proposition 2

In Sect. 3.1, our simulations found that the three unrooted labeled binary tree topologies containing the source taxa \(t_1\) and \(t_2\) as a cherry are unattainable by NJ inference when distance matrices of the form of Eq. 6 are used. Such topologies are \(((t_4,t_5),t_3,(t_1,t_2))\), \(((t_3,t_5),t_4,(t_1,t_2))\), and \(((t_3,t_4),t_5,(t_1,t_2))\).

Because \((t_1, t_2)\), \((t_3, t_5)\) and \((t_4, t_5)\) cannot form a cherry in the first step of cherry picking (Proposition 1), we have so far shown that topologies \(((t_4,t_5),t_3,(t_1,t_2))\) and \(((t_3,t_5),t_4,(t_1,t_2))\) cannot be the final NJ tree. To show that \((t_1,t_2)\) is not a cherry in the NJ tree and therefore to prove the proposition, it remains to show that \(((t_3,t_4),t_5,(t_1,t_2))\), the last remaining topology containing \((t_1, t_2)\) as a cherry, is not accessible.

Consider a case in which pair \((t_3,t_4)\) agglomerates in the first iteration and we are left with four unclustered nodes, \(\{t_1, t_2, t_5, t_6 \}\). Here, \(t_6\) represents a clade containing \((t_3,t_4)\). Distances between the new node \(t_6\) and the remaining taxa are:

$$\begin{aligned} x_{6i} = \frac{1}{2}(x_{3i} + x_{4i} - x_{43}), \, i\in \{1,2,5\}. \end{aligned}$$
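The distance update after an agglomeration can be written as a small helper. This is a minimal sketch of the reduction formula above for a general pair, with the new node placed last; the name `reduce_after_join` and the 0-based indexing are choices made here for illustration.

```python
import numpy as np

def reduce_after_join(D, f, g):
    """Join taxa f and g (0-based) into a new node and return the reduced matrix.

    The distance from the new node u to each remaining taxon k is
    (D[f, k] + D[g, k] - D[f, g]) / 2. The new node is the last row and column.
    Also returns the list of original indices that were kept, in order.
    """
    keep = [k for k in range(D.shape[0]) if k not in (f, g)]
    m = len(keep)
    R = np.zeros((m + 1, m + 1))
    R[:m, :m] = D[np.ix_(keep, keep)]
    d_new = 0.5 * (D[f, keep] + D[g, keep] - D[f, g])
    R[m, :m] = R[:m, m] = d_new
    return R, keep

# Example: five points on a line at positions 0..4; join t_3 and t_4 (indices 2, 3).
pos = np.arange(5, dtype=float)
D = np.abs(pos[:, None] - pos[None, :])
R, keep = reduce_after_join(D, 2, 3)
print(keep)   # original indices of the remaining taxa
print(R[-1])  # distances from the new node
```

With \(f\) and \(g\) set to the indices of \(t_3\) and \(t_4\), the last row of the result holds \(x_{61}\), \(x_{62}\), and \(x_{65}\).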

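These distances can be written in matrix form as \(\tilde{\mathbf {x}} = \mathbf {R}\mathbf {x}^{(5)}\), where \(\tilde{\mathbf {x}} = [x_{21}, x_{51}, x_{52}, x_{61}, x_{62}, x_{65}]^\intercal \) and

$$\begin{aligned} \mathbf {R} = \frac{1}{2} \begin{bmatrix} 2&0&0&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&2&0&0&0 \\ 0&0&0&0&0&0&0&2&0&0 \\ 0&1&0&1&0&-1&0&0&0&0 \\ 0&0&1&0&1&-1&0&0&0&0 \\ 0&0&0&0&0&-1&0&0&1&1 \end{bmatrix}. \end{aligned}$$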

Using the matrix representation in Eq. 8, the Q-criterion for the four remaining nodes is

$$\begin{aligned} \mathbf {Q}^{(4)}&= [q_{21},q_{51},q_{52},q_{61},q_{62},q_{65}]^\intercal \\&= \mathbf {A}^{(4)} \tilde{\mathbf {x}} = \mathbf {A}^{(4)} \mathbf {R}\mathbf {x}^{(5)}= \mathbf {A}^{(4)} \mathbf {R}\mathbf {M}\mathbf {s}^{(4)} = \tilde{\mathbf {A}}^{(4)} \mathbf {s}^{(4)}, \end{aligned}$$

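where

$$\begin{aligned} \tilde{\mathbf {A}}^{(4)} = \mathbf {A}^{(4)} \mathbf {R}\mathbf {M} = \begin{bmatrix} -1&-\frac{1}{2}&-\frac{1}{2}&-\frac{1}{2}&-\frac{1}{2}&1 \\ -(1+\alpha )&-\frac{1+\alpha }{2}&-\frac{1-\alpha }{2}&-\frac{1+\alpha }{2}&-\frac{1-\alpha }{2}&1 \\ -(2-\alpha )&-\frac{\alpha }{2}&-\frac{2-\alpha }{2}&-\frac{\alpha }{2}&-\frac{2-\alpha }{2}&1 \\ -(2-\alpha )&-\frac{\alpha }{2}&-\frac{2-\alpha }{2}&-\frac{\alpha }{2}&-\frac{2-\alpha }{2}&1 \\ -(1+\alpha )&-\frac{1+\alpha }{2}&-\frac{1-\alpha }{2}&-\frac{1+\alpha }{2}&-\frac{1-\alpha }{2}&1 \\ -1&-\frac{1}{2}&-\frac{1}{2}&-\frac{1}{2}&-\frac{1}{2}&1 \end{bmatrix}. \end{aligned}$$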

As before, \(\tilde{\mathbf {a}}_{\varvec{i}}^{(4)}\)\((i=1, \ldots , 6)\) is defined as the ith row vector in \(\tilde{\mathbf {A}}^{(4)}\). Because \(\tilde{\mathbf {a}}_{\varvec{i}}^{(4)} = \tilde{\mathbf {a}}_{\varvec{7-i}}^{(4)}\) (\(i=1,2,3\)), pairs \(\{(t_2,t_1),(t_6, t_5)\}\) have the same Q-criterion, as do the pairs \(\{(t_5,t_1),(t_6, t_2)\}\) and \(\{(t_5,t_2),(t_6, t_1)\}\). To prove that \(t_1\) and \(t_2\) cannot form a cherry, it suffices to show that the Q-criterion for \((t_1, t_2)\), \(q_{21} = \tilde{\mathbf {a}}_{\varvec{1}}^{(4)} \cdot \mathbf {s}^{(4)}\), is not minimum, so that either \(\tilde{\mathbf {a}}_{2}^{(4)} \cdot \mathbf {s}^{(4)} < \tilde{\mathbf {a}}_{\varvec{1}}^{(4)} \cdot \mathbf {s}^{(4)}\) or \(\tilde{\mathbf {a}}_{3}^{(4)} \cdot \mathbf {s}^{(4)} < \tilde{\mathbf {a}}_{\varvec{1}}^{(4)} \cdot \mathbf {s}^{(4)}\) must hold. Equivalently, one of the following pair of inequalities must hold:

$$\begin{aligned} (\tilde{\mathbf {a}}_{2}^{(4)} - \tilde{\mathbf {a}}_{\varvec{1}}^{(4)}) \cdot \mathbf {s}^{(4)}&< 0 \nonumber \\ (\tilde{\mathbf {a}}_{3}^{(4)} - \tilde{\mathbf {a}}_{\varvec{1}}^{(4)}) \cdot \mathbf {s}^{(4)}&< 0 . \end{aligned}$$
(15)

By the assumption that \((t_3,t_4)\) is the cherry from the first iteration of the NJ algorithm, the Q-criterion for \((t_3, t_4)\), \(q_{43} = \tilde{\mathbf {a}}_{\varvec{6}}^{(5)} \cdot \mathbf {s}^{(4)}\), from the first iteration of the NJ algorithm must be minimum:

$$\begin{aligned} (\tilde{\mathbf {a}}_{6}^{(5)} - \tilde{\mathbf {a}}_{\varvec{i}}^{(5)}) \cdot \mathbf {s}^{(4)} \le 0, \end{aligned}$$
(16)

for all i from 1 to 10 (\(i \ne 6\)) and \(\tilde{\mathbf {a}}_{\varvec{i}}^{(5)}\) as defined in Eq. 13.

To show that Eq. 15 follows from Eq. 16, suppose to the contrary that \((t_1, t_2)\) has the minimal Q-criterion in the second iteration of the NJ algorithm, so that both of the following hold:

$$\begin{aligned} (\tilde{\mathbf {a}}_{1}^{(4)} - \tilde{\mathbf {a}}_{\varvec{2}}^{(4)}) \cdot \mathbf {s}^{(4)}&\le 0 \nonumber \\ (\tilde{\mathbf {a}}_{1}^{(4)} - \tilde{\mathbf {a}}_{\varvec{3}}^{(4)}) \cdot \mathbf {s}^{(4)}&\le 0. \end{aligned}$$
(17)

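We then have a system of 11 linear inequalities that can be represented in matrix form, \(\mathbf {B}^{(4)} \mathbf {s}^{(4)} \le \mathbf {0}\), where the rows of \(\mathbf {B}^{(4)}\) are the nine vectors \(\tilde{\mathbf {a}}_{6}^{(5)} - \tilde{\mathbf {a}}_{\varvec{i}}^{(5)}\) (\(i \ne 6\)) from Eq. 16 together with the two vectors \(\tilde{\mathbf {a}}_{1}^{(4)} - \tilde{\mathbf {a}}_{2}^{(4)}\) and \(\tilde{\mathbf {a}}_{1}^{(4)} - \tilde{\mathbf {a}}_{3}^{(4)}\) from Eq. 17.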

We apply FME to \(\mathbf {B}^{(4)}\), noting again that \(0< \alpha < 1\). After eliminating variable \(x_{31}\), we reach the inequality \(x_{21} \le 0\), contradicting the assumption that \(x_{21} > 0\). This proves that the pair \((t_1,t_2)\) cannot have the minimal Q-criterion in the second NJ clustering step when \((t_3,t_4)\) forms a cherry in the first NJ clustering step, completing the proof of Proposition 2. \(\square \)
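The contradiction in the second step can also be traced by hand. The following sketch uses only the reduced distances \(x_{6i}\) defined above and the entries \(x_{51} = (1-\alpha )x_{21}\), \(x_{52} = \alpha x_{21}\), \(x_{53}\), and \(x_{54}\) of \(\mathbf {D}^{(5)}\); the shorthand \(\varDelta \) is introduced here for illustration only. For the four nodes \(\{t_1, t_2, t_5, t_6\}\), the Q-criterion with \(n=4\) gives

$$\begin{aligned} q_{21}&= -(x_{51}+x_{52}+x_{61}+x_{62}), \\ q_{51}&= -(x_{21}+x_{52}+x_{61}+x_{65}), \\ q_{52}&= -(x_{21}+x_{51}+x_{62}+x_{65}). \end{aligned}$$

Writing \(\varDelta = x_{31} - x_{32} + x_{41} - x_{42}\), we have \(x_{65} - x_{62} = \alpha \varDelta /2\) and \(x_{65} - x_{61} = -(1-\alpha )\varDelta /2\), so that the two inequalities of Eq. 17 become

$$\begin{aligned} q_{21} - q_{51}&= \alpha \left( x_{21} + \tfrac{1}{2}\varDelta \right) \le 0, \\ q_{21} - q_{52}&= (1-\alpha )\left( x_{21} - \tfrac{1}{2}\varDelta \right) \le 0. \end{aligned}$$

Dividing by \(\alpha \) and \(1-\alpha \), respectively, and adding yields \(2x_{21} \le 0\), in agreement with the contradiction \(x_{21} \le 0\) reached by FME.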

Proof of Proposition 3

We must show that if a source distance matrix generates topology \(((t_1,t_2),(t_3,t_4))\), then its corresponding admixed neighbor-joining tree satisfies Properties 1 and 3 and contains \((t_1,t_5)\) or \((t_2,t_5)\) as a cherry.

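The Q-criterion for four taxa \(\{t_1, t_2, t_3, t_4\}\) is \( \mathbf {Q}^{(4)} = \mathbf {A}^{(4)} \mathbf {s}^{(4)}\), with \(\mathbf {A}^{(4)}\) as in Eq. 9 and \(\mathbf {a}_{\varvec{i}}^{(4)}\) denoting its ith row.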

Because \(\mathbf {s}^{(4)} \in D_{((t_1,t_2),(t_3,t_4))}^{(4)}\), the Q-criterion \(q_{21}=\mathbf {a}_{\mathbf {1}}^{(4)} \cdot \mathbf {s}^{(4)}\) for \((t_1,t_2)\), or equivalently for \((t_3,t_4)\), \(q_{43}=\mathbf {a}_{\mathbf {6}}^{(4)} \cdot \mathbf {s}^{(4)}\), is minimal. That is, for \(i=2,3\),

$$\begin{aligned} (\mathbf {a}_{\varvec{1}}^{(4)} - \mathbf {a}_{\varvec{2}}^{(4)}) \cdot \mathbf {s}^{(4)}&\le 0 \end{aligned}$$
(18)
$$\begin{aligned} (\mathbf {a}_{\varvec{1}}^{(4)} - \mathbf {a}_{\varvec{3}}^{(4)}) \cdot \mathbf {s}^{(4)}&\le 0. \end{aligned}$$
(19)

Consider an admixed distance matrix \(\mathbf {D}^{(5)}\) constructed from source distance matrix \(\mathbf {S}^{(4)}\). The Q-criterion for pairs in \(\mathbf {D}^{(5)}\) follows Eq. 12. Proposition 1 excludes three pairs from being the first cherry for any input admixed distance matrix, so we are left with 7 potential pairs for the first cherry: \((t_1,t_3)\), \((t_2,t_3)\), \((t_1,t_4)\), \((t_2,t_4)\), \((t_3,t_4)\), \((t_1,t_5)\), and \((t_2,t_5)\). We claim that the first four pairs, \((t_1,t_3)\), \((t_2,t_3)\), \((t_1,t_4)\), and \((t_2,t_4)\), cannot cluster in the first iteration of the NJ algorithm when \(\mathbf {s}^{(4)} \in D_{((t_1,t_2),(t_3,t_4))}^{(4)}\).

Suppose for contradiction that \((t_1,t_3)\) is the first pair to agglomerate in constructing \(\mathcal {T}_{D}^{(5)}\), so that the Q-criterion for \((t_1,t_3)\), \(q_{31} = \tilde{\mathbf {a}}_{2}^{(5)} \cdot \mathbf {s}^{(4)}\), is minimal when \(\mathbf {s}^{(4)} \in D_{((t_1,t_2),(t_3,t_4))}^{(4)}\). Using Eqs. 12 and 13, for all i from 1 to 10 (\(i \ne 2\)), the following must hold:

$$\begin{aligned} (\tilde{\mathbf {a}}_{2}^{(5)} - \tilde{\mathbf {a}}_{\varvec{i}}^{(5)}) \cdot \mathbf {s}^{(4)} \le 0. \end{aligned}$$

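Considering these inequalities along with Eqs. 18 and 19, we have a linear system of 11 inequalities in the \(x_{ij}\), \(\mathbf {B}^{(5)} \mathbf {s}^{(4)} \le \mathbf {0}\), where the rows of \(\mathbf {B}^{(5)}\) are now the nine vectors \(\tilde{\mathbf {a}}_{2}^{(5)} - \tilde{\mathbf {a}}_{\varvec{i}}^{(5)}\) (\(i \ne 2\)) together with the two vectors \(\mathbf {a}_{\varvec{1}}^{(4)} - \mathbf {a}_{\varvec{2}}^{(4)}\) and \(\mathbf {a}_{\varvec{1}}^{(4)} - \mathbf {a}_{\varvec{3}}^{(4)}\).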

Application of the FME procedure to \(\mathbf {B}^{(5)}\) to eliminate \(x_{31}\) and \(x_{41}\) results in \(x_{21} \le 0\), a contradiction. Therefore, \((t_1, t_3)\) cannot be the first cherry of \(\mathcal {T}_{D}^{(5)}\).

It follows by symmetry that \((t_2, t_3)\), \((t_1, t_4)\) and \((t_2, t_4)\) cannot cluster in the first step of the NJ algorithm when an inferred source NJ tree of \(\mathbf {S}^{(4)}\) has topology \(((t_1,t_2),(t_3,t_4))\). In the 4-taxon source distance matrix \(\mathbf {S}^{(4)}\) with \(t_1\) and \(t_2\) being source populations from which the admixed taxon \(t_5\) is created, the roles of \(t_1\) and \(t_2\) are interchangeable, as are the roles of \(t_3\) and \(t_4\).

We have so far proven that \((t_3,t_4)\), \((t_1,t_5)\), and \((t_2,t_5)\) are the only possible clusters in the first step of the construction of \(\mathcal {T}_{D}^{(5)}\) when \(\mathbf {s}^{(4)} \in D_{((t_1,t_2),(t_3,t_4))}^{(4)}\). If the pair \((t_3,t_4)\) clusters first, then four nodes, \(\{t_1, t_2, t_5, t_{34}\}\), are left to join in the second iteration. Because Proposition 2 says \((t_1,t_2)\) cannot be a cherry in any \(\mathcal {T}_{D}^{(5)}\), the only possible NJ tree topologies are \(((t_1,t_5),(t_2,t_{34}))\) and \(((t_2,t_5),(t_1,t_{34}))\). The node \(t_{34}\) represents the cluster \((t_3, t_4)\), so the final topologies are \(((t_1,t_5),t_2,(t_3, t_4))\) and \(((t_2,t_5),t_1,(t_3, t_4))\).

If a pair \((t_1,t_5)\) clusters first, then four nodes, \(\{t_2, t_3, t_4, t_{15}\}\), are left to join in the second iteration. The possible NJ tree topologies are \(((t_{15},t_2),(t_3,t_4))\), \(((t_{15},t_3),(t_2, t_4))\) and \(((t_{15},t_4),(t_2,t_3))\). Because the node \(t_{15}\) represents the cluster \((t_1, t_5)\), the final topologies are \(((t_1,t_5),t_2,(t_3, t_4))\), \(((t_1,t_5),t_3,(t_2, t_4))\) and \(((t_1,t_5),t_4,(t_2, t_3))\). By the same argument, possible final topologies when a pair \((t_2,t_5)\) is the first cherry are \(((t_2,t_5),t_1,(t_3, t_4))\), \(((t_2,t_5),t_3,(t_1, t_4))\), and \(((t_2,t_5),t_4,(t_1, t_3))\).

All 6 topologies of \(\mathcal {T}_{D}^{(5)}\) from three possible choices for the first cluster satisfy Property 3, and the procedures for their construction comply with Property 1. Also, they contain a cherry involving one of the source taxa and the admixed taxon. \(\square \)
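Propositions 2 and 3 can be spot-checked together by running both NJ steps on random admixed matrices. The sketch below is illustrative: it assumes uniformly sampled positive source distances and a uniform \(\alpha \) (a sampling scheme chosen here, not necessarily that of Sect. 3.1), and the helper names are hypothetical. For five taxa, the first join and the split chosen in the second step determine the unrooted topology and hence its two cherries.

```python
import numpy as np

def q_argmin(D):
    """0-based pair (i, j), i < j, with the smallest Q-criterion value."""
    n = D.shape[0]
    r = D.sum(axis=1)
    Q = (n - 2) * D - r[:, None] - r[None, :]
    np.fill_diagonal(Q, np.inf)
    i, j = np.unravel_index(np.argmin(Q), Q.shape)
    return int(min(i, j)), int(max(i, j))

def nj_cherries_5(D):
    """The two cherries (frozensets of 1-based labels) of the NJ tree of a 5x5 matrix."""
    f, g = q_argmin(D)
    first = frozenset({f + 1, g + 1})
    keep = [k for k in range(5) if k not in (f, g)]
    R = np.zeros((4, 4))
    R[:3, :3] = D[np.ix_(keep, keep)]
    R[3, :3] = R[:3, 3] = 0.5 * (D[f, keep] + D[g, keep] - D[f, g])
    a, b = q_argmin(R)
    rest = [k for k in range(4) if k not in (a, b)]
    # The second join splits the four nodes 2 + 2; the side with two leaves is a cherry.
    side = (a, b) if 3 not in (a, b) else rest
    return {first, frozenset({keep[side[0]] + 1, keep[side[1]] + 1})}

def count_violations(trials=2000, seed=2):
    """Numbers of sampled matrices violating Proposition 2 and Proposition 3."""
    rng = np.random.default_rng(seed)
    bad2 = bad3 = 0
    for _ in range(trials):
        U = np.triu(rng.uniform(0.01, 1.0, (4, 4)), 1)
        S = U + U.T
        alpha = rng.uniform(0.01, 0.99)
        D = np.zeros((5, 5))
        D[:4, :4] = S
        D[4, :4] = D[:4, 4] = alpha * S[0] + (1 - alpha) * S[1]
        cherries = nj_cherries_5(D)
        if frozenset({1, 2}) in cherries:          # Proposition 2
            bad2 += 1
        if q_argmin(S) in ((0, 1), (2, 3)):        # source topology ((t1,t2),(t3,t4))
            if not ({frozenset({1, 5}), frozenset({2, 5})} & cherries):
                bad3 += 1                          # Proposition 3
    return bad2, bad3

print(count_violations())
```

Such a check only samples the space of source matrices, so it complements rather than replaces the FME-based proofs.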

Cite this article

Kim, J., Disanto, F., Kopelman, N.M. et al. Mathematical and Simulation-Based Analysis of the Behavior of Admixed Taxa in the Neighbor-Joining Algorithm. Bull Math Biol 81, 452–493 (2019). https://doi.org/10.1007/s11538-018-0444-0

Keywords

  • Admixture
  • Neighbor-joining
  • Phylogenetics
  • Rogue taxon