Abstract
We present a method of dimensional reduction for the general Markov model of sequence evolution on a phylogenetic tree. We show that taking certain linear combinations of the associated random variables (site pattern counts) reduces the dimensionality of the model from exponential in the number of extant taxa, to quadratic in the number of taxa, while retaining the ability to statistically identify phylogenetic divergence events. A key feature is the identification of an invariant subspace which depends only bilinearly on the model parameters, in contrast to the usual multi-linear dependence in the full space. We discuss potential applications including the computation of split (edge) weights on phylogenetic trees from observed sequence data.
Notes
Not to be confused with “phylogenetic invariants”; the distinction will be discussed below.
Illustrative examples of flattenings are given in the introduction of Allman and Rhodes (2008).
If one prefers to use unit row-sum Markov matrices, an analogous construction is obtained by taking the transpose in what follows.
The meaning of this will be given in the proof of Theorem 1.
In fact, this is the natural way to define \(\text {Aff}(k-1)\) in the first place.
Exactly how this affects the site pattern probabilities \(p_{i_1i_2\ldots i_L}\) is given in the Appendix (11).
References
Allman ES, Kubatko LS, Rhodes JA (2017) Split scores: a tool to quantify phylogenetic signal in genome-scale data. Syst Biol. doi:10.1093/sysbio/syw103
Allman ES, Rhodes JA (2008) Phylogenetic ideals and varieties for the general Markov model. Adv Appl Math 40(2):127–148
Baker A (2012) Matrix groups: an introduction to Lie group theory. Springer Science & Business Media, New York
Bashford JD, Jarvis PD, Sumner JG, Steel MA (2004) U(1)\(\times \) U(1)\(\times \) U(1) symmetry of the Kimura 3ST model and phylogenetic branching processes. J Phys A Math Gen 37(8):L81
Bryant D (2009) Hadamard phylogenetic methods and the \(n\)-taxon process. Bull Math Biol 71(2):339–351
Casanellas M, Fernández-Sánchez J (2007) Performance of a new invariants method on homogeneous and nonhomogeneous quartet trees. Mol Biol Evol 24(1):288–293
Casanellas M, Fernández-Sánchez J (2011) Relevant phylogenetic invariants of evolutionary models. J Math Pures Appl 96(3):207–229
Cavender JA, Felsenstein J (1987) Invariants of phylogenies in a simple case with discrete states. J Classif 4(1):57–71
Chifman J, Kubatko L (2014) Quartet inference from SNP data under the coalescent model. Bioinformatics 30(23):3317–3324
Draisma J, Kuttler J (2009) On the ideals of equivariant tree models. Math Ann 344(3):619–644
Eriksson N (2005) Tree construction using singular value decomposition. In: Pachter L, Sturmfels B (eds) Algebraic statistics for computational biology, chapter 10. Cambridge University Press, New York, pp 347–358
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17(6):368–376
Felsenstein J (2004) Inferring phylogenies, vol 2. Sinauer Associates, Sunderland
Fernández-Sánchez J, Casanellas M (2016) Invariant versus classical quartet inference when evolution is heterogeneous across sites and lineages. Syst Biol 65(2):280–291
Fitch WM (1971) Toward defining the course of evolution: minimum change for a specific tree topology. Syst Biol 20(4):406–416
Francis AR (2014) An algebraic view of bacterial genome evolution. J Math Biol 69(6–7):1693–1718
Hagedorn TR (2000) A combinatorial approach to determining phylogenetic invariants for the general model. Technical report, CRM-2671
Hendy MD, Penny D, Steel MA (1994) A discrete Fourier analysis for evolutionary trees. Proc Natl Acad Sci 91(8):3339–3343
Holland BR, Jarvis PD, Sumner JG (2013) Low-parameter phylogenetic inference under the general Markov model. Syst Biol 62(1):78–92
Jarvis PD, Sumner JG (2014) Adventures in invariant theory. ANZIAM J 56(02):105–115
Jarvis PD, Sumner JG (2016) Matrix group structure and Markov invariants in the strand symmetric phylogenetic substitution model. J Math Biol 73:259–282
Johnson JE (1985) Markov-type Lie groups in \(\text {GL}(n, R)\). J Math Phys 26(2):252–257
Lake JA (1987) A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol Biol Evol 4(2):167–191
Semple C, Steel M (2003) Phylogenetics, vol 24. Oxford University Press, Oxford
Sturmfels B, Sullivant S (2005) Toric ideals of phylogenetic invariants. J Comput Biol 12(2):204–228
Sumner JG, Charleston MA, Jermiin LS, Jarvis PD (2008) Markov invariants, plethysms, and phylogenetics. J Theor Biol 253(3):601–615
Sumner JG, Fernández-Sánchez J, Jarvis PD (2012a) Lie Markov models. J Theor Biol 298:16–31
Sumner JG, Holland BR, Jarvis PD (2012b) The algebra of the general Markov model on phylogenetic trees and networks. Bull Math Biol 74(4):858–880
Sumner JG, Jarvis PD (2005) Entanglement invariants and phylogenetic branching. J Math Biol 51(1):18–36
Sumner JG, Jarvis PD (2009) Markov invariants and the isotropy subgroup of a quartet tree. J Theor Biol 258(2):302–310
Yang Z (2014) Molecular evolution: a statistical approach. Oxford University Press, Oxford
Acknowledgements
This work was inspired by a question Alexei Drummond put to Barbara Holland during her presentation at the New Zealand Phylogenetics Meeting, DOOM 2016. I would also like to thank the anonymous reviewer for their careful and substantive comments that led to a greatly improved manuscript.
Funding This work was supported by the Australian Research Council Discovery Early Career Fellowship DE130100423.
Appendix: Proof of Theorem 2
Our general approach to the proof will be to give conditions for when the rank of the subflattenings does or does not grow under phylogenetic divergence events. In particular, we will show that the rank of the subflattenings is unchanged after a phylogenetic divergence event which is “consistent” with the split under consideration (the precise meaning of this will become evident below). Although related, our proof method is different in conception from the approach taken in Eriksson (2005) for obtaining the analogous conditions for the ranks of the full flattenings.
Definition 2
Given a rooted tree \({\mathcal {T}}\), consider the subtrees consisting of a vertex in \({\mathcal {T}}\) together with all of its descendants (including the case where the subtree consists of a leaf vertex only). Given a subset A of leaves, we say such a subtree is A-consistent if its leaves are a subset of A. We say an A-consistent subtree is maximally A-consistent if it is not itself a subtree of an A-consistent subtree. Similarly, given a split A|B we say that a subtree is A|B-consistent if its leaves are a subset of A or B; together with the corresponding definition of maximally A|B-consistent.
An example is given in Fig. 2.
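Definition 2 can be made concrete with a short sketch. The tree encoding (a dictionary mapping each internal vertex to its list of children) and the vertex names are hypothetical choices made for illustration only; they are not taken from the paper or from Fig. 2. The search works top-down: the first vertex on any root-to-leaf path whose leaves all lie in A is the root of a maximally A-consistent subtree, because every ancestor has already failed the test.

```python
def leaves_below(tree, v):
    """Set of leaves in the subtree rooted at v (a leaf is a vertex with no children)."""
    children = tree.get(v, [])
    if not children:
        return {v}
    return set().union(*(leaves_below(tree, c) for c in children))


def maximal_consistent_subtrees(tree, root, subset):
    """Roots of the maximally consistent subtrees for a given leaf subset."""
    subset = set(subset)
    found = []

    def visit(v):
        if leaves_below(tree, v) <= subset:
            found.append(v)  # every ancestor failed the test, so v is maximal
            return
        for c in tree.get(v, []):
            visit(c)

    visit(root)
    return found


# Hypothetical five-leaf example: ((t1,t2),(t3,(t4,t5)))
example_tree = {
    "r": ["u", "v"],
    "u": ["t1", "t2"],
    "v": ["t3", "w"],
    "w": ["t4", "t5"],
}
A = {"t1", "t2", "t4"}
B = {"t3", "t5"}
max_A = maximal_consistent_subtrees(example_tree, "r", A)
max_B = maximal_consistent_subtrees(example_tree, "r", B)
```

In this example the maximally A|B-consistent subtrees are simply the union of the two lists, since a subtree is A|B-consistent exactly when its leaves lie entirely in A or entirely in B.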
Lemma 1
If P is a pattern distribution arising from a tree \({\mathcal {T}}\) under the general Markov model, the rank of the subflattening \(\widehat{\mathrm{Flat}}'_{A|B}(P)\) is independent of the size and/or structure of any A|B-consistent subtrees of \({\mathcal {T}}\).
Proof
Consider the molecular state space \(\kappa =\{1,2,\ldots ,k\}\) and a site pattern probability distribution \(p_{i_1i_2\ldots i_L}\) on L taxa. Suppose this distribution arises under the general Markov model on the tree \({\mathcal {T}}\) and subsequently a time-instantaneous divergence event occurs causing, without loss of generality, a copy of the Lth taxon to be created. Under the usual assumptions of this model, this results in a new distribution \(P^+=(p^+_{i_1i_2\ldots i_{L}i_{L+1}})_{i_j\in \kappa }\) on an \(L+1\) taxon tree \({\mathcal {T}}^+\), with
$$\begin{aligned} p^+_{i_1i_2\ldots i_{L}i_{L+1}}={\left\{ \begin{array}{ll} p_{i_1i_2\ldots i_L}, &{} \text {if } i_{L+1}=i_L;\\ 0, &{} \text {otherwise}. \end{array}\right. } \end{aligned}\tag{10}$$
Consider a split A|B and suppose taxon L is contained in B. Consider the new split \(A|B'\) where the new taxon \(L+1\) has been adjoined to B to produce \(B'=B\cup \{L+1\}\). We will show that the subflattening \(\widehat{\text {Flat}}'_{A|B'}(P^+)\) is obtained from \(\widehat{\text {Flat}}'_{A|B}(P)\) by simply repeating \(k-1\) columns.
Let S be any \(k\times k\) matrix consistent with the similarity transformation (4). In particular, this means that the kth row of S is constant, and, without loss of generality, we will assume this is a row of 1s, i.e. \(S_{kj}=1\) for \(j=1,2,\ldots , k\). We denote the application of this similarity transformation to the site pattern distribution as
$$\begin{aligned} q_{i_1i_2\ldots i_L}=\sum _{j_1,j_2,\ldots ,j_L\in \kappa }S_{i_1j_1}S_{i_2j_2}\cdots S_{i_Lj_L}\,p_{j_1j_2\ldots j_L}. \end{aligned}\tag{11}$$
We will refer to these quantities as the “q-coordinates”.
Now suppose, without loss of generality, \(A|B=\{1,2,\ldots ,m\}|\{m+1,m+2,\ldots ,L\}\), and write \(q_{i_1i_2\ldots i_m,j_1j_2\ldots j_n}\) to emphasize the flattening corresponding to this split. After locating the rows and columns which define the form (5), the \((m(k-1)+1)\times (n(k-1)+1)\) entries of the subflattening are seen to be given by
We now consider the effect of the divergence rule (10) on the q-coordinates. Again we suppose that the divergence event occurs on the Lth taxon. As a consequence of (10), a short computation shows that
To construct \(\widehat{\text {Flat}}'_{A|B'}(P^+)\), we must consider three cases (recalling that we are assuming \(S_{kj}=1\) for each \(j=1,2,\ldots ,k\)):
(i) Suppose \(j_n=j_{n+1}= k\). Then
$$\begin{aligned} q^+_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}kk }=q_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}k}. \end{aligned}$$
(ii) Suppose \(j_n=k\) and \(j_{n+1}\ne k\). Then
$$\begin{aligned} q^+_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}kj_{n+1} }=q_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}j_{n+1}}. \end{aligned}$$
(iii) Suppose \(j_n\ne k\) and \(j_{n+1}= k\). Then
$$\begin{aligned} q^+_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}j_nk }=q_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}j_{n}}. \end{aligned}$$
In particular, for each choice \(j=1,2,\ldots ,k-1\),
Comparing to the general form (12), we see that the subflattening \(\widehat{\text {Flat}}_{A|B'}'(P^+)\) is produced from the subflattening \(\widehat{\text {Flat}}_{A|B}'(P)\) by simply repeating \(k-1\) columns. This observation holds more generally, independently of which taxon the divergence event occurs on. The only modification needed is when the divergence happens on the left side of the split A|B, in which case the new subflattening is obtained from the old by a repetition of rows rather than columns. Thus, if we place a new taxon into the same side of the split as the taxon it diverged from, the rank of the subflattening is preserved.
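The column-repetition argument can be checked numerically. The following is a minimal sketch under stated assumptions, not the author's implementation: it assumes the subflattening consists of those q-coordinates in which at most one index on each side of the split differs from k (matching the stated size \((m(k-1)+1)\times (n(k-1)+1)\)), it uses an arbitrary random tensor in place of a tree distribution, and all function names are hypothetical. States are 0-based, so the state k of the text is index k-1.

```python
import numpy as np


def q_coordinates(P, S):
    """Apply the k x k matrix S to every index of the tensor P."""
    Q = P
    for axis in range(P.ndim):
        Q = np.moveaxis(np.tensordot(S, Q, axes=([1], [axis])), 0, axis)
    return Q


def diverge_last(P):
    """Duplicate the last taxon: p+[..., a, b] = p[..., a] if a == b, else 0."""
    k = P.shape[-1]
    P_plus = np.zeros(P.shape + (k,))
    for a in range(k):
        P_plus[..., a, a] = P[..., a]
    return P_plus


def sub_indices(n_taxa, k):
    """Index tuples with at most one entry different from the last state."""
    base = (k - 1,) * n_taxa
    out = [base]
    for pos in range(n_taxa):
        for s in range(k - 1):
            out.append(base[:pos] + (s,) + base[pos + 1:])
    return out


def subflattening(Q, m):
    """Rows come from the first m taxa (side A), columns from the rest (side B)."""
    k = Q.shape[0]
    rows = sub_indices(m, k)
    cols = sub_indices(Q.ndim - m, k)
    return np.array([[Q[r + c] for c in cols] for r in rows])


def demo(k=4, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.random((k, k, k))
    P /= P.sum()                 # arbitrary distribution on three taxa
    S = rng.normal(size=(k, k))
    S[-1, :] = 1.0               # constant last row, as assumed in the proof
    Q = q_coordinates(P, S)
    Q_plus = q_coordinates(diverge_last(P), S)
    last = k - 1
    cases_hold = (
        np.allclose(Q_plus[..., last, :], Q)       # cases (i) and (ii)
        and np.allclose(Q_plus[..., :, last], Q)   # cases (i) and (iii)
    )
    # Split {1,2}|{3} before the divergence, {1,2}|{3,4} after it.
    rank_before = np.linalg.matrix_rank(subflattening(Q, 2), tol=1e-9)
    rank_after = np.linalg.matrix_rank(subflattening(Q_plus, 2), tol=1e-9)
    return cases_hold, rank_before, rank_after
```

With k = 4 the subflattening grows from 7 × 4 to 7 × 7, yet the rank should not change, because the three extra columns duplicate existing ones. The explicit tolerance passed to `matrix_rank` is an illustrative choice to absorb rounding error.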
We now apply Corollary 1 to conclude that, in the generic case, further application of (full rank) Markov matrices at the leaves of the phylogenetic tree \({\mathcal {T}}^+\) also does not affect the rank of the subflattening.
These observations establish the lemma. \(\square \)
Now in order to determine the rank of an arbitrary subflattening, we may repeatedly apply Lemma 1 to reduce to the case where each A|B-consistent subtree is a single leaf. Assuming this situation, each leaf is then either (i) not part of a cherry, or (ii) part of a cherry where the two leaves in the cherry lie on complementary sides of the split A|B. A key feature of this situation is that we can label the descendants of every vertex (excluding the root) with complementary binary labels such that the leaf labels are consistent with the split A|B. For our purposes, we then consider this reduced case as arising from a sequence of divergence events from the base two-taxa case where, after each divergence event at a leaf, the two descendants are placed into complementary sides of the target split A|B. An example illustrating that this process is always possible is given in Fig. 3.
We use this process to establish:
Lemma 2
Suppose \({\mathcal {T}}\) is a tree, suppose P is a distribution arising on \({\mathcal {T}}\) under the general Markov model, and suppose A|B is a split such that the maximally A|B-consistent subtrees are all leaves. Then, in the generic case, the subflattening \(\widehat{\mathrm{Flat}}'_{A|B}(P)\) has maximal rank.
Proof
Suppose such a reduced tree has q-coordinates \(q_{i_1i_2\ldots i_m,j_1j_2\ldots j_n}\), and the nth taxon in B diverges creating a new taxon which is adjoined to A to form the new split \(A'|B\). Analogous to the previous situation, we have the new q-coordinates
From this, we see that the additional \(k-1\) rows in the subflattening \(\widehat{\text {Flat}}'_{A'|B}(P^+)\) are obtained by setting \(i_1=i_2=\cdots =i_m=k\), and taking \(i_{m+1}=1,2,\ldots ,k-1\) in
where the columns are indexed by choosing \(b\in \{1,2,\ldots ,n\}\) so that at most a single \(j_b\ne k\) at a time. In particular, if we choose \(j_1\ne k\) and \(j_2=j_3=\cdots =j_n=k\) we have
Now for each choice \(i_{m+1}=1,2,\ldots , k-1\), this expression gives q-coordinates which do not appear in the subflattening \(\widehat{\text {Flat}}_{A|B}'(P)\) or any of the other rows of \(\widehat{\text {Flat}}'_{A'|B}(P^+)\). It follows that any linear dependencies between the new and remaining rows in \(\widehat{\text {Flat}}'_{A'|B}(P^+)\) would imply linear constraints on the q-coordinates on the original \(m+n\) taxon tree. In turn, this would imply the existence of linear phylogenetic invariants for the general Markov model, which are known not to exist (Hagedorn 2000). Therefore, the new rows appearing in \(\widehat{\text {Flat}}'_{A'|B}(P^+)\) are linearly independent from the rest.
To complete the proof, we use induction on the base case of a two-taxon tree. To establish this base case, we show that, in the generic case, the two-taxon subflattening on the split \(A|B=\{1\}|\{2\}\) has full rank \(k=(k-1)+1\). This follows easily since, in the two-taxon case, the subflattening is equal to the transformed flattening, that is
Thus, the subflattening is related by the similarity transformation S to the flattening \(\text {Flat}(P)_{\{1\}|\{2\}}\), which a standard argument shows can be expressed as
$$\begin{aligned} \text {Flat}(P)_{\{1\}|\{2\}}=M_1D(\pi )M_2^T, \end{aligned}$$
where \(D(\pi )\) is the diagonal matrix formed from the root distribution \(\pi =(\pi _i)_{i\in \kappa }\). Clearly this matrix is full rank if \(M_1\) and \(M_2\) are full rank and \(\pi \) has no zero entries. Thus, in the generic case, the two-taxon subflattening \(\widehat{\text {Flat}}'_{1|2}(P)\) has full rank.
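The base case is easy to illustrate numerically. This sketch assumes unit column-sum Markov matrices, so that the two-taxon flattening has entries \(\sum _r \pi _r (M_1)_{ir}(M_2)_{jr}\); the helper names and the particular way of drawing random matrices are hypothetical and chosen only so that the matrices are guaranteed invertible.

```python
import numpy as np


def random_markov(k, rng):
    """Column-stochastic matrix; strict diagonal dominance guarantees full rank."""
    M = rng.random((k, k)) + k * np.eye(k)
    return M / M.sum(axis=0, keepdims=True)


def two_taxon_flattening(M1, M2, pi):
    """Entry (i, j) equals sum_r pi[r] * M1[i, r] * M2[j, r]."""
    return M1 @ np.diag(pi) @ M2.T


k = 4
rng = np.random.default_rng(1)
M1, M2 = random_markov(k, rng), random_markov(k, rng)

# Root distribution with no zero entries: the flattening has full rank k.
F_generic = two_taxon_flattening(M1, M2, np.array([0.1, 0.2, 0.3, 0.4]))
rank_generic = np.linalg.matrix_rank(F_generic, tol=1e-10)

# Root distribution with two zero entries: the rank drops accordingly.
F_degenerate = two_taxon_flattening(M1, M2, np.array([0.5, 0.5, 0.0, 0.0]))
rank_degenerate = np.linalg.matrix_rank(F_degenerate, tol=1e-10)
```

The degenerate case shows why the condition that \(\pi \) has no zero entries matters: with invertible \(M_1\) and \(M_2\), the rank equals the number of non-zero entries of \(\pi \).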
Induction on this base case establishes the lemma. \(\square \)
With these results in hand, Theorem 2 follows for arbitrary trees and splits by the following three steps:
(i) Apply Lemma 1 and clip off any A|B-consistent subtrees;
(ii) Apply Lemma 2; and
(iii) Use Fitch’s algorithm (Felsenstein 2004; Fitch 1971) to recognize that the minimum of the number of maximally A- and B-consistent subtrees is none other than the parsimony score for the split A|B considered as a binary character at the leaves of \({\mathcal {T}}\).
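Step (iii) can be illustrated with a small sketch of Fitch's algorithm applied to a split viewed as a binary character. It assumes a rooted binary tree encoded as a dictionary from each internal vertex to its two children; this encoding, the function names and the example trees are hypothetical. The sketch also counts maximally consistent subtrees so that the two quantities can be compared on an example.

```python
def fitch_score(tree, root, side):
    """Parsimony score of a character on a rooted binary tree (Fitch 1971).

    tree: dict mapping each internal vertex to its list of children.
    side: dict mapping each leaf to its character state, here "A" or "B".
    """
    changes = 0

    def visit(v):
        nonlocal changes
        children = tree.get(v, [])
        if not children:
            return {side[v]}
        sets = [visit(c) for c in children]
        common = set.intersection(*sets)
        if common:
            return common
        changes += 1              # child state sets disagree: one change needed
        return set.union(*sets)

    visit(root)
    return changes


def leaves_below(tree, v):
    children = tree.get(v, [])
    if not children:
        return {v}
    return set().union(*(leaves_below(tree, c) for c in children))


def count_maximal(tree, root, subset):
    """Number of maximally consistent subtrees for the given leaf subset."""
    if leaves_below(tree, root) <= set(subset):
        return 1
    return sum(count_maximal(tree, c, subset) for c in tree.get(root, []))


# Hypothetical example: ((t1,t2),(t3,(t4,t5))) with A = {t1,t2,t4}, B = {t3,t5}
tree = {"r": ["u", "v"], "u": ["t1", "t2"], "v": ["t3", "w"], "w": ["t4", "t5"]}
A, B = {"t1", "t2", "t4"}, {"t3", "t5"}
side = {leaf: ("A" if leaf in A else "B") for leaf in A | B}

score = fitch_score(tree, "r", side)
min_count = min(count_maximal(tree, "r", A), count_maximal(tree, "r", B))
```

On this example both quantities equal 2. The check covers only the example trees used here; the general statement is the one argued in the text.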
Sumner, J.G. Dimensional Reduction for the General Markov Model on Phylogenetic Trees. Bull Math Biol 79, 619–634 (2017). https://doi.org/10.1007/s11538-017-0249-6
DOI: https://doi.org/10.1007/s11538-017-0249-6