
Dimensional Reduction for the General Markov Model on Phylogenetic Trees

  • Original Article
  • Bulletin of Mathematical Biology

Abstract

We present a method of dimensional reduction for the general Markov model of sequence evolution on a phylogenetic tree. We show that taking certain linear combinations of the associated random variables (site pattern counts) reduces the dimensionality of the model from exponential in the number of extant taxa, to quadratic in the number of taxa, while retaining the ability to statistically identify phylogenetic divergence events. A key feature is the identification of an invariant subspace which depends only bilinearly on the model parameters, in contrast to the usual multi-linear dependence in the full space. We discuss potential applications including the computation of split (edge) weights on phylogenetic trees from observed sequence data.


Notes

  1. Not to be confused with “phylogenetic invariants”; the distinction will be discussed below.

  2. Illustrative examples of flattenings are given in the introduction of Allman and Rhodes (2008).

  3. If one prefers to use unit row-sum Markov matrices, an analogous construction is obtained by taking the transpose in what follows.

  4. The meaning of this will be given in the proof of Theorem 1.

  5. In fact, this is the natural way to define \(\text {Aff}(k-1)\) in the first place.

  6. Exactly how this affects the site pattern probabilities \(p_{i_1i_2\ldots i_L}\) is given in the Appendix (11).

References

  • Allman ES, Kubatko LS, Rhodes JA (2017) Split scores: a tool to quantify phylogenetic signal in genome-scale data. Syst Biol. doi:10.1093/sysbio/syw103

  • Allman ES, Rhodes JA (2008) Phylogenetic ideals and varieties for the general Markov model. Adv Appl Math 40(2):127–148

  • Baker A (2012) Matrix groups: an introduction to Lie group theory. Springer Science & Business Media, New York

  • Bashford JD, Jarvis PD, Sumner JG, Steel MA (2004) \(U(1)\times U(1)\times U(1)\) symmetry of the Kimura 3ST model and phylogenetic branching processes. J Phys A Math Gen 37(8):L81

  • Bryant D (2009) Hadamard phylogenetic methods and the \(n\)-taxon process. Bull Math Biol 71(2):339–351

  • Casanellas M, Fernández-Sánchez J (2007) Performance of a new invariants method on homogeneous and nonhomogeneous quartet trees. Mol Biol Evol 24(1):288–293

  • Casanellas M, Fernández-Sánchez J (2011) Relevant phylogenetic invariants of evolutionary models. Journal de Mathématiques Pures et Appliquées 96(3):207–229

  • Cavender JA, Felsenstein J (1987) Invariants of phylogenies in a simple case with discrete states. J Classif 4(1):57–71

  • Chifman J, Kubatko L (2014) Quartet inference from SNP data under the coalescent model. Bioinformatics 30(23):3317–3324

  • Draisma J, Kuttler J (2009) On the ideals of equivariant tree models. Math Ann 344(3):619–644

  • Eriksson N (2005) Tree construction using singular value decomposition. In: Pachter L, Sturmfels B (eds) Algebraic statistics for computational biology, chapter 10. Cambridge University Press, New York, pp 347–358

  • Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17(6):368–376

  • Felsenstein J (2004) Inferring phylogenies, vol 2. Sinauer Associates, Sunderland

  • Fernández-Sánchez J, Casanellas M (2016) Invariant versus classical quartet inference when evolution is heterogeneous across sites and lineages. Syst Biol 65(2):280–291

  • Fitch WM (1971) Toward defining the course of evolution: minimum change for a specific tree topology. Syst Biol 20(4):406–416

  • Francis AR (2014) An algebraic view of bacterial genome evolution. J Math Biol 69(6–7):1693–1718

  • Hagedorn TR (2000) A combinatorial approach to determining phylogenetic invariants for the general model. Technical report, CRM-2671

  • Hendy MD, Penny D, Steel MA (1994) A discrete Fourier analysis for evolutionary trees. Proc Natl Acad Sci 91(8):3339–3343

  • Holland BR, Jarvis PD, Sumner JG (2013) Low-parameter phylogenetic inference under the general Markov model. Syst Biol 62(1):78–92

  • Jarvis PD, Sumner JG (2014) Adventures in invariant theory. ANZIAM J 56(02):105–115

  • Jarvis PD, Sumner JG (2016) Matrix group structure and Markov invariants in the strand symmetric phylogenetic substitution model. J Math Biol 73:259–282

  • Johnson JE (1985) Markov-type Lie groups in \(\text{GL}(n, R)\). J Math Phys 26(2):252–257

  • Lake JA (1987) A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol Biol Evol 4(2):167–191

  • Semple C, Steel M (2003) Phylogenetics, vol 24. Oxford University Press, Oxford

  • Sturmfels B, Sullivant S (2005) Toric ideals of phylogenetic invariants. J Comput Biol 12(2):204–228

  • Sumner JG, Charleston MA, Jermiin LS, Jarvis PD (2008) Markov invariants, plethysms, and phylogenetics. J Theor Biol 253(3):601–615

  • Sumner JG, Fernández-Sánchez J, Jarvis PD (2012a) Lie Markov models. J Theor Biol 298:16–31

  • Sumner JG, Holland BR, Jarvis PD (2012b) The algebra of the general Markov model on phylogenetic trees and networks. Bull Math Biol 74(4):858–880

  • Sumner JG, Jarvis PD (2005) Entanglement invariants and phylogenetic branching. J Math Biol 51(1):18–36

  • Sumner JG, Jarvis PD (2009) Markov invariants and the isotropy subgroup of a quartet tree. J Theor Biol 258(2):302–310

  • Yang Z (2014) Molecular evolution: a statistical approach. Oxford University Press, Oxford

Acknowledgements

This work was inspired by a question Alexei Drummond put to Barbara Holland during her presentation at the New Zealand Phylogenetics Meeting, DOOM 2016. I would also like to thank the anonymous reviewer for their careful and substantive comments that led to a greatly improved manuscript.

Funding: This work was supported by the Australian Research Council Discovery Early Career Fellowship DE130100423.


Correspondence to Jeremy G. Sumner.

Appendix: Proof of Theorem 2

Our general approach to the proof will be to give conditions for when the rank of the subflattenings does or does not grow under phylogenetic divergence events. In particular, we will show that the rank of the subflattenings is unchanged after a phylogenetic divergence event which is “consistent” with the split under consideration (the precise meaning of this will become evident below). Although related, our proof method is different in conception from the approach taken in Eriksson (2005) for obtaining the analogous conditions for the ranks of the full flattenings.

Definition 2

Given a rooted tree \({\mathcal {T}}\), consider the subtrees consisting of a vertex in \({\mathcal {T}}\) together with all of its descendants (including the case where the subtree consists of a leaf vertex only). Given a subset A of leaves, we say such a subtree is A-consistent if its leaves are a subset of A. We say an A-consistent subtree is maximally A-consistent if it is not itself a subtree of an A-consistent subtree. Similarly, given a split A|B we say that a subtree is A|B-consistent if its leaves are a subset of A or B; together with the corresponding definition of maximally A|B-consistent.

An example is given in Fig. 2.

Fig. 2

A rooted tree with two maximally \(A|B=\{2,3,4,6\}|\{1,5,7,8,9,10\}\) consistent subtrees indicated. The subtree with leaf set \(\{3,4\}\) is A-consistent, but not maximally so
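Definition 2 can be illustrated with a short sketch. The tree below is a hypothetical one written as nested tuples; it is not the tree drawn in Fig. 2, whose topology is not reproduced here, although it uses the same split. The function names `leaves` and `maximal_consistent` are illustrative choices.

```python
def leaves(tree):
    """Leaf set of a rooted tree given as nested tuples with integer leaves."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def maximal_consistent(tree, A):
    """Leaf sets of the maximally A-consistent subtrees, found top-down.

    A subtree reached by this recursion has no A-consistent ancestor, so the
    first consistent subtree met on each path from the root is maximal.
    """
    if leaves(tree) <= A:
        return [sorted(leaves(tree))]
    if not isinstance(tree, tuple):      # a leaf that is not in A
        return []
    return [s for child in tree for s in maximal_consistent(child, A)]

# Hypothetical tree (an assumption for illustration, not the tree of Fig. 2).
tree = ((1, (2, (3, 4))), ((5, 6), ((7, 8), (9, 10))))
A = {2, 3, 4, 6}
B = {1, 5, 7, 8, 9, 10}

max_A = maximal_consistent(tree, A)   # subtree {3, 4} is absorbed into {2, 3, 4}
max_B = maximal_consistent(tree, B)
print(max_A, max_B)
```

In this toy tree the subtree with leaf set \(\{3,4\}\) is A-consistent but is not reported, because it sits inside the larger A-consistent subtree with leaf set \(\{2,3,4\}\).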

Lemma 1

If P is a pattern distribution arising from a tree \({\mathcal {T}}\) under the general Markov model, the rank of the subflattening \(\widehat{\mathrm{Flat}}'_{A|B}(P)\) is independent of the size and/or structure of any A|B-consistent subtrees of \({\mathcal {T}}\).

Proof

Consider the molecular state space \(\kappa =\{1,2,\ldots ,k\}\) and a site pattern probability distribution \(p_{i_1i_2\ldots i_L}\) on L taxa. Suppose this distribution arises under the general Markov model on the tree \({\mathcal {T}}\) and subsequently a time-instantaneous divergence event occurs causing, without loss of generality, a copy of the Lth taxon to be created. Under the usual assumptions of this model, this results in a new distribution \(P^+=(p^+_{i_1i_2\ldots i_{L}i_{L+1}})_{i_j\in \kappa }\) on an \(L+1\) taxon tree \({\mathcal {T}}^+\), with

$$\begin{aligned} p^+_{i_1i_2\ldots i_{L}i_{L+1}}= \left\{ \begin{array}{ll} p_{i_1i_2\ldots i_L},&\quad \text {if}\ i_{L}=i_{L+1};\\ 0,&\quad \text {otherwise.} \end{array} \right. \end{aligned}$$
(10)

Consider a split A|B and suppose taxon L is contained in B. Consider the new split \(A|B'\) where the new taxon \(L+1\) has been adjoined to B to produce \(B'=B\cup \{L+1\}\). We will show that the subflattening \(\widehat{\text {Flat}}'_{A|B'}(P^+)\) is obtained from \(\widehat{\text {Flat}}'_{A|B}(P)\) by simply repeating \(k-1\) columns.

Let S be any \(k\times k\) matrix consistent with the similarity transformation (4). In particular, this means that the kth row of S is constant, and, without loss of generality, we will assume this is a row of 1s, i.e. \(S_{kj}=1\) for \(j=1,2,\ldots , k\). We denote the application of this similarity transformation to the site pattern distribution as

$$\begin{aligned} q_{i_1i_2\ldots i_L}:=\sum _{j_1,j_2,\ldots ,j_L\in \kappa }S_{i_1j_1}S_{i_2j_2}\ldots S_{i_Lj_L}\,p_{j_1j_2\ldots j_L}. \end{aligned}$$
(11)

We will refer to these quantities as the “q-coordinates”.
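As a small worked illustration, take \(k=2\), \(L=2\) and the matrix \(S\) below. This particular \(S\) is an assumed choice whose last row is constant and equal to 1; it is not taken from the text. Then (11) gives

$$\begin{aligned} S=\begin{pmatrix} 1 & 0\\ 1 & 1 \end{pmatrix}:\qquad q_{11}&=p_{11}, \qquad q_{12}=p_{11}+p_{12},\\ q_{21}&=p_{11}+p_{21}, \qquad q_{22}=p_{11}+p_{12}+p_{21}+p_{22}=1. \end{aligned}$$

With this choice, an index equal to \(k\) in a q-coordinate corresponds to summing the pattern probabilities over the states of that taxon, so the all-\(k\) coordinate is the total probability.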

Now suppose, without loss of generality, \(A|B=\{1,2,\ldots ,m\}|\{m+1,m+2,\ldots ,L\}\) with \(n=L-m\), and write \(q_{i_1i_2\ldots i_m,j_1j_2\ldots j_n}\) to emphasize the flattening corresponding to this split. After locating the rows and columns which define the form (5), the subflattening is seen to be the \((m(k-1)+1)\times (n(k-1)+1)\) matrix with entries

$$\begin{aligned} q_{i_1i_2\ldots i_m,j_1j_2\ldots j_n},\quad \text {where at most one } i_a\ne k \text { and at most one } j_b\ne k. \end{aligned}$$
(12)

We now consider the effect of the divergence rule (10) on the q-coordinates. Again we suppose that the divergence event occurs on the Lth taxon. As a consequence of (10), a short computation shows that

$$\begin{aligned} q^+_{i_1i_2\ldots i_m, j_{1}j_2\ldots j_{n-1}j_nj_{n+1} } =\sum _{j,j'\in \kappa }S_{j_n j}S_{j_{n+1}j}S^{-1}_{jj'}q_{i_1i_2\ldots i_m, j_1j_2\ldots j_{n-1}j'}. \end{aligned}$$
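The short computation can be spelled out as follows. Here \(\tilde{q}_{\ldots ;j}\) is auxiliary notation introduced only for this step (it does not appear in the original text): it denotes the distribution transformed by \(S\) on every index except the last, with the dots abbreviating the fixed indices \(i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}\). Applying (11) to \(P^+\) and using (10) to collapse the sum over the last two indices gives

$$\begin{aligned} q_{\ldots j'}&=\sum _{j\in \kappa }S_{j'j}\,\tilde{q}_{\ldots ;j} \quad \Longrightarrow \quad \tilde{q}_{\ldots ;j}=\sum _{j'\in \kappa }S^{-1}_{jj'}\,q_{\ldots j'},\\ q^+_{\ldots j_nj_{n+1}}&=\sum _{j\in \kappa }S_{j_nj}S_{j_{n+1}j}\,\tilde{q}_{\ldots ;j} =\sum _{j,j'\in \kappa }S_{j_nj}S_{j_{n+1}j}S^{-1}_{jj'}\,q_{\ldots j'}. \end{aligned}$$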

To construct \(\widehat{\text {Flat}}'_{A|B'}(P^+)\), we must consider three cases (recalling that we are assuming \(S_{kj}=1\) for each \(j=1,2,\ldots ,k\)):

  (i) Suppose \(j_n=j_{n+1}= k\). Then

    $$\begin{aligned} q^+_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}kk }=q_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}k}. \end{aligned}$$

  (ii) Suppose \(j_n=k\) and \(j_{n+1}\ne k\). Then

    $$\begin{aligned} q^+_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}kj_{n+1} }=q_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}j_{n+1}}. \end{aligned}$$

  (iii) Suppose \(j_n\ne k\) and \(j_{n+1}= k\). Then

    $$\begin{aligned} q^+_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}j_nk }=q_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}j_{n}}. \end{aligned}$$
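Each case follows from the identity \(\sum _{j\in \kappa }S_{aj}S^{-1}_{jj'}=\delta _{aj'}\) once one of the two factors of \(S\) has been replaced by 1. As a worked step for case (ii), writing the dots for the fixed indices \(i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}\) and using \(S_{kj}=1\):

$$\begin{aligned} q^+_{\ldots kj_{n+1}}=\sum _{j,j'\in \kappa }S_{j_{n+1}j}S^{-1}_{jj'}\,q_{\ldots j'} =\sum _{j'\in \kappa }\delta _{j_{n+1}j'}\,q_{\ldots j'}=q_{\ldots j_{n+1}}. \end{aligned}$$

Case (iii) is the same with the roles of \(j_n\) and \(j_{n+1}\) exchanged, and in case (i) the one remaining factor can be written as \(S_{kj}\), so the sum over \(j\) yields \(\delta _{kj'}\).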

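In particular, taking \(j_1=j_2=\cdots =j_{n-1}=k\) in cases (ii) and (iii) gives, for each choice \(j=1,2,\ldots ,k-1\),

$$\begin{aligned} q^+_{i_1i_2\ldots i_m,kk\ldots kkj}=q^+_{i_1i_2\ldots i_m,kk\ldots kjk}=q_{i_1i_2\ldots i_m,kk\ldots kj}. \end{aligned}$$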

Comparing to the general form (12), we see that the subflattening \(\widehat{\text {Flat}}_{A|B'}'(P^+)\) is produced from the subflattening \(\widehat{\text {Flat}}_{A|B}'(P)\) by simply repeating \(k-1\) columns. This observation holds more generally, independently of which taxon the divergence event occurs on. The only modification needed is when the divergence happens on the left side of the split A|B, in which case the new subflattening is obtained from the old by a repetition of rows rather than columns. Thus, if we place a new taxon into the same side of the split as the taxon it diverged from, the rank of the subflattening is preserved.

We now apply Corollary 1 to conclude that, in the generic case, further application of (full rank) Markov matrices at the leaves of the phylogenetic tree \({\mathcal {T}}^+\) also does not affect the rank of the subflattening.

These observations establish the lemma. \(\square \)
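The column repetition used in the proof of Lemma 1 can be checked numerically with a minimal sketch. Several things here are assumptions made for illustration: states are indexed from 0 so that state \(k-1\) plays the role of \(k\); the matrix `S` is one convenient choice with a final row of 1s; the table `p` is an arbitrary integer table standing in for site pattern counts rather than a distribution generated on a tree; and all function names are hypothetical. The split takes \(A\) to be the first `m` taxa.

```python
from itertools import product
from math import prod

def q_coordinates(p, S, L):
    """Transform a pattern table p by S on every taxon index, as in (11)."""
    patterns = list(product(range(len(S)), repeat=L))
    return {i: sum(prod(S[a][b] for a, b in zip(i, j)) * p[j] for j in patterns)
            for i in patterns}

def diverge_last(p, k):
    """Rule (10): copy the last taxon; patterns whose two copies differ get 0."""
    return {j + (s,): (v if s == j[-1] else 0)
            for j, v in p.items() for s in range(k)}

def reduced_indices(n, k):
    """Index tuples of length n with at most one entry different from state k-1."""
    base = (k - 1,) * n
    return [base] + [base[:pos] + (s,) + base[pos + 1:]
                     for pos in range(n) for s in range(k - 1)]

def subflattening(q, m, n, k):
    """Rows use reduced indices on the first m taxa, columns on the last n."""
    return [[q[r + c] for c in reduced_indices(n, k)]
            for r in reduced_indices(m, k)]

k, m, n = 3, 1, 2
S = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]      # assumed S; last row is all 1s
# Arbitrary integer table (illustrative only, not data from the paper).
p = {j: 1 + (7 * j[0] + 3 * j[1] * j[1] + 5 * j[2]) % 11
     for j in product(range(k), repeat=m + n)}

F = subflattening(q_coordinates(p, S, m + n), m, n, k)
F_plus = subflattening(q_coordinates(diverge_last(p, k), S, m + n + 1),
                       m, n + 1, k)

old_cols = len(F[0])
# The first columns of the new subflattening reproduce the old one ...
same_prefix = [row[:old_cols] for row in F_plus] == F
# ... and the k-1 extra columns repeat the columns of the diverging taxon.
repeated = [row[old_cols:] for row in F_plus] == [row[-(k - 1):] for row in F]
print(same_prefix, repeated)
```

Integer entries are used so that the comparison is exact. Because the identity is purely algebraic, it holds for any table `p`, which is why no tree model is simulated in this sketch; both flags should come out true.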

Now in order to determine the rank of an arbitrary subflattening, we may repeatedly apply Lemma 1 to reduce to the case where each A|B-consistent subtree is a single leaf. Assuming this situation, each leaf is then either (i) not part of a cherry, or (ii) part of a cherry where the two leaves in the cherry lie on complementary sides of the split A|B. A key feature of this situation is that we can label the descendants of every vertex (excluding the root) with complementary binary labels such that the leaf labels are consistent with the split A|B. For our purposes, we then consider this reduced case as arising from a sequence of divergence events from the base two-taxa case where, after each divergence event at a leaf, the two descendants are placed into complementary sides of the target split A|B. An example illustrating that this process is always possible is given in Fig. 3.

Fig. 3

Given a tree \({\mathcal {T}}\) and a split A|B on its leaf set, leaves belonging to A are labelled by “\(+\)” and leaves in B are labelled by “−”. The tree is reduced by removing any A|B-consistent subtrees, and binary labels are attached to the vertices (excluding the root) such that the descendants of each vertex obtain complementary labels and the leaf labels are consistent with the split A|B. In the case illustrated, the second step follows as a consequence of the two leaves that are not part of a cherry. Step 1: Reduce each maximally A|B-consistent subtree to a leaf. Step 2: Label each internal vertex (excluding the root) consistently so that descendants of internal vertices are distinctly labelled. Step 3: Arbitrarily resolve any remaining ambiguities.

We use this process to establish:

Lemma 2

Suppose \({\mathcal {T}}\) is a tree, suppose P is a distribution arising on \({\mathcal {T}}\) under the general Markov model, and suppose A|B is a split such that the maximally A|B-consistent subtrees are all leaves. Then, in the generic case, the subflattening \(\widehat{\mathrm{Flat}}'_{A|B}(P)\) has maximal rank.

Proof

Suppose such a reduced tree has q-coordinates \(q_{i_1i_2\ldots i_m,j_1j_2\ldots j_n}\), and the nth taxon in B diverges creating a new taxon which is adjoined to A to form the new split \(A'|B\). Analogous to the previous situation, we have the new q-coordinates

$$\begin{aligned} q^+_{i_1i_2\ldots i_mi_{m+1},j_1j_2\ldots j_n}=\sum _{j,j'=1}^kS_{i_{m+1}j}S_{j_{n}j}S_{jj'}^{-1}q_{i_1i_2\ldots i_{m},j_1j_2\ldots j_{n-1}j'}. \end{aligned}$$

From this, we see that the additional \(k-1\) rows in the subflattening \(\widehat{\text {Flat}}'_{A'|B}(P^+)\) are obtained by setting \(i_1=i_2=\cdots =i_m=k\), and taking \(i_{m+1}=1,2,\ldots ,k-1\) in

$$\begin{aligned} q^+_{kk\ldots ki_{m+1},j_1j_2\ldots j_n}=\sum _{j,j'=1}^kS_{i_{m+1}j}S_{j_{n}j}S_{jj'}^{-1}q_{kk\ldots k,j_1j_2\ldots j_{n-1}j'}, \end{aligned}$$

where the columns are indexed by choosing \(b\in \{1,2,\ldots ,n\}\) so that at most a single \(j_b\ne k\) at a time. In particular, if we choose \(j_1\ne k\) and \(j_2=j_3=\cdots =j_n=k\) we have

$$\begin{aligned} q^+_{kk\ldots ki_{m+1},j_1kk\ldots k}=q_{kk\ldots k,j_1kk\ldots ki_{m+1}}. \end{aligned}$$

Now for each choice \(i_{m+1}=1,2,\ldots , k-1\), this expression gives q-coordinates which do not appear in the subflattening \(\widehat{\text {Flat}}_{A|B}'(P)\) or any of the other rows of \(\widehat{\text {Flat}}'_{A'|B}(P^+)\). It follows that any linear dependencies between the new and remaining rows in \(\widehat{\text {Flat}}'_{A'|B}(P^+)\) would imply linear constraints on the q-coordinates on the original \(m+n\) taxon tree. In turn, this would imply the existence of linear phylogenetic invariants for the general Markov model, which are known not to exist (Hagedorn 2000). Therefore, the new rows appearing in \(\widehat{\text {Flat}}'_{A'|B}(P^+)\) are linearly independent from the rest.

To complete the proof, we use induction on the base case of a two-taxon tree. To establish this base case, we show that, in the generic case, the two-taxon subflattening on the split \(A|B=\{1\}|\{2\}\) has full rank \(k=(k-1)+1\). This follows easily since, in the two-taxon case, the subflattening is equal to the transformed flattening, that is

$$\begin{aligned} \widehat{\text {Flat}}'_{\{1\}|\{2\}}(P)= \text {Flat}'_{\{1\}|\{2\}}(P)=S\,\text {Flat}_{\{1\}|\{2\}}(P)\,S^{-1}. \end{aligned}$$

Thus, the subflattening is related by the similarity transformation S to the flattening \(\text {Flat}_{\{1\}|\{2\}}(P)\), which a standard argument shows can be expressed as

$$\begin{aligned} \text {Flat}_{\{1\}|\{2\}}(P)=M_1D(\pi )M_2^T, \end{aligned}$$

where \(D(\pi )\) is the diagonal matrix formed from the root distribution \(\pi =(\pi _i)_{i\in \kappa }\). Clearly this matrix has full rank if \(M_1\) and \(M_2\) have full rank and \(\pi \) has no zero entries. Thus, in the generic case, the two-taxon subflattening \(\widehat{\text {Flat}}'_{\{1\}|\{2\}}(P)\) has full rank.
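The full-rank claim can be made explicit with a determinant. This is a standard identity, added here as a worked step, using multiplicativity of the determinant and \(\det (M_2^T)=\det (M_2)\):

$$\begin{aligned} \det \text {Flat}_{\{1\}|\{2\}}(P)=\det (M_1)\,\Big (\prod _{i\in \kappa }\pi _i\Big )\,\det (M_2), \end{aligned}$$

which is nonzero exactly when \(M_1\) and \(M_2\) are invertible and every \(\pi _i\) is nonzero.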

Induction on this base case establishes the lemma. \(\square \)

With these results in hand, Theorem 2 follows for arbitrary trees and splits by the following three steps:

  (i) Apply Lemma 1 and clip off any A|B-consistent subtrees;

  (ii) Apply Lemma 2; and

  (iii) Use Fitch’s algorithm (Felsenstein 2004; Fitch 1971) to recognize that the minimum of the number of maximally A- and B-consistent subtrees is none other than the parsimony score for the split A|B considered as a binary character at the leaves of \({\mathcal {T}}\).
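Step (iii) can be sketched in code as a single Fitch pass for a binary character. This is a minimal illustration under stated assumptions: the tree is a hypothetical rooted binary tree written as nested pairs (it is not a tree from the paper), and the function name `fitch` is an illustrative choice.

```python
def fitch(tree, A):
    """Fitch pass for the binary character 'leaf is in A' / 'leaf is not in A'.

    Returns (candidate state set at this vertex, number of changes below it).
    The tree is assumed to be rooted and binary, given as nested pairs.
    """
    if not isinstance(tree, tuple):          # leaf: its state is known
        return {tree in A}, 0
    (s1, c1), (s2, c2) = fitch(tree[0], A), fitch(tree[1], A)
    if s1 & s2:                              # children agree on some state
        return s1 & s2, c1 + c2
    return s1 | s2, c1 + c2 + 1              # disagreement costs one change

# Hypothetical tree and split (assumptions for illustration only).
tree = ((1, (2, (3, 4))), ((5, 6), ((7, 8), (9, 10))))
A = {2, 3, 4, 6}

score = fitch(tree, A)[1]
print(score)
```

For this toy tree the maximally A-consistent subtrees have leaf sets \(\{2,3,4\}\) and \(\{6\}\), while the maximally B-consistent subtrees have leaf sets \(\{1\}\), \(\{5\}\) and \(\{7,8,9,10\}\), so the minimum of the two counts is 2, in agreement with the parsimony score computed above.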


Cite this article

Sumner, J.G. Dimensional Reduction for the General Markov Model on Phylogenetic Trees. Bull Math Biol 79, 619–634 (2017). https://doi.org/10.1007/s11538-017-0249-6
