Dimensional Reduction for the General Markov Model on Phylogenetic Trees

Sumner, Jeremy G.

doi:10.1007/s11538-017-0249-6

Original Article
Published: 10 February 2017

Bulletin of Mathematical Biology, volume 79, pages 619–634 (2017)

Jeremy G. Sumner ORCID: orcid.org/0000-0001-9820-0235¹

Abstract

We present a method of dimensional reduction for the general Markov model of sequence evolution on a phylogenetic tree. We show that taking certain linear combinations of the associated random variables (site pattern counts) reduces the dimensionality of the model from exponential in the number of extant taxa, to quadratic in the number of taxa, while retaining the ability to statistically identify phylogenetic divergence events. A key feature is the identification of an invariant subspace which depends only bilinearly on the model parameters, in contrast to the usual multi-linear dependence in the full space. We discuss potential applications including the computation of split (edge) weights on phylogenetic trees from observed sequence data.

Notes

Not to be confused with “phylogenetic invariants”; the distinction will be discussed below.
Illustrative examples of flattenings are given in the introduction of Allman and Rhodes (2008).
If one prefers to use unit row-sum Markov matrices, an analogous construction is obtained by taking the transpose in what follows.
The meaning of this will be given in the proof of Theorem 1.
In fact, this is the natural way to define $\text {Aff}(k-1)$ in the first place.
Exactly how this affects the site pattern probabilities $p_{i_1i_2\ldots i_L}$ is given in the Appendix (11).

References

Allman ES, Kubatko LS, Rhodes JA (2017) Split scores: a tool to quantify phylogenetic signal in genome-scale data. Syst Biol. doi:10.1093/sysbio/syw103
Allman ES, Rhodes JA (2008) Phylogenetic ideals and varieties for the general Markov model. Adv Appl Math 40(2):127–148
Baker A (2012) Matrix groups: an introduction to Lie group theory. Springer Science & Business Media, New York
Bashford JD, Jarvis PD, Sumner JG, Steel MA (2004) U(1)$\times $ U(1)$\times $ U(1) symmetry of the Kimura 3ST model and phylogenetic branching processes. J Phys A Math Gen 37(8):L81
Bryant D (2009) Hadamard phylogenetic methods and the $n$-taxon process. Bull Math Biol 71(2):339–351
Casanellas M, Fernández-Sánchez J (2007) Performance of a new invariants method on homogeneous and nonhomogeneous quartet trees. Mol Biol Evol 24(1):288–293
Casanellas M, Fernández-Sánchez J (2011) Relevant phylogenetic invariants of evolutionary models. Journal de Mathématiques Pures et Appliquées 96(3):207–229
Cavender JA, Felsenstein J (1987) Invariants of phylogenies in a simple case with discrete states. J Classif 4(1):57–71
Chifman J, Kubatko L (2014) Quartet inference from SNP data under the coalescent model. Bioinformatics 30(23):3317–3324
Draisma J, Kuttler J (2009) On the ideals of equivariant tree models. Math Ann 344(3):619–644
Eriksson N (2005) Tree construction using singular value decomposition. In: Pachter L, Sturmfels B (eds) Algebraic statistics for computational biology, chapter 10. Cambridge University Press, New York, pp 347–358
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17(6):368–376
Felsenstein J (2004) Inferring phylogenies, vol 2. Sinauer Associates, Sunderland
Fernández-Sánchez J, Casanellas M (2016) Invariant versus classical quartet inference when evolution is heterogeneous across sites and lineages. Syst Biol 65(2):280–291
Fitch WM (1971) Toward defining the course of evolution: minimum change for a specific tree topology. Syst Biol 20(4):406–416
Francis AR (2014) An algebraic view of bacterial genome evolution. J Math Biol 69(6–7):1693–1718
Hagedorn TR (2000) A combinatorial approach to determining phylogenetic invariants for the general model. Technical report, CRM-2671
Hendy MD, Penny D, Steel MA (1994) A discrete Fourier analysis for evolutionary trees. Proc Natl Acad Sci 91(8):3339–3343
Holland BR, Jarvis PD, Sumner JG (2013) Low-parameter phylogenetic inference under the general Markov model. Syst Biol 62(1):78–92
Jarvis PD, Sumner JG (2014) Adventures in invariant theory. ANZIAM J 56(2):105–115
Jarvis PD, Sumner JG (2016) Matrix group structure and Markov invariants in the strand symmetric phylogenetic substitution model. J Math Biol 73:259–282
Johnson JE (1985) Markov-type Lie groups in $\text {GL}(n, R)$. J Math Phys 26(2):252–257
Lake JA (1987) A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol Biol Evol 4(2):167–191
Semple C, Steel M (2003) Phylogenetics, vol 24. Oxford University Press, Oxford
Sturmfels B, Sullivant S (2005) Toric ideals of phylogenetic invariants. J Comput Biol 12(2):204–228
Sumner JG, Charleston MA, Jermiin LS, Jarvis PD (2008) Markov invariants, plethysms, and phylogenetics. J Theor Biol 253(3):601–615
Sumner JG, Fernández-Sánchez J, Jarvis PD (2012a) Lie Markov models. J Theor Biol 298:16–31
Sumner JG, Holland BR, Jarvis PD (2012b) The algebra of the general Markov model on phylogenetic trees and networks. Bull Math Biol 74(4):858–880
Sumner JG, Jarvis PD (2005) Entanglement invariants and phylogenetic branching. J Math Biol 51(1):18–36
Sumner JG, Jarvis PD (2009) Markov invariants and the isotropy subgroup of a quartet tree. J Theor Biol 258(2):302–310
Yang Z (2014) Molecular evolution: a statistical approach. Oxford University Press, Oxford

Acknowledgements

This work was inspired by a question Alexei Drummond put to Barbara Holland during her presentation at the New Zealand Phylogenetics Meeting, DOOM 2016. I would also like to thank the anonymous reviewer for their careful and substantive comments, which led to a greatly improved manuscript.

Funding: This work was supported by the Australian Research Council Discovery Early Career Fellowship DE130100423.

Author information

Authors and Affiliations

School of Physical Sciences, Mathematics, University of Tasmania, Private Bag 37, GPO, Hobart, TAS, 7001, Australia
Jeremy G. Sumner

Corresponding author

Correspondence to Jeremy G. Sumner.

Appendix: Proof of Theorem 2

Our general approach to the proof will be to give conditions for when the rank of the subflattenings does or does not grow under phylogenetic divergence events. In particular, we will show that the rank of the subflattenings is unchanged after a phylogenetic divergence event which is “consistent” with the split under consideration (the precise meaning of this will become evident below). Although related, our proof method is different in conception from the approach taken in Eriksson (2005) for obtaining the analogous conditions for the ranks of the full flattenings.

Definition 2

Given a rooted tree ${\mathcal {T}}$, consider the subtrees consisting of a vertex in ${\mathcal {T}}$ together with all of its descendants (including the case where the subtree consists of a leaf vertex only). Given a subset A of leaves, we say such a subtree is A-consistent if its leaves are a subset of A. We say an A-consistent subtree is maximally A-consistent if it is not itself a subtree of an A-consistent subtree. Similarly, given a split A|B we say that a subtree is A|B-consistent if its leaves are a subset of A or B; together with the corresponding definition of maximally A|B-consistent.

An example is given in Fig. 2.
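As an illustration of Definition 2, the following minimal Python sketch lists the leaf sets of the maximally A|B-consistent subtrees. The nested-tuple tree encoding, the function names and the five-taxon example are assumptions made here for illustration; they are not taken from the article (in particular, the example is not the tree of Fig. 2). The sketch assumes A|B partitions the leaf set.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple rooted tree."""
    if not isinstance(tree, tuple):
        return {tree}
    out = set()
    for child in tree:
        out |= leaves(child)
    return out


def maximal_consistent_subtrees(tree, A, B):
    """Leaf sets of the maximally A|B-consistent subtrees of a rooted tree."""
    tips = leaves(tree)
    if tips <= set(A) or tips <= set(B):
        # Every leaf below this vertex lies on one side of the split and,
        # because we descend from the root, no larger consistent subtree contains it.
        return [frozenset(tips)]
    found = []
    for child in tree:
        found.extend(maximal_consistent_subtrees(child, A, B))
    return found


# Hypothetical example: the rooted tree ((1,2),(3,(4,5))) with split {1,2,4}|{3,5}.
tree = ((1, 2), (3, (4, 5)))
blocks = maximal_consistent_subtrees(tree, {1, 2, 4}, {3, 5})
print(sorted(sorted(b) for b in blocks))
```

In this example the cherry (1,2) is kept whole because both leaves lie in A, while the cherry (4,5) is broken into single leaves because its leaves lie on opposite sides of the split.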

Lemma 1

If P is a pattern distribution arising from a tree ${\mathcal {T}}$ under the general Markov model, the rank of the subflattening $\widehat{\mathrm{Flat}}'_{A|B}(P)$ is independent of the size and/or structure of any A|B-consistent subtrees of ${\mathcal {T}}$.

Proof

Consider the molecular state space $\kappa =\{1,2,\ldots ,k\}$ and a site pattern probability distribution $p_{i_1i_2\ldots i_L}$ on L taxa. Suppose this distribution arises under the general Markov model on the tree ${\mathcal {T}}$ and subsequently a time-instantaneous divergence event occurs causing, without loss of generality, a copy of the Lth taxon to be created. Under the usual assumptions of this model, this results in a new distribution $P^+=(p^+_{i_1i_2\ldots i_{L}i_{L+1}})_{i_j\in \kappa }$ on an $L+1$ taxon tree ${\mathcal {T}}^+$, with

$$\begin{aligned} \begin{aligned} p^+_{i_1i_2\ldots i_{L}i_{L+1}}= \left\{ \begin{array}{ll} p_{i_1i_2\ldots i_L}&{}\quad \text {if}\,i_{L}=i_{L+1};\\ 0,&{}\quad \text {otherwise.} \end{array} \right. \end{aligned} \end{aligned}$$

(10)
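The divergence rule (10) can be sketched numerically by storing the pattern distribution as an array with one axis per taxon. This is only an illustration: the NumPy representation, the function name `diverge_last_taxon` and the random test distribution are assumptions made here, not part of the article.

```python
import numpy as np


def diverge_last_taxon(P):
    """Apply rule (10): create a copy of the last taxon of the pattern tensor P."""
    k = P.shape[-1]
    P_plus = np.zeros(P.shape + (k,))
    for state in range(k):
        # Only patterns with equal states on the last two taxa keep their probability.
        P_plus[..., state, state] = P[..., state]
    return P_plus


# Hypothetical distribution on three taxa with k = 2 states.
rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum()
P_plus = diverge_last_taxon(P)
print(P_plus.shape, P_plus.sum())
```

The new array has one more axis than the old one and the same total probability, since every nonzero entry of $P^+$ is an entry of P.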

Consider a split A|B and suppose taxon L is contained in B. Consider the new split $A|B'$ where the new taxon $L+1$ has been adjoined to B to produce $B'=B\cup \{L+1\}$. We will show that the subflattening $\widehat{\text {Flat}}'_{A|B'}(P^+)$ is obtained from $\widehat{\text {Flat}}'_{A|B}(P)$ by simply repeating $k-1$ columns.

Let S be any $k\times k$ matrix consistent with the similarity transformation (4). In particular, this means that the kth row of S is constant, and, without loss of generality, we will assume this is a row of 1s, i.e. $S_{kj}=1$ for $j=1,2,\ldots , k$. We denote the application of this similarity transformation to the site pattern distribution as

$$\begin{aligned} \begin{aligned} q_{i_1i_2\ldots i_L}:=\sum _{j_1,j_2\ldots ,j_L\in \kappa }S_{i_1j_1}S_{i_2j_2}\ldots S_{i_Lj_L}p_{j_1j_2\ldots j_L}. \end{aligned} \end{aligned}$$

(11)

We will refer to these quantities as the “q-coordinates”.
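A minimal numerical sketch of (11) applies the same matrix S to every index of the pattern tensor. The function name `q_coordinates`, the random choice of S (with its last row set to 1s, as assumed in the text) and the random distribution are illustrative assumptions; the only property of S used here is that its kth row is constant.

```python
import numpy as np


def q_coordinates(P, S):
    """Apply S to every index of the pattern tensor P, as in (11)."""
    Q = P
    for axis in range(P.ndim):
        # Contract S with the current axis, then move the new axis back into place.
        Q = np.moveaxis(np.tensordot(S, Q, axes=(1, axis)), 0, axis)
    return Q


k = 4
rng = np.random.default_rng(0)
S = rng.normal(size=(k, k))
S[-1, :] = 1.0                    # kth row of S is a row of 1s
P = rng.random((k, k, k))         # hypothetical distribution on three taxa
P /= P.sum()
Q = q_coordinates(P, S)
print(Q[-1, -1, -1])              # the all-k coordinate is the total probability
```

Because the kth row of S consists of 1s, setting an index of q to k marginalises over the corresponding taxon, so the coordinate $q_{kk\ldots k}$ is the sum of all site pattern probabilities.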

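Now suppose, without loss of generality, $A|B=\{1,2,\ldots ,m\}|\{m+1,m+2,\ldots ,L\}$ with $n=L-m$, and write $q_{i_1i_2\ldots i_m,j_1j_2\ldots j_n}$ to emphasize the flattening corresponding to this split. After locating the rows and columns which define the form (5), the $(m(k-1)+1)\times (n(k-1)+1)$ entries of the subflattening are seen to be the q-coordinates in which at most one row index $i_a$ and at most one column index $j_b$ differs from $k$:

$$\begin{aligned} \left\{ q_{i_1i_2\ldots i_m,j_1j_2\ldots j_n}\,:\, i_a\ne k \text { for at most one } a,\ j_b\ne k \text { for at most one } b\right\} . \end{aligned}$$

(12)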
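The selection of rows and columns can be sketched in code as follows, using the indexing rule stated later in the proof (at most a single index different from k on each side of the split). The function names, the ordering of the rows and columns, and the random array standing in for the q-coordinates are assumptions made for illustration; the ordering does not affect the rank.

```python
import numpy as np


def index_tuples(count, k):
    """Index tuples of the given length with at most one entry different from state k."""
    last = k - 1                      # zero-based position of state k
    tuples = [(last,) * count]
    for pos in range(count):
        for state in range(k - 1):
            entry = [last] * count
            entry[pos] = state
            tuples.append(tuple(entry))
    return tuples


def subflattening(Q, m):
    """Rows are indexed by the first m taxa, columns by the remaining taxa."""
    k = Q.shape[0]
    rows = index_tuples(m, k)
    cols = index_tuples(Q.ndim - m, k)
    return np.array([[Q[r + c] for c in cols] for r in rows])


# Hypothetical q-coordinates on four taxa with k = 3 states, split {1}|{2,3,4}.
rng = np.random.default_rng(0)
Q = rng.random((3, 3, 3, 3))
sub = subflattening(Q, m=1)
print(sub.shape)
```

With m = 1, n = 3 and k = 3 the matrix has $m(k-1)+1=3$ rows and $n(k-1)+1=7$ columns, compared with the $3\times 27$ entries of the full flattening.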

We now consider the effect of the divergence rule (10) on the q-coordinates. Again we suppose that the divergence event occurs on the Lth taxon. As a consequence of (10), a short computation shows that

$$\begin{aligned} q^+_{i_1i_2\ldots i_m, j_{1}j_2\ldots j_{n-1}j_nj_{n+1} } =\sum _{j,j'\in \kappa }S_{j_n j}S_{j_{n+1}j}S^{-1}_{jj'}q_{i_1i_2\ldots i_m, j_1j_2\ldots j_{n-1}j'}. \end{aligned}$$
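The short computation can be spelled out as follows; this is a sketch, and the summation indices $a_r$ and $b_s$ are introduced here only for illustration. Expand $q^+$ using (11) on $L+1$ taxa, apply (10) to collapse the last two summation indices to a common value $j$, and then undo the transformation on the last index using $S^{-1}$:

$$\begin{aligned} q^+_{i_1\ldots i_m, j_1\ldots j_{n-1}j_nj_{n+1}}&=\sum _{a_1,\ldots ,a_m,b_1,\ldots ,b_{n+1}\in \kappa }S_{i_1a_1}\cdots S_{i_ma_m}S_{j_1b_1}\cdots S_{j_{n+1}b_{n+1}}\,p^+_{a_1\ldots a_mb_1\ldots b_nb_{n+1}}\\ &=\sum _{j\in \kappa }S_{j_nj}S_{j_{n+1}j}\sum _{a_1,\ldots ,a_m,b_1,\ldots ,b_{n-1}\in \kappa }S_{i_1a_1}\cdots S_{i_ma_m}S_{j_1b_1}\cdots S_{j_{n-1}b_{n-1}}\,p_{a_1\ldots a_mb_1\ldots b_{n-1}j}\\ &=\sum _{j,j'\in \kappa }S_{j_nj}S_{j_{n+1}j}S^{-1}_{jj'}\,q_{i_1\ldots i_m,j_1\ldots j_{n-1}j'}. \end{aligned}$$

The last step uses the fact that contracting the final index of the q-coordinates with $S^{-1}$ recovers the quantity in which that index is left untransformed, which is exactly the inner sum in the second line.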

To construct $\widehat{\text {Flat}}'_{A|B'}(P^+)$, we must consider three cases (recalling that we are assuming $S_{kj}=1$ for each $j=1,2,\ldots ,k$):

(i) Suppose $j_n=j_{n+1}= k$. Then

$$\begin{aligned} q^+_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}kk }=q_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}k}. \end{aligned}$$

(ii) Suppose $j_n=k$ and $j_{n+1}\ne k$. Then

$$\begin{aligned} q^+_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}kj_{n+1} }=q_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}j_{n+1}}. \end{aligned}$$

(iii) Suppose $j_n\ne k$ and $j_{n+1}= k$. Then

$$\begin{aligned} q^+_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}j_nk }=q_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}j_{n}}. \end{aligned}$$

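In particular, for each choice $j=1,2,\ldots ,k-1$, cases (ii) and (iii) give

$$\begin{aligned} q^+_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}kj}=q^+_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}jk}=q_{i_1i_2\ldots i_m,j_1j_2\ldots j_{n-1}j}. \end{aligned}$$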

Comparing to the general form (12), we see that the subflattening $\widehat{\text {Flat}}_{A|B'}'(P^+)$ is produced from the subflattening $\widehat{\text {Flat}}_{A|B}'(P)$ by simply repeating $k-1$ columns. This observation holds more generally, independently of which taxon the divergence event occurs on. The only modification needed is when the divergence happens on the left side of the split A|B, in which case the new subflattening is obtained from the old by a repetition of rows rather than columns. Thus, if we place a new taxon into the same side of the split as the taxon it diverged from, the rank of the subflattening is preserved.
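The column repetition can be checked numerically in the smallest case, where a two-taxon distribution diverges on taxon 2 and the copy is placed on the same side of the split. The sketch below is an illustration under assumptions made here: a random matrix stands in for the two-taxon distribution, S is random apart from its last row of 1s, and the column ordering is a choice of convenience.

```python
import numpy as np

k = 4
rng = np.random.default_rng(1)
S = rng.normal(size=(k, k))
S[-1, :] = 1.0                       # kth row of S is a row of 1s
P = rng.random((k, k))               # hypothetical two-taxon distribution
P /= P.sum()

Q = S @ P @ S.T                      # q-coordinates (11) on two taxa

# Divergence rule (10) applied to taxon 2, giving a three-taxon distribution.
P_plus = np.zeros((k, k, k))
for state in range(k):
    P_plus[:, state, state] = P[:, state]
Q_plus = np.einsum('ia,jb,kc,abc->ijk', S, S, S, P_plus)

# Columns of the subflattening for the split {1}|{2,3}: at most one index differs from k.
last = k - 1
columns = [Q_plus[:, last, last]]
columns += [Q_plus[:, j, last] for j in range(k - 1)]
columns += [Q_plus[:, last, j] for j in range(k - 1)]
sub_plus = np.column_stack(columns)

print(np.linalg.matrix_rank(Q), np.linalg.matrix_rank(sub_plus))
```

The last $k-1$ columns of `sub_plus` coincide with the preceding $k-1$ columns, so the rank agrees with that of the two-taxon matrix `Q`.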

We now apply Corollary 1 to conclude that, in the generic case, further application of (full rank) Markov matrices at the leaves of the phylogenetic tree ${\mathcal {T}}^+$ also does not affect the rank of the subflattening.

These observations establish the lemma. $\square $

Now in order to determine the rank of an arbitrary subflattening, we may repeatedly apply Lemma 1 to reduce to the case where each A|B-consistent subtree is a single leaf. Assuming this situation, each leaf is then either (i) not part of a cherry, or (ii) part of a cherry where the two leaves in the cherry lie on complementary sides of the split A|B. A key feature of this situation is that we can label the descendants of every vertex (excluding the root) with complementary binary labels such that the leaf labels are consistent with the split A|B. For our purposes, we then consider this reduced case as arising from a sequence of divergence events from the base two-taxa case where, after each divergence event at a leaf, the two descendants are placed into complementary sides of the target split A|B. An example illustrating that this process is always possible is given in Fig. 3.

We use this process to establish:

Lemma 2

Suppose ${\mathcal {T}}$ is a tree, suppose P is a distribution arising on ${\mathcal {T}}$ under the general Markov model, and suppose A|B is a split such that the maximally A|B-consistent subtrees are all leaves. Then, in the generic case, the subflattening $\widehat{\mathrm{Flat}}'_{A|B}(P)$ has maximal rank.

Proof

Suppose such a reduced tree has q-coordinates $q_{i_1i_2\ldots i_m,j_1j_2\ldots j_n}$, and the nth taxon in B diverges creating a new taxon which is adjoined to A to form the new split $A'|B$. Analogous to the previous situation, we have the new q-coordinates

$$\begin{aligned} q^+_{i_1i_2\ldots i_mi_{m+1},j_1j_2\ldots j_n}=\sum _{j,j'=1}^kS_{i_{m+1}j}S_{j_{n}j}S_{jj'}^{-1}q_{i_1i_2\ldots i_{m},j_1j_2\ldots j_{n-1}j'}. \end{aligned}$$

From this, we see that the additional $k-1$ rows in the subflattening $\widehat{\text {Flat}}'_{A'|B}(P^+)$ are obtained by setting $i_1=i_2=\cdots =i_m=k$, and taking $i_{m+1}=1,2,\ldots ,k-1$ in

$$\begin{aligned} q^+_{kk\ldots ki_{m+1},j_1j_2\ldots j_n}=\sum _{j,j'=1}^kS_{i_{m+1}j}S_{j_{n}j}S_{jj'}^{-1}q_{kk\ldots k,j_1j_2\ldots j_{n-1}j'}, \end{aligned}$$

where the columns are indexed by choosing $b\in \{1,2,\ldots ,n\}$ so that at most a single $j_b\ne k$ at a time. In particular, if we choose $j_1\ne k$ and $j_2=j_3=\cdots =j_n=k$ we have

$$\begin{aligned} q^+_{kk\ldots ki_{m+1},j_1kk\ldots k}=q_{kk\ldots k,j_1kk\ldots ki_{m+1}}. \end{aligned}$$

Now for each choice $i_{m+1}=1,2,\ldots , k-1$, this expression gives q-coordinates which do not appear in the subflattening $\widehat{\text {Flat}}_{A|B}'(P)$ or any of the other rows of $\widehat{\text {Flat}}'_{A'|B}(P^+)$. It follows that any linear dependencies between the new and remaining rows in $\widehat{\text {Flat}}'_{A'|B}(P^+)$ would imply linear constraints on the q-coordinates on the original $m+n$ taxon tree. In turn, this would imply the existence of linear phylogenetic invariants for the general Markov model, which are known not to exist (Hagedorn 2000). Therefore, the new rows appearing in $\widehat{\text {Flat}}'_{A'|B}(P^+)$ are linearly independent from the rest.

To complete the proof, we use induction on the base case of a two-taxon tree. To establish this base case, we show that, in the generic case, the two-taxon subflattening on the split $A|B=\{1\}|\{2\}$ has full rank $k=(k-1)+1$. This follows easily since, in the two-taxon case, the subflattening is equal to the transformed flattening, that is

$$\begin{aligned} \widehat{\text {Flat}}'_{\{1\}|\{2\}}(P)= \text {Flat}'_{\{1\}|\{2\}}(P)=S\,\text {Flat}_{\{1\}|\{2\}}(P)\,S^{T}, \end{aligned}$$

which follows from (11) with $L=2$. Thus, the subflattening is obtained from the flattening $\text {Flat}_{\{1\}|\{2\}}(P)$ by multiplying on the left and right by invertible matrices, so the two have the same rank. A standard argument shows that this flattening can be expressed as

$$\begin{aligned} \text {Flat}_{\{1\}|\{2\}}(P)=M_1D(\pi )M_2^T, \end{aligned}$$

where $D(\pi )$ is the diagonal matrix formed from the root distribution $\pi =(\pi _i)_{i\in \kappa }$. This matrix has full rank whenever $M_1$ and $M_2$ have full rank and $\pi $ has no zero entries. Thus, in the generic case, the two-taxon subflattening $\widehat{\text {Flat}}'_{\{1\}|\{2\}}(P)$ has full rank.
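The role of the genericity conditions can be illustrated with a small numerical sketch. The random Markov matrices (with unit column sums), the uniform root distribution and the degenerate root distribution below are hypothetical choices made for illustration only.

```python
import numpy as np


def random_markov(k, rng):
    """Random k x k matrix with non-negative entries and unit column sums."""
    M = rng.random((k, k))
    return M / M.sum(axis=0, keepdims=True)


k = 4
rng = np.random.default_rng(2)
M1, M2 = random_markov(k, rng), random_markov(k, rng)

# Generic case: every root state has positive probability.
pi = np.full(k, 1.0 / k)
flat = M1 @ np.diag(pi) @ M2.T

# Non-generic case: two root states have probability zero.
pi_degenerate = np.array([0.5, 0.5, 0.0, 0.0])
flat_degenerate = M1 @ np.diag(pi_degenerate) @ M2.T

print(np.linalg.matrix_rank(flat), np.linalg.matrix_rank(flat_degenerate))
```

When $\pi$ has zero entries, the corresponding columns of $M_1$ and $M_2$ drop out of the product and the rank falls accordingly, which is why the statement is restricted to the generic case.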

Induction on this base case establishes the lemma. $\square $

With these results in hand, Theorem 2 follows for arbitrary trees and splits by the following three steps:

(i) Apply Lemma 1 and clip off any A|B-consistent subtrees;
(ii) Apply Lemma 2; and
(iii) Use Fitch’s algorithm (Felsenstein 2004; Fitch 1971) to recognize that the minimum of the number of maximally A- and B-consistent subtrees is none other than the parsimony score for the split A|B considered as a binary character at the leaves of ${\mathcal {T}}$.
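For step (iii), the parsimony score of a split regarded as a binary character can be computed with the standard bottom-up pass of Fitch's algorithm. The sketch below is a minimal illustration for rooted binary trees; the nested-tuple encoding, the function name `fitch` and the five-taxon example are assumptions made here and do not come from the article.

```python
def fitch(tree, side):
    """Return (candidate state set, number of changes) for a rooted binary tree.

    `tree` is a leaf label or a pair of subtrees; `side` maps each leaf to 'A' or 'B'.
    """
    if not isinstance(tree, tuple):
        return {side[tree]}, 0
    left_set, left_cost = fitch(tree[0], side)
    right_set, right_cost = fitch(tree[1], side)
    common = left_set & right_set
    if common:
        # The children can agree on a state, so no change is needed at this vertex.
        return common, left_cost + right_cost
    # The children disagree, so one change is counted at this vertex.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical example: the rooted tree ((1,2),(3,(4,5))) with split {1,2,4}|{3,5}.
tree = ((1, 2), (3, (4, 5)))
side = {1: 'A', 2: 'A', 3: 'B', 4: 'A', 5: 'B'}
state_set, score = fitch(tree, side)
print(score)
```

In this example the maximally A-consistent subtrees have leaf sets {1,2} and {4}, and the maximally B-consistent subtrees have leaf sets {3} and {5}, so the minimum count is 2, in agreement with the parsimony score returned by the sketch.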

About this article

Sumner, J.G. Dimensional Reduction for the General Markov Model on Phylogenetic Trees. Bull Math Biol 79, 619–634 (2017). https://doi.org/10.1007/s11538-017-0249-6

Received: 11 August 2016
Accepted: 19 January 2017
Published: 10 February 2017
Issue Date: March 2017
DOI: https://doi.org/10.1007/s11538-017-0249-6
