Maximum Likelihood Estimation of Symmetric Group-Based Models via Numerical Algebraic Geometry
Abstract
Phylogenetic models admit polynomial parametrization maps in terms of the root distribution and transition probabilities along the edges of the phylogenetic tree. For symmetric continuous-time group-based models, Matsen studied the polynomial inequalities that characterize the joint probabilities in the image of these parametrizations (Matsen in IEEE/ACM Trans Comput Biol Bioinform 6:89–95, 2009). We employ this description for maximum likelihood estimation via numerical algebraic geometry. In particular, we explore an example where the maximum likelihood estimate does not exist, which would be difficult to discover without using algebraic methods.
Keywords
Phylogenetics, Group-based models, Maximum likelihood estimation, Real algebraic geometry, Numerical algebraic geometry, Algebraic statistics
1 Introduction
A phylogenetic tree is a rooted tree that depicts evolutionary relationships between species. A phylogenetic model is a statistical model describing the evolution of species on a phylogenetic tree. There is a discrete random variable associated with every vertex of the tree. The random variables associated with interior vertices are hidden and correspond to extinct species; the random variables associated with leaves are observed and correspond to extant species. The model parameters are the root distribution and the rate or transition matrices at the edges of the phylogenetic tree. There are different constraints on the model parameters depending on the phylogenetic model. The joint probabilities of random variables associated with leaves (leaf probabilities) are polynomials in the model parameters.
Cavender and Felsenstein (1987), and, separately, Lake (1987), introduced an algebraic approach to study phylogenetic models focusing on the search for phylogenetic invariants. A phylogenetic invariant of the model is a polynomial in the leaf probabilities which vanishes for every choice of model parameters. However, phylogenetic invariants alone do not describe the image of the parametrization map. One needs to include inequalities in order to obtain a complete description of the set of leaf probabilities corresponding to phylogenetic tree models.
This paper focuses on the study of continuous-time group-based models. In the rest of the paper, a phylogenetic model is always continuous-time unless stated otherwise. Transition matrices of continuous-time phylogenetic models come from continuous-time Markov processes and they are matrix exponentials of rate matrices. Rate matrices of group-based models have a special structure that is determined by an abelian group. A symmetric group-based model assumes that the rate matrices along every edge are symmetric. In particular, a symmetric group-based model can be a submodel of a nonsymmetric group-based model with extra symmetry conditions on rate matrices. The precise definitions are given in Sect. 2.
Generating sets for phylogenetic invariants for group-based models are described in Sturmfels and Sullivant (2005) and Casanellas et al. (2015). These papers consider discrete-time group-based models that require transition matrices to have a special structure determined by an abelian group, but they do not require transition matrices to be matrix exponentials of rate matrices. Generating sets derived in these papers are also valid under the continuous-time approach. However, the inequalities defining the two models differ, because the set of transition matrices is smaller under the continuous-time approach. A method for deriving the inequalities under the continuous-time approach is given in Matsen (2009, Proposition 3.5). We will explicitly derive the semialgebraic description of the leaf probabilities of the CFN model on the tripod tree \(K_{1,3}\).
Identifying the equation and inequality characterization of the leaf probabilities is only one part of the problem. Maximum likelihood estimation aims to find parameters that maximize the likelihood of observing the data for the given phylogenetic tree and phylogenetic model. Estimating the tree topology is another part of phylogenetic inference not considered here; see, for example, Dhar and Minin (2016) for a general overview of phylogenetic inference. Standard methods for the maximum likelihood estimation of the model parameters are the Newton–Raphson method (Schadt et al. 1998; Kenney and Gu 2012), quasi-Newton methods (Olsen et al. 1994) and the EM algorithm (Felsenstein 1981; Friedman et al. 2002; Holmes and Rubin 2002; Hobolth and Jensen 2005). It is shown in Steel (1994) and Chor et al. (2000) that likelihood functions on phylogenetic trees can have multiple local and global maxima, and thus none of the above methods can guarantee finding the global MLE, as these methods are hill-climbing methods. It is stated in Dhar and Minin (2016) that currently no optimization method can guarantee to solve the optimization of the likelihood function over model parameters.
We suggest an alternative method that theoretically gives the solution to the maximum likelihood estimation problem with probability one. This method is based on numerical algebraic geometry (Sommese and Wampler 2005; Bates et al. 2013). The main idea behind this method is to use a numerical algebraic geometry package to compute all critical points of a likelihood function and then choose the critical point with the highest likelihood value. A similar method has been previously applied in optimal control (Rostalski et al. 2011) and in the life sciences (Gross et al. 2016).
Since phylogenetic models are not necessarily compact, the MLE might not even exist. We will use the proposed method to study an example for which the MLE does not exist for the CFN model on the tripod \(K_{1,3}\) and a particular data vector. In this example, the global maximum is achieved when one of the model parameters goes to infinity. The nonexistence of the MLE would be very difficult to discover without the algebraic methods that we use in this paper, because standard numerical solvers output a solution close to the boundary of the model, as we will demonstrate by solving the same MLE problem in Mathematica. One should see the example for the CFN model on the tripod \(K_{1,3}\) as an illustration of a concept. It will be the subject of future work to develop a package that automates the computation in the phylogenetics setting, so that it can be easily used for studying further examples.
In Sect. 2, we introduce the preliminaries of phylogenetic models and present tools from Matsen (2009). Based on Matsen (2009), we state Proposition 3 in Sect. 3, which gives an algorithm for deriving the semialgebraic description of the leaf probabilities of a symmetric group-based model. A proof of Proposition 3 is given in “Appendix A”. Algorithm 1 in Sect. 4 outlines how to use numerical algebraic geometry to theoretically give the MLE with probability one. This algorithm is applied to the CFN model on the tripod in Example 5.
2 Preliminaries of Group-Based Models
The exposition in this section largely follows Matsen (2009). A phylogenetic tree T is a rooted tree with n labeled leaves and it represents the evolutionary relationship between different species. Its leaves correspond to current species and the internal nodes correspond to common ancestors. There is a discrete random variable \(X_v\) taking \(k \in \mathbb {N}\) possible values associated to each vertex v of the tree T. Typical values for k are two, four or twenty, corresponding to a binary feature, the number of nucleotides and the number of amino acids. For example, if \(k=4\), the random variable at a leaf takes the values A, C, G or T, according to the nucleotide observed in the DNA of the species corresponding to the leaf.
To define a group-based phylogenetic model, we first fix an abelian group \(\mathcal {G}\), a finite set of labels \(\mathcal {L}\) and a labeling function \(L:\mathcal {G} \rightarrow \mathcal {L}\). Let \(k=|\mathcal {G}|\). A rate matrix \(Q^{(e)}\) is a rate matrix in the group-based model if it satisfies \(Q^{(e)}_{g,h}=\psi ^{(e)}(h-g)\) for a vector \(\psi ^{(e)} \in \mathbb {R}^{{\mathcal {G}}}\) with \(\psi ^{(e)}(g_1)=\psi ^{(e)}(g_2)\) whenever \(L(g_1)=L(g_2)\). Hence transition matrices of the group-based model form a subset of all the transition matrices that satisfy \(P^{(e)}_{g,h}=f^{(e)}(h-g)\) for a probability vector \(f^{(e)} \in \mathbb {R}^{{\mathcal {G}}}\) with \(f^{(e)}(g_1)=f^{(e)}(g_2)\) whenever \(L(g_1)=L(g_2)\). This is because the matrix exponential is defined as \(e^M=\sum _{i=0}^{\infty } \frac{1}{i!} M^i\) and if a matrix M has the structure given by \(\mathcal {G},\mathcal {L}\) and L, then one can check that \(M^i\) also has the structure given by \(\mathcal {G},\mathcal {L}\) and L for all \(i \in \mathbb {N}\). The phylogenetic models we consider are symmetric, which means \(Q^{(e)}_{g,h}=Q^{(e)}_{h,g}\). In the case of group-based models, this is equivalent to \(L(g)=L(-g)\) for all \(g \in \mathcal {G}\).
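The closure property used in this paragraph can be checked numerically. The following is a minimal sketch, assuming the group \(\mathbb {Z}_2 \times \mathbb {Z}_2\) with elements encoded as 0..3 (so that \(h-g\) is bitwise XOR), hypothetical rate values a, b, c, and a truncated power series in place of an exact matrix exponential; all function names are illustrative.

```python
import math

def structured_matrix(psi):
    """Matrix M with M[g][h] = psi[h - g] over Z_2 x Z_2 (subtraction is XOR)."""
    n = len(psi)
    return [[psi[h ^ g] for h in range(n)] for g in range(n)]

def matmul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(m, terms=40):
    """Truncated series sum_{i < terms} m^i / i! (adequate for small rates)."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    power = [row[:] for row in result]
    for i in range(1, terms):
        power = matmul(power, m)
        for r in range(n):
            for c in range(n):
                result[r][c] += power[r][c] / math.factorial(i)
    return result

def has_group_structure(m, tol=1e-12):
    """True if m[g][h] depends only on h - g, i.e. m[g][h] == m[0][h ^ g]."""
    n = len(m)
    return all(abs(m[g][h] - m[0][h ^ g]) <= tol
               for g in range(n) for h in range(n))

# Hypothetical rates; psi(0) is chosen so that each row of Q sums to zero.
a, b, c = 0.3, 0.2, 0.1
psi = [-(a + b + c), a, b, c]
Q = structured_matrix(psi)
P = expm_series(Q)   # approximate transition matrix exp(Q)
f = P[0]             # the vector f with P[g][h] = f[h - g]
```

Under these assumptions, `has_group_structure` holds for `Q`, for its powers and for `P`, and the first row of `P` plays the role of the probability vector \(f^{(e)}\).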
We will assume that the root distribution \(\pi \) of a group-based model is uniform or the root distribution \(\pi \) is such that the matrix \(P \in \mathbb {R}^{\mathcal {G} \times \mathcal {G}}\) defined by \(P_{g,h}:=\pi (h-g)\) is a transition matrix in the group-based model (i.e., it is the exponential of a rate matrix in the group-based model). In the latter case, we add a new edge starting from the root and reroot the tree at the additional leaf. Instead of the previous root distribution, we use a new root distribution that puts all the mass at the identity and a new transition matrix which is the transition matrix P defined above. We will consider the new leaf as a hidden vertex while other leaves are considered as observed vertices. The same rerooting procedure is used in Sturmfels and Sullivant (2005) and Matsen (2009). This approach does not allow completely arbitrary root distributions. In particular, a root distribution has to satisfy \(\pi (g_1)=\pi (g_2)\) whenever \(L(g_1)=L(g_2)\) and it has to satisfy inequalities that guarantee that the transition matrix P defined by \(P_{g,h}:=\pi (h-g)\) is a matrix exponential of a rate matrix. The latter problem is called the embedding problem and is studied for \(2 \times 2\) matrices in Kingman (1962) and for the Kimura 3-parameter model in Roca-Lacostena and Fernández-Sánchez (2017). In Sturmfels and Sullivant (2005, Section 6), a workaround is described for deriving phylogenetic invariants for arbitrary root distributions for discrete-time group-based models. We will describe a workaround for deriving inequalities describing the CFN model for arbitrary root distributions; however, we do not know how to generalize this approach to other models.
The joint probability distributions \(p_{i_1,\ldots ,i_n}=\text {Pr}(X_1=i_1,\ldots ,X_n=i_n)\) at the n leaves can be written as polynomials in the root probabilities and in the entries of the transition matrices. Denote by \(\mathbf{p}\) the vector of joint probabilities \(p_{i_1,\ldots ,i_n}\). As is common in phylogenetic algebraic geometry, we will use the discrete Fourier transform for the groups \(\mathcal {G}\) and \(\mathcal {G}^n\) to study the set of transition matrices and the set of joint probabilities at the leaves for a given phylogenetic tree and a group-based model. The reason for this is that phylogenetic invariants are considerably simpler in the Fourier coordinates (see Sturmfels and Sullivant 2005).

The map from \(\{\psi ^{(e)}\}_{e \in E}\) to \(\{\check{\psi }^{(e)}\}_{e \in E}\) is given by the discrete Fourier transform of \(\mathcal {G}\). It is an invertible linear transformation given by the matrix K.
 The map from \(\{\check{\psi }^{(e)}\}_{e \in E}\) to \(\{\check{f}^{(e)}\}_{e \in E}\) is given by
$$\begin{aligned} \check{f}^{(e)}(g)=\exp (\check{\psi }^{(e)}(g)) \end{aligned}$$(2)
by (Matsen 2009, Lemma 2.2). It is an isomorphism between \(\mathbb {R}^{E \times \mathcal {G}}\) and \(\mathbb {R}_{>0}^{E \times \mathcal {G}}\).
 In the case when the root distribution puts all the mass at the identity, the map from \(\{\check{f}^{(e)}\}_{e \in E}\) to \(\mathbf{q}\) is given by
$$\begin{aligned} q_{\mathbf{g}}=\prod _{e \in E} \check{f}^{(e)}(^*g_e) \end{aligned}$$(3)
by (Székely et al. 1993, Theorem 3), where \(^*g_e=\sum _{i \in \Lambda (e)} g_i\) and \(\Lambda (e)\) is the set of observed leaves below e. See also (Sturmfels and Sullivant 2005, Sections 2 and 3) for a nice exposition of this result.
In the case of the uniform root distribution, the identity (3) holds whenever \(g_1+\cdots +g_n=0\). Otherwise \(q_{\mathbf{g}}=0\). This follows from (Sturmfels and Sullivant 2005, Lemma 4 and formula (12)).
On the domain \(\mathbb {R}_{>0}^{E \times \mathcal {G}}\), this map is injective: (Matsen 2009, Proposition 3.3 and Proposition 3.4) give a map from \(\mathbf{q}\) to \(\{[\check{f}^{(e)}]^2\}_{e \in E}\). Taking nonnegative square roots results in a left inverse to the map (3).

The map from \(\mathbf{q}\) to \(\mathbf{p}\) is given by the inverse of the discrete Fourier transform of \(\mathcal {G}^n\). It is an invertible linear transformation given by the matrix \(H^{-1}\), where H is the n-fold Kronecker product of the matrix K.
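The chain of maps described in this list can be traced on a small case. The sketch below assumes the CFN model (\(\mathcal {G}=\mathbb {Z}_2\)) on the tripod with a uniform root distribution, rate vectors \(\psi ^{(e)}=(-a_e,a_e)\) with hypothetical rates \(a_e\), and K taken to be the \(2 \times 2\) matrix with rows (1, 1) and (1, -1); the choice of K is an assumption here, since the matrix is not displayed in the text. It compares the leaf probabilities obtained through the Fourier coordinates with a direct sum over the hidden root state.

```python
import itertools
import math

def fourier_leaf_probabilities(rates):
    """q_g = prod_e f_check^(e)(g_e) if g_1 + g_2 + g_3 = 0 in Z_2, else 0."""
    # psi^(e) = (-a, a)  ->  psi_check^(e) = K psi^(e) = (0, -2a)
    psi_check = [(0.0, -2.0 * a) for a in rates]
    # f_check^(e)(g) = exp(psi_check^(e)(g)), as in identity (2)
    f_check = [tuple(math.exp(v) for v in pc) for pc in psi_check]
    q = {}
    for g in itertools.product((0, 1), repeat=3):
        if sum(g) % 2 == 0:
            q[g] = math.prod(f_check[e][g[e]] for e in range(3))
        else:
            q[g] = 0.0
    return q

def inverse_fourier(q):
    """p = H^{-1} q, where H is the 3-fold Kronecker power of K and H^{-1} = H / 8."""
    p = {}
    for i in itertools.product((0, 1), repeat=3):
        p[i] = sum((-1) ** sum(gk * ik for gk, ik in zip(g, i)) * qg
                   for g, qg in q.items()) / 8.0
    return p

def direct_leaf_probabilities(rates):
    """Sum over the hidden root state, using the closed form of exp(Q) for CFN."""
    def transition(a, r, s):
        theta = math.exp(-2.0 * a)
        return (1.0 + theta) / 2.0 if r == s else (1.0 - theta) / 2.0
    p = {}
    for i in itertools.product((0, 1), repeat=3):
        p[i] = sum(0.5 * math.prod(transition(a, r, s) for a, s in zip(rates, i))
                   for r in (0, 1))
    return p

rates = (0.1, 0.25, 0.4)  # hypothetical rate parameters, one per edge
q = fourier_leaf_probabilities(rates)
p_fourier = inverse_fourier(q)
p_direct = direct_leaf_probabilities(rates)
```

Under these assumptions the two computations of \(\mathbf{p}\) agree up to rounding, and \(q_{000}=1\) as required by the normalization of the leaf probabilities.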
Example 1
3 Implicit Descriptions of Symmetric Group-Based Models
Phylogenetic invariants are polynomials that vanish at joint probabilities at leaves for a given model and tree. They were introduced in Cavender and Felsenstein (1987) and Lake (1987) and have been characterized for group-based phylogenetic models in (Sturmfels and Sullivant 2005, Theorem 1). Phylogenetic varieties are algebraic varieties derived from phylogenetic models and were first introduced in Allman and Rhodes (2003, 2004). In this paper, an algebraic variety is not necessarily irreducible. Phylogenetic invariants are elements of the ideal of a phylogenetic variety. Specifying a system of generators of the ideal of a phylogenetic variety is an important problem in phylogenetic algebraic geometry. However, the set of probability distributions forms only a (real, semialgebraic) subset of the phylogenetic variety; therefore, providing a complete system of generators might have no biological interest. In Casanellas et al. (2015), a minimal set of phylogenetic invariants is constructed that defines the intersection of a phylogenetic variety with a Zariski open set. In the case of the Kimura 3-parameter model, all the leaf probabilities that are images of real parameters in the phylogenetic model (not in the complexification of the model) lie in this Zariski open set. The number of polynomials in this set is equal to the codimension of the phylogenetic variety and each polynomial has degree at most \(|\mathcal {G}|\). This drastically reduces the number of phylogenetic invariants used: for the Kimura 3-parameter model on a quartet tree, it drops from 8002 generators of the ideal to the 48 polynomials described in (Casanellas and Fernández-Sánchez 2008, Example 4.9).
Besides phylogenetic invariants, polynomial inequalities are needed to give an exact characterization of joint probabilities at leaves for a given model and a tree. For general symmetric group-based models, polynomial inequalities that describe joint probabilities at leaves are studied in Matsen (2009). We recall (Matsen 2009, Propositions 3.3 and 3.4) that give the left inverse to the map (3) on the domain \(\mathbb {R}^{E \times \mathcal {G}}_{>0}\).
Proposition 1
Proposition 2
The next proposition will summarize the procedure in Matsen (2009) to construct inequalities that describe joint probabilities. We will denote by \((K^{-1})_{g,:}\) the row of the matrix \(K^{-1}\) labeled by g and by \((\check{f}^{(e)})^{(K^{-1})_{g,:}}\) the Laurent monomial \(\prod _{h \in \mathcal {G}} (\check{f}^{(e)}(h))^{(K^{-1})_{g,h}}\).
Proposition 3
 (i)
The constraints for \(\{\check{\psi }^{(e)}\}_{e \in E}\) are obtained by substituting \(\psi ^{(e)}\) by \(K^{-1} \check{\psi }^{(e)}\) in the constraints for \(\{\psi ^{(e)}\}_{e \in E}\). In particular, this gives \(\check{\psi }^{(e)}(0)=0\), \((K^{-1} \check{\psi }^{(e)})(g_1)=(K^{-1} \check{\psi }^{(e)})(g_2)\) whenever \(L(g_1)=L(g_2)\) and \((K^{-1} \check{\psi }^{(e)})(g) \ge 0\) for all nonzero \(g \in \mathcal {G}\).
 (ii)
The constraints for \(\{\check{f}^{(e)}\}_{e \in E}\) are \(\check{f}^{(e)}(0)=1\), \((\check{f}^{(e)})^{(K^{-1})_{g_1,:}}=(\check{f}^{(e)})^{(K^{-1})_{g_2,:}}\) whenever \(L(g_1)=L(g_2)\), \((\check{f}^{(e)})^{(K^{-1})_{g,:}} \ge 1\) for all nonzero \(g \in \mathcal {G}\) and \(\check{f}^{(e)}(g) > 0\) for all \(g \in \mathcal {G}\). These equations and inequalities are equivalent to \(\check{f}^{(e)}(0)=1\), \((\check{f}^{(e)})^{(K^{-1})_{g_1,:}}=(\check{f}^{(e)})^{(K^{-1})_{g_2,:}}\) whenever \(L(g_1)=L(g_2)\), \((\check{f}^{(e)})^{2(K^{-1})_{g,:}} \ge 1\) for all nonzero \(g \in \mathcal {G}\) and \(\check{f}^{(e)}(g) > 0\) for all \(g \in \mathcal {G}\). Here we have squared the inequalities \((\check{f}^{(e)})^{(K^{-1})_{g,:}} \ge 1\).
 (iii)
The constraints for \(\mathbf{q}\) are given by phylogenetic invariants, the equation \(q_{00\ldots 0}=1\), the inequalities \(\mathbf{q} > 0\) and the inequalities that are obtained by substituting the expressions for \([\check{f}^{(e)}]^2\) in Propositions 1 and 2 into the inequalities \((\check{f}^{(e)})^{2(K^{-1})_{g,:}} \ge 1\) in the previous item.
 (iv)
The constraints for \(\mathbf{p}\) are obtained by substituting \(\mathbf{q}\) by \(H \mathbf{p}\) in the constraints for \(\mathbf{q}\).
For the sake of completeness, a proof of Proposition 3 is given in “Appendix A”.
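As an illustration of item (ii), consider a single edge of the CFN model. This is a sketch under the assumptions \(\mathcal {G}=\mathbb {Z}_2\), \(K=\begin{pmatrix} 1 & 1\\ 1 & -1 \end{pmatrix}\) (so that \(K^{-1}=\frac{1}{2}K\)) and \(\psi ^{(e)}=(-a,a)\) with \(a \ge 0\); it is not the full description for a tree. The row of \(K^{-1}\) labeled by \(g=1\) is \((\frac{1}{2},-\frac{1}{2})\), hence
$$\begin{aligned} (\check{f}^{(e)})^{(K^{-1})_{1,:}}=\left( \frac{\check{f}^{(e)}(0)}{\check{f}^{(e)}(1)}\right) ^{1/2} \ge 1 \quad \Longleftrightarrow \quad (\check{f}^{(e)})^{2(K^{-1})_{1,:}}=\frac{\check{f}^{(e)}(0)}{\check{f}^{(e)}(1)} \ge 1. \end{aligned}$$
Together with \(\check{f}^{(e)}(0)=1\) and \(\check{f}^{(e)}(1)>0\), this reads \(0<\check{f}^{(e)}(1)\le 1\), in agreement with \(\check{f}^{(e)}(1)=\exp (\check{\psi }^{(e)}(1))=\exp (-2a)\).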
Remark 1
In Proposition 3 item (iii), one applies Propositions 1 and 2 to obtain inequalities in the Fourier coordinates. However, in Propositions 1 and 2 one has a choice of leaf vertices. Since the Fourier coordinates are strictly positive, any choice of leaf vertices in Propositions 1 and 2 gives equivalent inequalities in Proposition 3 item (iii), and it does not matter which choice we make.
Example 2
Remark 2
Identifiability of parameters of a phylogenetic model means that if for a fixed tree two sets of parameters map to the same joint probabilities at leaves, then these sets of parameters must be equal. Generic identifiability means that this statement is true with probability one. The identifiability of the CFN model was shown in (Hendy 1991, Theorem 1), of the Kimura 3-parameter model in (Steel et al. 1998, Theorem 7) and the generic identifiability of the general Markov model in Chang (1996). The identifiability of any group-based model follows also from the proof of Proposition 3, since each of the maps in (1) is an isomorphism in the region we are interested in.
Corollary 1
Consider a symmetric group-based model. Any \(\mathbf{p}\) satisfying the equations and inequalities described in Proposition 3 that satisfies one of the inequalities with equality comes from a parametrization with an off-diagonal zero in the rate matrix \(Q^{(e)}\) for some \(e \in E\).
Proof
There are two different kinds of inequalities in item (iv) of Proposition 3. The strict inequalities can never be satisfied with equality. The nonstrict inequalities in each step are obtained by substituting the inverse map into the inequalities in the previous step. Hence \(\mathbf{p}\) satisfies one of the nonstrict inequalities with equality if and only if it has a preimage \(\{\psi ^{(e)}\}_{e \in E}\) that satisfies one of the inequalities \(\psi ^{(e)}(g) \ge 0\) with equality. \(\square \)
Example 3
We consider the CFN model. A joint probability vector \(\mathbf{p}\) satisfying the assumptions of Corollary 1 has in its parametrization the rate matrix \(Q^{(e)}=\begin{pmatrix} 0 &{} 0\\ 0 &{} 0 \end{pmatrix}\) for some \(e \in E\). The transition matrix corresponding to the same edge is \(P^{(e)}=\begin{pmatrix} 1 &{} 0\\ 0 &{} 1 \end{pmatrix}\).
4 Maximum Likelihood Estimation via Numerical Algebraic Geometry
Example 4
In (Hosten et al. 2005, Example 14), maximum likelihood estimation on the Zariski closure of the CFN model on \(K_{1,3}\) is considered. This is the model that is defined by the equations in Example 2. For generic data, the number of complex critical points of the likelihood function on the Zariski closure of a model is called the ML degree. It is shown in (Hosten et al. 2005, Example 14) that the ML degree of the CFN model on \(K_{1,3}\) is 92. Using tools from numerical algebraic geometry, one can compute the 92 critical points and among the real critical points choose the one that gives the maximal value of the log-likelihood function.
However, the MLE can lie on the boundary of a statistical model or even not exist. Neither of these can be detected by considering only the Zariski closure of the model. We will see the latter happening for the CFN model on \(K_{1,3}\) in Example 5.
In practice, the MLE is computed using numerical methods such as the Newton–Raphson method (Schadt et al. 1998; Kenney and Gu 2012), quasi-Newton methods (Olsen et al. 1994) and the EM algorithm (Felsenstein 1981; Friedman et al. 2002; Holmes and Rubin 2002; Hobolth and Jensen 2005). However, since these methods are hill-climbing methods and the likelihood function on phylogenetic trees can have multiple local maxima (Steel 1994; Chor et al. 2000), they are only guaranteed to give a local maximum or a saddle point of the log-likelihood function and not necessarily the global maximum. Usually one uses a heuristic to find a good initialization for these methods or runs them for different starting points and chooses the output that maximizes the log-likelihood function.
We suggest a global method based on numerical algebraic geometry that theoretically gives the solution to the maximum likelihood estimation problem on phylogenetic trees with probability one. The main idea behind numerical algebraic geometry is homotopy continuation. Homotopy continuation finds isolated complex solutions of a system of polynomial equations starting from the known solutions of another system of polynomial equations. Numerical algebraic geometry methods give theoretically correct results with probability one, meaning that bad phenomena can happen only when certain parameters are chosen from a measure zero set. An introduction to numerical algebraic geometry can be found in Sommese and Wampler (2005) and Bates et al. (2013). In our context, the system of polynomial equations that we wish to solve comes from the Karush–Kuhn–Tucker (KKT) conditions (Karush 1939; Kuhn and Tucker 1951) for the optimization problem that maximizes the likelihood function on a phylogenetic model. The set of solutions of this polynomial system contains all the critical points of the likelihood function. The global maximum of the likelihood function is the solution of the polynomial system that maximizes the likelihood function among all the solutions that lie in the model.
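The pattern "compute all complex critical points, keep the real ones that lie in the model, choose the one with the largest likelihood" can be sketched on a toy problem. The example below is not a phylogenetic model: it assumes a hypothetical one-parameter model \(p_i(t)=a_i+b_i t\) on three outcomes and a hypothetical data vector, so that the score equation becomes a quadratic after clearing denominators. The closed-form quadratic formula stands in for a homotopy continuation solver.

```python
import cmath
import math

# Hypothetical one-parameter model p_i(t) = A[i] + B[i] * t on three outcomes.
A = (0.2, 0.3, 0.5)    # sums to 1
B = (1.0, -0.4, -0.6)  # sums to 0, so the p_i(t) sum to 1 for every t
U = (30, 20, 50)       # hypothetical data vector of counts

def probs(t):
    return [a + b * t for a, b in zip(A, B)]

def log_likelihood(t):
    return sum(u * math.log(p) for u, p in zip(U, probs(t)))

def score_polynomial():
    """Coefficients (c0, c1, c2) of sum_i u_i b_i prod_{j != i} (a_j + b_j t)."""
    c0 = c1 = c2 = 0.0
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        w = U[i] * B[i]
        c0 += w * A[j] * A[k]
        c1 += w * (A[j] * B[k] + A[k] * B[j])
        c2 += w * B[j] * B[k]
    return c0, c1, c2

def all_critical_points():
    """All complex roots of the score polynomial (a quadratic, solved in closed form)."""
    c0, c1, c2 = score_polynomial()
    disc = cmath.sqrt(c1 * c1 - 4.0 * c2 * c0)
    return [(-c1 + disc) / (2.0 * c2), (-c1 - disc) / (2.0 * c2)]

def best_feasible_critical_point(tol=1e-9):
    """Keep real roots with all p_i(t) > 0; return the one with the largest log-likelihood."""
    feasible = [z.real for z in all_critical_points()
                if abs(z.imag) <= tol and all(p > 0 for p in probs(z.real))]
    return max(feasible, key=log_likelihood) if feasible else None

t_hat = best_feasible_critical_point()
```

In this toy case both critical points are real, but only one of them gives positive probabilities; the other is discarded in the feasibility filter, which mirrors the role of the inequalities in the phylogenetic setting.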
This global approach for solving a nonconvex optimization problem on a set that is described by polynomial equations and inequalities has been previously employed in optimal control (Rostalski et al. 2011) and in the life sciences (Gross et al. 2016). Our setup and algorithm are similar to those in Rostalski et al. (2011), although we provide further lemmas that allow us to decompose the system of polynomial equations that we want to solve into simpler systems of polynomial equations. The article Gross et al. (2016) uses Fritz John conditions instead of KKT conditions and focuses mostly on optimization problems on sets that are described by polynomial equations only. Sets that are described by polynomial equations and inequalities are considered in Section 3 of the supplementary material of Gross et al. (2016). In particular, the ideas for Theorem 1 and Remark 3 appear there.
Theorem 1
The idea behind Theorem 1 is that instead of optimizing a function over a semialgebraic set, one can optimize the function over the Zariski closure of the semialgebraic set and the Zariski closures of each of the boundaries of the semialgebraic set. This concept is discussed in Section 3 of the supplementary material of Gross et al. (2016).
Proof
First take an element \((\mu , \lambda , x)\) of V(L). Let S be such that \(G_i(x)=0\) for all \(i \in S\). Then \((\mu _S, \lambda , x) \in V(L_S)\), where \(\mu _S\) is the projection of \(\mu \) to the coordinates in S. Conversely, let \((\mu _S, \lambda , x) \in V(L_S)\). Let \(\mu \in \mathbb {C}^m\) be such that \(\mu _i=(\mu _S)_i\) for \(i\in S\) and \(\mu _i=0\) otherwise. Then \((\mu , \lambda , x) \in V(L)\).
We have shown that \(\pi _x (V(L))=\cup \pi _x(V(L_S))\), where \(\pi _x\) is the projection of \((\mu , \lambda , x)\) or \((\mu _S, \lambda , x)\) on x. By the Closure Theorem (Cox et al. 1992, Theorem 3.2.3), \(V(L \cap \mathbb {C}[x])\) is the smallest algebraic variety containing \(\pi _x (V(L))\) and \(V(L_S \cap \mathbb {C}[x])\) is the smallest algebraic variety containing \(\pi _x(V(L_S))\). The inclusion \(V(L \cap \mathbb {C}[x]) \subseteq \cup V(L_S \cap \mathbb {C}[x])\) holds, because the right-hand side is a variety and contains \(\cup \pi _x(V(L_S))\) and hence \(\pi _x (V(L))\). On the other hand, since \(\pi _x(V(L_S)) \subseteq \pi _x (V(L))\) for every S, also \(V(L_S \cap \mathbb {C}[x]) \subseteq V(L \cap \mathbb {C}[x])\) for every S. Hence \(V(L \cap \mathbb {C}[x])=\cup V(L_S \cap \mathbb {C}[x])\). \(\square \)
Corollary 2
If V(L) is finite and the global maxima of the optimization problem (49) satisfy CRCQ, then Algorithm 1 outputs the global maxima.
Proof
Theorem 1 implies that \(V(L \cap \mathbb {C}[x])=\cup V(L_S \cap \mathbb {C}[x])\). The variety V(L) being finite implies that \(V(L \cap \mathbb {C}[x])\) and hence all \(V(L_S \cap \mathbb {C}[x])\) are finite. Hence after Step 2, the list \(\mathcal {C}\) contains all solutions of Eqs. (50), (52) and (54) in the KKT conditions. Since the global maxima satisfy the CRCQ, they must be solutions of these equations. By choosing among the real solutions that satisfy inequalities (51) and (53) in the KKT conditions the ones that maximize the value of the cost function F, we get the global maxima. \(\square \)
One of the reasons why the variety \(V(L_S)\) in Step 2 of Algorithm 1 might not be finite is that the Lagrange conditions for MLE might be satisfied by higher-dimensional components where some variable is identically zero. For MLE, Gross and Rodriguez have defined a modification of the Lagrange conditions, known as Lagrange likelihood equations (Gross and Rodriguez 2014, Definition 2), whose solution set does not contain solutions with some variable equal to zero if the original data does not contain zeros (Gross and Rodriguez 2014, Proposition 1). However, the Lagrange likelihood equations can be applied only to homogeneous prime ideals. This motivates us to study Lagrange conditions for decompositions of ideals.
Lemma 1
Assume that the ideal \(J=\langle G_i: i = 1,\ldots , m \rangle \) decomposes as \(J=J_1 \cap J_2\), where \(J_1=\langle G^{(1)}_j: j = 1,\ldots , m_1 \rangle \) and \(J_2=\langle G^{(2)}_k: k = 1,\ldots , m_2 \rangle \). If \(x^*\) satisfies the Lagrange conditions for the optimization problem max F(x) subject to \(G_i(x)=0\) for \(i=1,\ldots ,m\), then \(x^*\) satisfies the Lagrange conditions for the optimization problem max F(x) subject to \(G^{(1)}_j(x)=0\) for \(j=1,\ldots ,m_1\) or for the optimization problem max F(x) subject to \(G^{(2)}_k(x)=0\) for \(k=1,\ldots ,m_2\).
Proof
Lemma 2
Let \(J=J_1 \cap J_2\) and \(K=K_1 \cap K_2\). If \(x^*\) satisfies the Lagrange conditions for the optimization problem max F(x) subject to the generators of \(J+K\), then \(x^*\) satisfies the Lagrange conditions for one of the optimization problems max F(x) subject to the generators of \(J_j+K_k\), where \(j,k \in \{1,2\}\).
Proof
Lemma 1 suggests that if S is a singleton in Step 2 of Algorithm 1, then we can replace the ideal \(L_S\) of Lagrange conditions for \(I_S\) in Step 2 of Algorithm 1 by the ideals of Lagrange conditions for minimal primes of \(I_S\). If \(S=\{i_1,\ldots ,i_{|S|}\}\), then \(I_S=I_{\{i_1\}}+\ldots +I_{\{i_{|S|}\}}\). Hence by Lemmas 1 and 2, we can replace the ideal \(L_S\) by the ideals of Lagrange conditions for the sum of minimal primes of \(I_{\{i_j\}}\), where \(1\le j \le |S|\).
Remark 3
As discussed in Section 3.2 of the supplementary material to Gross et al. (2016), one can ignore all the components where one of the constraints is \(x_k=0\) or the sum of some variables is zero. If one of the variables is zero, then the value of the log-likelihood function is \(-\infty \). If the sum of some variables is zero, then all of them have to be zero, because none of them can be negative.
Remark 4
In practice, it is crucial to know the degrees of the ideals \(L_S\) of Lagrange conditions. We recall that these degrees are also known as ML degrees. Although in theory, polynomial homotopy continuation finds all solutions of a system of polynomial equations with probability one, in practice, this can depend on the settings of the program. Without knowing the ML degree, there is no guarantee that any numerical method finds all critical points. For the CFN model on \(K_{1,3}\), we experimented with Bertini (Bates et al. 2006), the NumericalAlgebraicGeometry package in Macaulay2 (Leykin 2011) and PHCpack (Verschelde 1999). We ran these three programs with default settings to find the critical points of the log-likelihood function on the Zariski closure of the CFN model on \(K_{1,3}\). For our example, only PHCpack found all 92 critical points discussed in Example 4.
Example 5
To find the MLE, we have to consider three different optimization problems corresponding to the three different cases in Example 2. In each of the cases, we relax the implicit characterization given in Example 2 by replacing strict inequalities with nonstrict inequalities. Specifically, in the first case, the polynomials \(G_i\) are given by the left-hand sides of the inequalities (9)–(19) and the polynomials \(H_j\) are given by the left-hand sides of Eqs. (5)–(8); in the second case, the polynomials \(G_i\) are given by the left-hand sides of the inequalities (24)–(34) and the polynomials \(H_j\) are given by the left-hand sides of Eqs. (20)–(23); in the third case, the polynomials \(G_i\) are given by the left-hand sides of the inequalities (41), (43), (44), (46)–(48) and the polynomials \(H_j\) are given by the left-hand sides of Eqs. (35)–(40), (42) and (45). We apply the modified version of Algorithm 1 that uses the output of Algorithm 2 in Step 2. It is enough to run Algorithm 2 and Step 2 of Algorithm 1 for the first optimization problem only, as the polynomials \(G_i\) and \(H_j\) are the same for the first two optimization problems; in the third optimization problem there is one polynomial fewer and some polynomials \(G_i\) are among the polynomials \(H_j\), but all ideals considered in Algorithm 2 and Step 2 of Algorithm 1 for the third optimization problem are among the ideals for the first optimization problem. In Step 3 we have to check whether elements satisfy \(G_i(x) \ge 0\) and \(H_j(x)=0\) for any of the three optimization problems. The code for this example can be found at the link:
https://github.com/kaiekubjas/phylogenetics
Table summarizing different boundary components
Dim I    Degree L    # of ideals
5        92          1
4        9           4
4        1           8
3        1           24
2        1           6
1        1           1
Total    167         44
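The totals in the last row can be checked from the rows above, reading the degree column as a per-ideal count (this reading is our interpretation of the table):
$$\begin{aligned} 92\cdot 1+9\cdot 4+1\cdot 8+1\cdot 24+1\cdot 6+1\cdot 1=167, \qquad 1+4+8+24+6+1=44. \end{aligned}$$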
Critical points with the highest values of the log-likelihood function
p                                                    \(l_{u}\)        MLE
(.183, .051, .256, .055, .147, .053, .204, .052)     \(-\)188.451     No
(.183, .049, .243, .065, .156, .042, .207, .055)     \(-\)188.722     No
(.191, .053, .243, .042, .156, .065, .199, .051)     \(-\)188.803     No
(.165, .05, .23, .055, .165, .05, .23, .055)         \(-\)188.927     No
(.17, .045, .225, .06, .17, .045, .225, .06)         \(-\)189.042     No
(.174, .059, .221, .046, .174, .059, .221, .046)     \(-\)189.303     No
(.22, .05, .22, .05, .175, .055, .175, .055)         \(-\)189.488     Yes
Remark 5
In Example 5, we chose the rate parameters of the true data-generating distribution so that its joint leaf probabilities are close to the boundary of the model. In particular, the Fourier leaf probabilities \(q_{010},q_{011},q_{110},q_{111}\) are almost zero. Recall that the semialgebraic description of the CFN model includes the strict inequalities \(\mathbf q >0\). The global maximum of the likelihood function on the closure of the CFN model on \(K_{1,3}\) satisfies \(q_{010}=q_{011}=q_{110}=q_{111}=0\). Since this global maximum is not in the model, the MLE does not exist. We expect a similar phenomenon in general: if the true data-generating distribution is close to the boundary, then the MLE fails to exist with nonzero probability. In particular, if the normalized data vector lies on the part of the boundary that is not in the model, then we know that the MLE does not exist.
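The boundary test in this remark can be illustrated with a small check of the strict inequalities \(\mathbf q >0\). The numerical values below are hypothetical and only mimic the pattern \(q_{010}=q_{011}=q_{110}=q_{111}=0\); the function names and the tolerance are assumptions, and the remaining constraints of the model are not tested.

```python
from typing import Dict, List


def zero_coordinates(q: Dict[str, float], tol: float = 1e-9) -> List[str]:
    """Labels of the Fourier coordinates that vanish up to the tolerance."""
    return sorted(label for label, value in q.items() if abs(value) <= tol)


def satisfies_strict_positivity(q: Dict[str, float], tol: float = 1e-9) -> bool:
    """True only if every Fourier coordinate is strictly positive."""
    return all(value > tol for value in q.values())


# Hypothetical maximizer over the closure of the model (illustrative values only).
q_hat = {
    "000": 1.0, "001": 0.4, "100": 0.3, "101": 0.12,
    "010": 0.0, "011": 0.0, "110": 0.0, "111": 0.0,
}

print(zero_coordinates(q_hat))             # ['010', '011', '110', '111']
print(satisfies_strict_positivity(q_hat))  # False
```

A maximizer of this form lies on the boundary of the closure but violates \(\mathbf q >0\), which is the situation in which the MLE does not exist.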
Notes
Acknowledgements
We thank Elizabeth Allman, Taylor Brysiewicz, Marta Casanellas, Alexander Davie, Jesús Fernández-Sánchez, Serkan Hosten, Jordi Roca-Lacostena, Bernd Sturmfels, and Piotr Zwiernik for helpful discussions and comments on earlier versions of this manuscript. We thank Aki Malinen whose code considerably simplified doing the computations and helped us to correct a mistake in the maximum likelihood section. We thank the editor and two anonymous reviewers for their valuable comments that helped to improve the manuscript.
References
 Allman ES, Rhodes JA (2003) Phylogenetic invariants for the general Markov model of sequence mutation. Math Biosci 186:113–144MathSciNetCrossRefGoogle Scholar
 Allman ES, Rhodes JA (2004) Quartets and parameter recovery for the general Markov model of sequence mutation. Appl Math Res Express 4:107–131MathSciNetCrossRefGoogle Scholar
 Bates DJ, Hauenstein JD, Sommese AJ, Wampler CW (2006) Bertini: software for numerical algebraic geometry. Available at bertini.nd.edu. https://doi.org/10.7274/R0H41PB5
 Bates DJ, Hauenstein JD, Sommese AJ, Wampler CW (2013) Numerically solving polynomial systems with Bertini. SIAM, PhiladelphiazbMATHGoogle Scholar
 Cavender JA (1978) Taxonomy with confidence. Math Biosci 40:271–280MathSciNetCrossRefGoogle Scholar
 Cavender JA, Felsenstein J (1987) Invariants of phylogenies in a simple case with discrete states. J Classif 4:57–71CrossRefGoogle Scholar
 Casanellas M, FernándezSánchez J (2008) Geometry of the Kimura 3parameter model. Adv Appl Math 41:265–292MathSciNetCrossRefGoogle Scholar
 Casanellas M, FernándezSánchez J, Michałek M (2015) Low degree equations for phylogenetic groupbased models. Collect Math 66:203–225MathSciNetCrossRefGoogle Scholar
 Chang JT (1996) Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math Biosci 137:51–73MathSciNetCrossRefGoogle Scholar
 Chor B, Hendy MD, Holland BR, Penny D (2000) Multiple maxima of likelihood in phylogenetic trees: an analytic approach. Mol Biol Evol 17:1529–1541CrossRefGoogle Scholar
 Cox D, Little J, O’Shea D (1992) Ideals, varieties, and algorithms, Undergraduate texts in mathematics 3. Springer, New YorkCrossRefGoogle Scholar
 Dhar A, Minin VN (2016) Maximum likelihood phylogenetic inference. In: Kliman RM (ed) Encyclopedia of evolutionary biology. Academic Press, Oxford, pp 499–506CrossRefGoogle Scholar
 Evans SN, Speed TP (1993) Invariants of some probability models used in phylogenetic inference. Ann Stat 21:355–377MathSciNetCrossRefGoogle Scholar
 Farris JS (1973) A probability model for inferring evolutionary trees. Syst Zool 22:250–256CrossRefGoogle Scholar
 Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–76CrossRefGoogle Scholar
 Friedman N, Ninio M, Pe’er I, Pupko T (2002) A structural EM algorithm for phylogenetic inference. J Comput Biol 9:331–53CrossRefGoogle Scholar
 Gross E, Davis B, Ho KL, Bates DJ, Harrington HA (2016) Numerical algebraic geometry for model selection and its application to the life sciences. J R Soc Interface 13:1–9CrossRefGoogle Scholar
 Gross E, Rodriguez JI (2014) Maximum likelihood geometry in the presence of data zeros. In: Proceedings of the 39th international symposium on symbolic and algebraic computation, ACM, pp 232–239Google Scholar
 Hendy MD (1991) A combinatorial description of the closest tree algorithm for finding evolutionary trees. Discrete Math 96:51–58MathSciNetCrossRefGoogle Scholar
 Hobolth A, Jensen JL (2005) Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Stat Appl Genet Mol Biol. https://doi.org/10.2202/15446115.1127
 Holmes I, Rubin GM (2002) An expectation maximization algorithm for training hidden substitution models. J Mol Biol 317:753–64CrossRefGoogle Scholar
 Hosten S, Khetan A, Sturmfels B (2005) Solving the likelihood equations. Found Comput Math 5:389–407MathSciNetCrossRefGoogle Scholar
 Janin R (1984) Direction derivative of the marginal function in nonlinear programming. Math Program Stud 21:127–138CrossRefGoogle Scholar
 Karush W (1939) Minima of functions of several variables with inequalities as side constraints, M.Sc. Dissertation. Department of Mathematics, University of Chicago, Chicago, IllinoisGoogle Scholar
 Kenney T, Gu H (2012) Hessian calculation for phylogenetic likelihood based on the pruning algorithm and its applications. Stat Appl Genet Mol Biol 11(4):1–46MathSciNetCrossRefGoogle Scholar
 Kingman JFC (1962) The imbedding problem for finite Markov chains. Z Wahrscheinlichkeitstheorie 1:14–24MathSciNetCrossRefGoogle Scholar
 Kuhn HW, Tucker AW (1951) Nonlinear programming. In: Proceedings of 2nd Berkeley symposium, University of California Press, Berkeley, pp 481–492Google Scholar
 Lake J (1987) A rateindependent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol Biol Evol 4:167–191Google Scholar
 Leykin A (2011) Numerical algebraic geometry. J Softw Algebra Geom Macaulay2 3:5–10MathSciNetCrossRefGoogle Scholar
 Matsen F (2009) Fourier transform inequalities for phylogenetic trees. IEEE/ACM Trans Comput Biol Bioinform 6:89–95CrossRefGoogle Scholar
 Neyman J (1971) Molecular studies of evolution: a source of novel statistical problems. In: Gupta SS, Yackel J (eds) Statistical decision theory and related topics. Academic Press, New York, pp 1–27Google Scholar
 Norris JR (1998) Markov chains. Cambridge University Press, CambridgezbMATHGoogle Scholar
 Olsen GJ, Matsuda H, Hagstrom R, Overbeek R (1994) fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Bioinformatics 10:41–8CrossRefGoogle Scholar
 RocaLacostena J, FernándezSánchez J (2017) Embeddability of Kimura 3ST Markov matrices, arXiv:1703.02263
 Rostalski P, Fotiou IA, Bates DJ, Beccuti AG, Morari M (2011) Numerical algebraic geometry for optimal control applications. SIAM J Optim 21:417–37MathSciNetCrossRefGoogle Scholar
 Schadt EE, Sinsheimer JS, Lange K (1998) Computational advances in maximum likelihood methods for molecular phylogeny. Genome Res 8:222–33CrossRefGoogle Scholar
 Sommese AJ, Wampler CW (2005) The numerical solution of systems of polynomials arising in engineering and science. World Scientific Publishing, HackensackCrossRefGoogle Scholar
 Steel M (1994) The maximum likelihood point for a phylogenetic tree is not unique. Syst Biol 43:560–4CrossRefGoogle Scholar
 Steel M, Hendy MD, Penny D (1998) Reconstructing phylogenies from nucleotide pattern probabilities: a survey and some new results. Discrete Appl Math 88:367–396MathSciNetCrossRefGoogle Scholar
 Sturmfels B, Sullivant S (2005) Toric ideals of phylogenetic invariants. J Comput Biol 12:204–228CrossRefGoogle Scholar
 Székely LA, Steel MA, Erdös PL (1993) Fourier calculus on evolutionary trees. Adv Appl Math 14:200–210MathSciNetCrossRefGoogle Scholar
 Verschelde J (1999) PHCpack: a generalpurpose solver for polynomial systems by homotopy continuation. ACM Trans Math Softw 25:251–276CrossRefGoogle Scholar
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.