Abstract
We study model embeddability, which is a variation of the famous embedding problem in probability theory, when apart from the requirement that the Markov matrix is the matrix exponential of a rate matrix, we additionally ask that the rate matrix follows the model structure. We provide a characterisation of model embeddable Markov matrices corresponding to symmetric groupbased phylogenetic models. In particular, we provide necessary and sufficient conditions in terms of the eigenvalues of symmetric groupbased matrices. To showcase our main result on model embeddability, we provide an application to hachimoji models, which are eightstate models for synthetic DNA. Moreover, our main result on model embeddability enables us to compute the volume of the set of model embeddable Markov matrices relative to the volume of other relevant sets of Markov matrices within the model.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
The embedding problem for stochastic matrices, also known as Markov matrices, deals with the question of deciding whether a stochastic matrix M is the matrix exponential of a rate matrix Q. A rate matrix, also known as a Markov generator, has nonnegative nondiagonal entries and row sums equal to zero. If a stochastic matrix satisfies such a property and can be expressed as a matrix exponential of a rate matrix, namely \(M= e^{Qt}\), then M is called embeddable. Applications of the embeddability property vary from biology and nucleotide substitution models (Verbyla et al. 2013) to mathematical finance (Israel et al. 2001). For a first formulation of the embedding problem see Elfving (1937). An account of embeddable Markov matrices is provided in Davies (2010). The embedding problem for \(2 \times 2\) matrices is due to Kendall and first published by Kingman (1962), for \(3 \times 3\) matrices is fully settled in a series of papers Carette (1995), Chen and Chen (2011), Cuthbert (1973), Israel et al. (2001) and Johansen (1974), while for \(4 \times 4\) stochastic matrices has been recently solved in Casanellas et al. (2020b). In general, when the size n of the stochastic matrix is greater than 4, the work Casanellas et al. (2020b) establishes a criterion for deciding whether a generic \(n\times n\) Markov matrix with distinct eigenvalues is embeddable and proposes an algorithm that lists all its Markov generators. In the present paper we study a refinement of the classical embedding problem, called the model embedding problem for a class of \(n \times n\) stochastic matrices coming from phylogenetic models.
Phylogenetics is the field that aims at reconstructing the history of evolution of species. A phylogenetic model is a mathematical model used to understand the evolutionary process given genetic data sets. The most popular phylogenetic models are nucleotide substitution models which use aligned DNA sequence data to study the molecular evolution of DNA. A comprehensive treatment of phylogenetic methods is given by Felsenstein, who is considered the initiator of statistical phylogenetics, in his seminal book Felsenstein (2003). Algebraic and geometric methods have been employed with great success in the study of phylogenetic models leading to an explosion of related research work and the establishment of the field of phylogenetic algebraic geometry, also known as algebraic phylogenetics; see Allman and Rhodes (2003), Baños et al. (2016), Casanellas and FernándezSánchez (2007), Cavender and Felsenstein (1987), Gross and Long (2018), Evans and Speed (1993), Lake (1987), Pachter and Sturmfels (2004) and Sturmfels and Sullivant (2005) for a nonexhaustive list of publications.
To build such a phylogenetic model, we first require a phylogenetic tree T, which is a directed acyclic graph comprising of vertices and edges representing the evolutionary relationships of a group of species. The vertices with valency 1 are called the leaves of the tree. The tree is considered rooted and the direction of evolution is from the root towards the leaves. On each vertex of the tree T, we associate a random variable with k possible states, where in phylogenetics k is often taken to be 2, for the binary states \(\{0,1\}\), or 4, to represent the four types of DNA nucleotides \(\{ \texttt {A, C, G, T} \}\). We also require a transition matrix (also known as a mutation matrix) \(M^{(e)}\) corresponding to each edge e of the tree, where the entries of this \(k\times k\) matrix \(M^{(e)}\) represent the probabilities of transition between states. In a phylogenetic tree, the leaves correspond to extant species and so the random variables at the leaves are observed, while the interior vertices correspond to possibly extinct species and so the random variables at the interior vertices are hidden.
In this paper we are focusing on symmetric groupbased substitution models. Substitution models are a class of phylogenetic models which use a Markov process to describe the substitution of nucleotides over time in a given DNA sequence and for which the transition matrices along an edge e are stochastic matrices of the form \(M^{(e)}=\exp \left( t_eQ^{(e)}\right) \). Groupbased models are a special class of substitution models, in which the matrices \(Q^{(e)}\) can be pairwise distinct, but they can all be simultaneously diagonalizable by a linear change of coordinates given by the discrete Fourier transform of an abelian group, also called a commutative group. For example, the Cavender–Farris–Neyman (CFN) model (Cavender 1978; Farris 1973; Neyman 1971), as well as the Jukes–Cantor (JC) (Jukes and Cantor 1969), the Kimura2 parameter (K2P) (Kimura 1980) and the Kimura3 parameter (K3P) (Kimura 1981) models for DNA are all groupbased phylogenetic models. In Sturmfels and Sullivant (2005), it is established that through the discrete Fourier transform groupbased models correspond to toric varieties, which are geometric objects with nice combinatorial properties. We are interested in symmetric groupbased substitution models. Namely, apart from distinct and simultaneously diagonalizable, the transition matrices \(Q^{(e)}\) are also symmetric square matrices. The symmetricity assumption guarantees that the eigenvalues of rate and transition matrices of a groupbased model are real, a property that we use in the proof of our main theorem. Symmetric models are a subset of a special class of models called timereversible models, where the Markov process appears identical when moving forward or backward in time. Our results apply to groupbased models following an ergodic timereversible Markov process, as in this case the rate matrices Q are symmetric according to Pachter and Sturmfels (2004, Lemma 17.2).
The classical embedding problem is concerned with deciding which square Markov matrices are embeddable, namely given a Markov matrix M whether there exists a rate matrix Q such that \(M=\exp (Q)\). A variant of the embedding problem that asks for a reversible Markov generator for a stochastic matrix is studied in Jia (2016). When we impose the assumption that the rate matrix Q follows the corresponding model conditions, we arrive at a different refined notion of embeddability called model embeddability. The embeddability of circulant and equalinput stochastic matrices is studied in Baake and Sumner (2020). In the current paper, we focus on the \((\mathcal {G},L)\)embeddability for \(n \times n\) matrices corresponding to symmetric groupbased substitution models. The \((\mathcal {G},L)\)embeddability means that we require that the rate matrices Q preserve the symmetric groupbased structure imposed by the abelian group \(\mathcal {G}\) and the symmetric labelling L, which we define at the beginning of Sect. 2. Model embeddability for symmetric groupbased models is relevant both for homogeneous and inhomogeneous timecontinuous processes, as groupbased models are Lie Markov models, and hence multiplicatively closed (Sumner et al. 2012; Verbyla et al. 2013). A study of the set of embeddable and modelembeddable matrices corresponding to the Jukes–Cantor, Kimura2 and Kimura3 DNA substitution models, which are all symmetric groupbased models, is undertaken in Casanellas et al. (2020a) and RocaLacostena and FernándezSánchez (2018). In particular, a full characterisation of the set of embeddable \(4\times 4\) Kimura 2parameter matrices is provided in Casanellas et al. (2020a), which together with the results of RocaLacostena and FernándezSánchez (2018) fully solve the embedding problem for the Kimura 3parameter model as well. Although model embeddability, which is a refined notion of embeddability imposed by the model structure, implies classical embeddability, the converse is generally not true (see also RocaLacostena and FernándezSánchez 2018, Example 3.1).
The main result of this paper is a characterization of \((\mathcal {G},L)\)embeddability for any abelian group \(\mathcal {G}\) equipped with a symmetric \(\mathcal {G}\)labeling function L in Theorem 1. We provide necessary and sufficient conditions which the eigenvalues of the stochastic matrix of the model need to satisfy for the matrix to be \((\mathcal {G},L)\)embeddable. To showcase our result, we first introduce three groupbased models with the underlying group \(\mathbb {Z}_2 \times \mathbb {Z}_2 \times \mathbb {Z}_2\), based on the hachimoji DNA system introduced in Hoshika et al. (2019). Hachimoji, a Japanese word meaning “eight letters”, is used to describe a synthetic analog of the nucleic acid DNA, where we have the four natural nucleobases \(\{\texttt {A,C,G,T}\}\) and furthermore an additional four synthetic nucleotides \(\{\texttt {P,Z,B,S}\}\). We then apply Theorem 1 to characterise the model embeddability for the three hachimoji DNA models. The three models are called hachimoji 7parameter, hachimoji 3parameter and hachimoji 1parameter models, which can be thought of as generalisations of the Kimura 3, Kimura 2 and Jukes–Cantor models respectively. Finally, the characterisation of model embeddability in terms of eigenvalues enables us to compute the volume of the \((\mathcal {G},L)\)embeddable Markov matrices and compare this volume with volumes of other relevant sets of Markov matrices. For the general Jukes–Cantor model, which includes the hachimoji 1parameter model, the volumes can be derived exactly; for the hachimoji 3parameter model and for the hachimoji 7parameter model symbolically and numerically.
The outline of the paper is the following. Section 2 gives a mathematical background covering notions such as the labeling functions, groupbased models and the discrete Fourier transform. Section 3 introduces symmetric \(\mathcal {G}\)compatible labelings which is a class of labeling functions with particularly nice properties. Section 4 presents the main result of this paper about the model embedding problem for symmetric groupbased models equipped with a certain labeling function. Then in Sect. 5 we focus on the hachimoji DNA and provide exact characterisation of the model embeddability in terms of eigenvalues of the Markov matrix for the hachimoji 7parameter, the hachimoji 3parameter and the hachimoji 1parameter models. Finally, Sect. 6 presents results on the volume of stochastic matrices that are \((\mathcal {G},L)\)embeddable for the three hachimoji groupbased models. The code for the computations in this paper is available at https://github.com/ardiyam1/ModelEmbeddabilityforSymmetricGroupBasedModels.
2 Preliminaries
In this section, we give background on groupbased models and the discrete Fourier transform.
Definition 1
Let \(\mathcal {G}\) be a finite additive abelian group and \(\mathcal {L}\) a finite set. A labeling function is any function \(L:\mathcal {G}\rightarrow \mathcal {L}\).
In the groupbased model with underlying finite abelian group \(\mathcal {G}\), states are in bijection with the elements of the group \(\mathcal {G}\). Fundamental in the definition of a groupbased model associated to a finite additive abelian group \(\mathcal {G}\) and a labeling function L is that the rate of mutation from a state g to state h depends only on \(L(hg)\): That is, the entries of a rate matrix Q are
for a vector \(\psi \in \mathbb {R}^{\mathcal {G}}\) satisfying \(\sum _{g \in \mathcal {G}} \psi (g)=0\), \(\psi (g) \ge 0\) for all nonzero \(g \in \mathcal {G}\) and \(\psi (g)=\psi (h)\), whenever \(L(g)=L(h)\). We say that such Q is a \((\mathcal {G},L)\)rate matrix. In this paper, the rate matrices in groupbased models are assumed to be symmetric, i.e., \(\psi (g)=\psi (g)\) for every \(g\in \mathcal {G}\). Since the matrix exponential of a symmetric matrix is again symmetric, then the entries of a transition matrix \(P=\exp (Q)\) are
for a vector \(f \in \mathbb {R}^{\mathcal {G}}\) satisfying \(\sum f(g)=1\), \(f(g) \ge 0\) for all \(g \in \mathcal {G}\) and \(f(g)=f(g)\) for all \(g \in \mathcal {G}\). As we see in Example 2, in general it is not true that \(f(g)=f(h)\) whenever \(L(g)=L(h)\). In Sect. 3, we introduce \(\mathcal {G}\)compatible labeling functions which guarantee this property and then we say that P is a \((\mathcal {G},L)\)Markov matrix.
Example 1
Let \(\mathcal {G}=\mathbb {Z}_2\times \mathbb {Z}_2\) and \(\mathcal {L}=\{0,1,2,3\}\). We identify nucleotides with the group elements of \(\mathbb {Z}_2\times \mathbb {Z}_2\) as \(\texttt {A}=(0,0), \texttt {T}=(0,1), \texttt {C}=(1,0) \text { and } \texttt {G}=(1,1)\). The Kimura 3parameter, the Kimura 2parameter and the Jukes–Cantor models correspond to the following labeling functions
respectively. The Kimura 3parameter rate and transition matrices have the form
In the case of the Kimura 2parameter model additionally \(b=c\), after choosing the ordering \(\texttt {A}, \texttt {T}, \texttt {C}, \texttt {G}\) of nucleotide bases. This will be further explained in Example 3. The Jukes–Cantor model is the submodel when \(b=c=d\).
Example 2
Let \(\mathcal {G}=\mathbb {Z}_7\) and L be a labeling function such that \(L(1)=L(2)=L(5)=L(6)\) and \(L(3)=L(4)\). Consider the \((\mathcal {G},L)\)rate matrix
In this rate matrix, \(\psi (1)=\psi (2)=\psi (5)=\psi (6)=0.125\) and \(\psi (3)=\psi (4)=0.25\). By direct computation, we get
The matrix P is not a \((\mathcal {G},L)\)Markov matrix, since \(0.0858551=f(1)=f(6)\ne f(2)=f(5)=0.0834148\). The entries of P induce a labeling function \(L'\) such that \(L'(1)=L'(6)\ne L'(2)=L'(5)\) and \(L'(3)=L'(4).\) In this case, the matrix P is a \((\mathcal {G},L')\)Markov matrix.
Example 2 shows that the matrix exponential does not necessarily preserve the labeling function associated to a rate matrix. Conversely, Example 3.1 of RocaLacostena and FernándezSánchez (2018) suggests that a Kimura 3parameter Markov matrix can be embeddable, despite the fact that it does not have any Markov generator satisfying Kimura 3parameter model constraints.
Let \(\mathbb {C}^*\) denote the multiplicative group of complex numbers without zero. A group homomorphism from \(\mathcal {G}\) to \(\mathbb {C}^*\) is called a character of \(\mathcal {G}\). The characters of \(\mathcal {G}\) form a group under multiplication, called the character group of \(\mathcal {G}\) and denoted by \(\widehat{\mathcal {G}}\). Here the product of two characters \(\chi _1,\chi _2\) of the group \(\mathcal {G}\) is defined by \((\chi _1\chi _2)(g)=\chi _1(g)\chi _2(g)\) for every \(g\in \mathcal {G}\). The character group \(\widehat{\mathcal {G}}\) is isomorphic to \(\mathcal {G}\). Given a group isomorphism between \(\mathcal {G}\) and \(\widehat{\mathcal {G}}\), we will denote by \(\widehat{g}\in \widehat{\mathcal {G}}\) the image of \(g\in \mathcal {G}\). For a finite group \(\mathcal {G}\), the values of characters are roots of unity.
Lemma 1
(Pachter and Sturmfels 2005, Lemma 17.1) Let \(g,h \in \mathcal {G}\) and \(k \in \mathbb {Z}\). Then \(\widehat{g}(h)=\overline{\widehat{g}(h)}\) and \(\widehat{kg}(h)=\widehat{g}(kh)\), where \(\overline{a}\) denotes the complex conjugate of \(a \in \mathbb {C}\).
Given a function \(a: \mathcal {G}\rightarrow \mathbb {C}\), its discrete Fourier transform is a function \(\check{a}: \mathcal {G}\rightarrow \mathbb {C}\) defined by
Lemma 2
(Matsen 2008, Section 2) For any realvalued function \(a:\mathcal {G}\rightarrow \mathbb {C}\), the identity \(\check{a}(g)=\overline{\check{a}(g)}\) holds for all \(g \in \mathcal {G}\). Moreover, if \(a(g)=a(g)\) for all \(g \in \mathcal {G}\), then \(\check{a}(g)=\check{a}(g)\) for all \(g \in \mathcal {G}\) and \(\check{a}\) is a realvalued function.
In the proof of Theorem 1, we will use that \(\check{\psi }\) and \(\check{f}\) are realvalued. For this reason, in this paper we consider only groupbased models that are equipped with symmetric labeling functions, i.e. \(L(g)=L(g)\) for all \(g\in \mathcal {G}\). In other words, a symmetric groupbased model assumes that the transition matrices are real symmetric matrices.
The discrete Fourier transform is a linear endomorphism on \(\mathbb {C}^{\mathcal {G}}\). We will denote its matrix by K. In particular, the entries of K are \(K_{g,h}=\widehat{g}(h)\) for \(g,h \in \mathcal {G}\). The matrix K is symmetric for any finite abelian group (Luong 2009, Section 3.2). The inverse of the discrete Fourier transformation matrix is \(K^{1}=\frac{1}{\mathcal {G}} K^*\), where \(K^*\) denotes the adjoint of K (Luong 2009, Corollary 3.2.2).
The following lemma describes the relation between functionals f and \(\psi \) in the case \(f(g)=f(g)\) and \(\psi (g)=\psi (g)\) for all \(g \in \mathcal {G}\).
Lemma 3
(Matsen 2008, Lemma 2.2) Let Q be determined by \(\psi \in \mathbb {R}^{\mathcal {G}}\) and P be determined by \(f \in \mathbb {R}^{\mathcal {G}}\) as described earlier in this section such that \(P=e^Q\). Furthermore, assume that \(\psi (g)=\psi (g)\) and \(f(g)=f(g)\) for all \(g \in \mathcal {G}\). Then, \(\check{f}(g)=e^{\check{\psi }(g)}\) for all \(g\in \mathcal {G}\).
Lemma 4
Let Q be determined by \(\psi \in \mathbb {R}^{\mathcal {G}}\) and P be determined by \(f \in \mathbb {R}^{\mathcal {G}}\) as described earlier in this section. Furthermore, assume that \(\psi (g)=\psi (g)\) and \(f(g)=f(g)\) for all \(g \in \mathcal {G}\). Let \(K_g\) denote the column of the discrete Fourier transform matrix labeled by g. The eigenpairs of Q (resp. P) are \((\check{\psi }(g),K_g)\) (resp. \((\check{f}(g),K_g)\)) for \(g \in \mathcal {G}\).
Proof
This result is stated in the proof of Matsen (2008, Lemma 2.2).
In particular, in the case of a Markov matrix, the column vector of ones is an eigenvector with eigenvalue one. In the case of a rate matrix, the column vector of ones is an eigenvector with eigenvalue zero. A direct consequence of Lemma 4 is that Q and P are diagonalizable by K, i.e. \(Q=KD_1K^{1}\) and \(P=KD_2K^{1}\) where \(D_1\) and \(D_2\) are diagonal matrices with diagonals given by the vectors \(\check{\psi }\) and \(\check{f}\) of \(\mathbb {R}^{\mathcal {G}}\) respectively.
3 \(\mathcal {G}\)compatible labeling functions
In this section, we introduce a class of labeling functions with the property that the symmetries of the probability vector are preserved under the discrete Fourier transformation. This property is required for any result that is proven using the discrete Fourier transform. Notably, labeling functions for all common groupbased models (CFN, K3P, K2P, and JC models) are \(\mathcal {G}\)compatible.
Definition 2
Let \(\mathcal {G}\) be a finite additive abelian group, \(\mathcal {L}\) a set and \(L:\mathcal {G}\rightarrow \mathcal {L}\) a labeling function. Let K be the discrete Fourier transformation matrix for \(\mathcal {G}\) and \(x_L\) be the column vector of length \(\mathcal {G}\) whose gth component is the indeterminate \(x_{L(g)}\). We say that L is a \(\mathcal {G}\)compatible labeling function if for every \(g,h\in \mathcal {G}\) with \(L(g)=L(h)\), we have that \(K_{g,:} \cdot x_L=K_{h,:} \cdot x_L\) and \((K^{1})_{g,:} \cdot x_L=(K^{1})_{h,:} \cdot x_L\). Here \(M_{a,:}\) denotes the row of M indexed by group element a.
A labeling function that maps every group element to a different label is trivially \(\mathcal {G}\)compatible.
Remark 1
In the definition of a \(\mathcal {G}\)compatible labeling, we require that the matrices K and \(K^{1}\) preserve the symmetries of the vector of labels \(x_L\). For symmetric groupbased models, it is enough to require that only K or \(K^{1}\) preserves the symmetries of the vector of labels \(x_L\). Recall that
The property \(\widehat{g}(h)=\overline{\widehat{g}(h)}\) implies \(K_{g,h}=\overline{K_{g,h}}\) and \(\overline{K_{g,h}}=K_{g,h}\). If \(h=h\), this means \(\overline{K_{g,h}}=K_{g,h}\) for all \(g \in \mathcal {G}\). If \(h \ne h\), then taking into account that \(x_{L(h)}=x_{L(h)}\) gives \(\overline{K_{g,h}} \cdot x_{L(h)}+\overline{K_{g,h}} \cdot x_{L(h)}=K_{g,h} \cdot x_{L(h)}+K_{g,h} \cdot x_{L(h)}\). Hence \(K^{1} \cdot x_L = 1/\mathcal {G}\cdot K \cdot x_L\).
Remark 2
If a labeling function L is symmetric \(\mathcal {G}\)compatible and Q is a \((\mathcal {G},L)\)rate matrix, then a Markov matrix \(P=e^Q\) is a \((\mathcal {G},L)\)Markov matrix, i.e. \(P_{g,h}=f(hg)\) for a vector \(f \in \mathbb {R}^G\) and \(f(g)=f(h)\) whenever \(L(g)=L(h)\).
Example 3
Let \(\mathcal {G}= \mathbb {Z}_2\times \mathbb {Z}_2\). The discrete Fourier transformation matrix for \(\mathcal {G}\) is
To show \(\mathcal {G}\)compatibility for the three labeling functions from Example 1, it is enough to check that K preserves the symmetries of \(x_L\). The labeling function of the Jukes–Cantor model is \(\mathcal {G}\)compatible, since
The labeling function of the Kimura 2parameter model is \(\mathcal {G}\)compatible, since
In the literature, usually \(L((1,0))=L((1,1))\) in the Kimura 2parameter model. However, here we assume that \(L((1,0))=L((0,1))\) which is simply due to the fact that we consider the identification \(\texttt {A}=(0,0), \texttt {T}=(0,1), \texttt {C}=(1,0) \text { and } \texttt {G}=(1,1)\). To be more precise, nucleotide bases fall into two categories depending on the molecular mechanisms of the base; purines (A or G) and pyrimidines (C or T). A transition occurs when a purine is substituted by a purine, or a pyrimidine by a pyrimidine. A change from a purine to a pyrimidine, or vice versa, is a transversion. The Kimura 2parameter model of sequence evolution distinguishes between transitions and transversions to account for the biological fact that transitions occur at higher rate than transversions (Kimura 1980, 1981). The rate and transition matrix of the Kimura 2parameter model have the form
The reason for choosing this identification and ordering is that we can use the discrete Fourier transform matrix in a format, which better demonstrates that it is the Kronecker product of discrete Fourier transformation matrices for \(\mathbb {Z}_2\). The labeling function of the Kimura 3parameter is \(\mathcal {G}\)compatible, because each group element maps to a different label.
Sturmfels and Sullivant consider a different class of labeling functions, called friendly labelings (Sturmfels and Sullivant 2005). Groupbased models with friendly labeling functions are equivalent to \(\mathcal {G}\)models defined by Michałek (2011, Remark 5.2). \(\mathcal {G}\)models are constructed using an arbitrary group \(\mathcal {G}\) that has a normal, abelian subgroup \(\mathcal {H}\) which acts transitively and freely on the finite set of states. The importance of \(\mathcal {G}\)models is that they are toric. We explore connections between friendly labelings and \(\mathcal {G}\)compatible labelings in “Appendix A”. We conjecture that every symmetric \(\mathcal {G}\)compatible labeling is a friendly labeling, but not vice versa.
The following lemma provides a necessary condition for \(\mathcal {G}\)compatible labeling functions.
Lemma 5
Let \(\mathcal {G}\) be a finite additive abelian group, \(\mathcal {L}\) a set and \(L:\mathcal {G}\rightarrow \mathcal {L}\) a labeling function. If L is \(\mathcal {G}\)compatible, then \(L(0)\ne L(g)\) for any \(g\ne 0\).
Proof
Let K be the discrete Fourier transformation matrix for \(\mathcal {G}\). The entries of K are \(\widehat{g}(h)\) for \(g,h \in \mathcal {G}\), which are roots of unity. The row \(K_{0,:}\) consists of ones. On the other hand, no other row of K consists of ones only, as this would contradict the uniqueness of the identity element in the character group. In particular, every other row of K contains at least one element whose real part is strictly less than one. Thus it is impossible that \(K_{0,:}\cdot x_L = K_{g,:}\cdot x_L\) for \(g \ne 0\).
Table 1 summarizes up to isomorphism all possible symmetric \(\mathcal {G}\)compatible labeling functions for additive abelian groups of order up to eight. In the table, two group elements receive the same label if they belong to the same subset in a partition of \(\mathcal {G}\).
We saw in Example 3 that the labeling function of the Jukes–Cantor model that assigns the same label to each nonzero element of the group \(\mathcal {G}=\mathbb {Z}_2\times \mathbb {Z}_2\) is a \(\mathcal {G}\)compatible labeling. This example can be generalized to other groups.
Lemma 6
Let \(\mathcal {G}\) be a finite abelian group. Let \(L:\mathcal {G}\rightarrow \{0,1\}\) be a labeling function such that and \(L(0)=0\) and \(L(g)=1\) for \(g\ne 0\). Then the labeling function L is symmetric \(\mathcal {G}\)compatible.
Proof
Clearly the labeling function L is symmetric. Let K be the discrete Fourier transformation matrix for \(\mathcal {G}\). By Luong (2009, Corollary 3.2.1), we have
Hence
Hence L is a \(\mathcal {G}\)compatible labeling function.
We call the model in Lemma 6 the general Jukes–Cantor model. We finish this section with giving another class of labeling functions that are \(\mathcal {G}\)compatible for every finite abelian group \(\mathcal {G}\).
Lemma 7
Let \(\mathcal {G}\) be a finite additive abelian group, \(\mathcal {L}\) a finite set and \(L:\mathcal {G}\rightarrow \mathcal {L}\) a labeling function such that for any two distinct elements \(g,h\in \mathcal {G}\), \(L(g)=L(h)\) if and only if \(g=h\). Then L is \(\mathcal {G}\)compatible.
Proof
By Lemma 1, the identity \(\widehat{g}(h)=\widehat{g}(h)\) holds for all \(g,h\in \mathcal {G}\). Then
Thus, the labeling function L is \(\mathcal {G}\)compatible as \(x_L\) is the column vector of indeterminates \(x_{L(g)}\).
The converse of Lemma 7 is not true in general. Two examples are given by the labeling functions for the Kimura 2parameter and the Jukes–Cantor model.
4 Model embeddability
The following theorem is the main result of this paper. It characterizes \((\mathcal {G},L)\)embeddable transition matrices in terms of their eigenvalues.
Theorem 1
Fix a finite abelian group \(\mathcal {G}\), a finite set \(\mathcal {L}\), and a symmetric \(\mathcal {G}\)compatible labeling function \(L: \mathcal {G} \rightarrow \mathcal {L}\). Let P be a \((\mathcal {G},L)\)Markov matrix. Then P is \((\mathcal {G},L)\)embeddable if and only if the vector \(\lambda \in \mathbb {R}^{\mathcal {G}}\) of eigenvalues of P is in the set
Proof
We start by summarizing the idea of the proof. We consider the set \(\Psi _{\mathcal {G},L}\) that consists of vectors \(\psi \) that determine \((\mathcal {G},L)\)rate matrices. Our goal is to characterize the set \(\check{F}_{\mathcal {G},L}\) of eigenspectra of Markov matrices that are matrix exponentials of \((\mathcal {G},L)\)rate matrices determined by vectors \(\psi \) in \(\Psi _{\mathcal {G},L}\). The first step is to consider the discrete Fourier transform of the set \(\Psi _{\mathcal {G},L}\), which we denote by \(\check{\Psi }_{\mathcal {G},L}\). By Lemma 4, this set is the set of eigenvalues of the \((\mathcal {G},L)\)rate matrices. The second step is to consider the image of the set \(\check{\Psi }_{\mathcal {G},L}\) under coordinatewise exponentiation. This set is precisely \(\check{F}_{\mathcal {G},L}\), because \((\mathcal {G},L)\)rate matrices are diagonalizable by the discrete Fourier transform matrix K by the discussion after Lemma 4 and thus if a \((\mathcal {G},L)\)rate matrix Q is determined by \(\psi \in \mathbb {R}^{\mathcal {G}}\) then
where \(\check{\psi }\) is the vector of eigenvalues of Q and \(e^{\check{\psi }}\) is the vector of eigenvalues of P.
More specifically, let
The vectors in the set \(\Psi _{\mathcal {G},L}\) are in onetoone correspondence with \((\mathcal {G},L)\)rate matrices. The image of \(\Psi _{\mathcal {G},L}\) under the discrete Fourier transform is the set
By Lemma 4, this set is the set of eigenvalues of the \((\mathcal {G},L)\)rate matrices.
The image of \(\check{\Psi }_{\mathcal {G},L}\) under the coordinatewise exponentiation is the set of eigenvalues of the \((\mathcal {G},L)\)Markov matrices, which we denote by \(\check{F}_{\mathcal {G},L}\). We claim that \(\check{F}_{\mathcal {G},L}\) is equal to the set
Indeed, let \(\check{f}=\exp (\check{\psi })\). Then \(\check{f}>0\) because the image of the exponentiation map is positive. The inequality \(a^Tx\ge 0\) is equivalent to \(\exp (a^Tx)\ge 1.\) Hence, the equation \(\check{\psi }(0)=0\) gives \(\check{f}(0)=1\) and the inequalities \((K^{1}\check{\psi })(g)\ge 0\) give
for all nonzero \(g\in \mathcal {G}\). Hence \(\check{f}\) is in the set (4.1). Conversely, let \(\check{f}\) be a vector in the set (4.1). Then under coordinatewise logarithm, \(\log (\check{f}) \in \check{\Psi }_{\mathcal {G},L}\) and \(\check{f}=\exp (\log (\check{f}))\). Hence \(\check{f}\) is in the image of \(\check{\Psi }_{\mathcal {G},L}\). Thus \(\check{F}_{\mathcal {G},L}\) is equal to the set (4.1).
It is left to rewrite the inequalities (4.2) as in the statement of the theorem. We have
for all \(g,h\in \mathcal {G}\). Here we use Lemma 1 and the definition of the discrete Fourier transformation matrix. If \(h=h\), then \((K^{1})_{g,h}=(K^{1})_{g,h}=\overline{(K^{1})_{g,h}}\), and hence \((K^{1})_{g,h}=\text {Re}((K^{1})_{g,h})\). If \(h \ne h\), then \(\check{f}(h)=\check{f}(h)\) by Lemma 2. Hence
We replace \(K^{1}\) by \(1/\mathcal {G}\cdot \overline{K}\) and take both sides of the resulting inequality to the power \(\mathcal {G}\). Finally, making the substitution \(\lambda _{{h}}=\check{f}(h)\) gives the desired characterization.
For \(\mathcal {G}\) cyclic, Theorem 1 has been independently proven by Baake and Sumner in the context of circulant matrices (Baake and Sumner 2020, Theorem 5.7). Moreover, they show that every embeddable circulant matrix is circulant embeddable (Baake and Sumner 2020, Corollary 5.2).
It follows from Lemma 3 that if a \((\mathcal {G},L)\)Markov matrix P is \((\mathcal {G},L)\)embeddable, then there exists a unique \((\mathcal {G},L)\)rate matrix Q such that \(P=\exp (Q)\). Indeed, since Q and P have both real eigenvalues and the eigenvalues of P are exponentials of eigenvalues of Q, then the eigenvalues of Q are uniquely determined by the eigenvalues of P. Then the \((\mathcal {G},L)\)rate matrix Q is the principal logarithm of P.
The inequalities \(\lambda _g >0\) in Theorem 1 imply \(\det (P)=\prod \lambda _{g}>0\). Hence the set of \((\mathcal {G},L)\)embeddable matrices for a symmetric groupbased model is a relatively closed subset of a connected component of the complement of \(\det (P)=0\). A relatively closed subset means here a set that can be written as the intersection of a closed subset of \(\mathbb {R}^{\mathcal {G} \times \mathcal {G}}\) and the connected component of the complement of \(\det (P)=0\).
In the rest of the current section and in Sect. 5, we will discuss applications of Theorem 1. We will recover known results about \((\mathcal {G},L)\)embeddability and as a novel application characterize embeddability for three groupbased models of hachimoji DNA.
Example 4
The CFN model is the groupbased model associated to the group \(\mathbb {Z}_2\). The CFN Markov matrices have the form
The discrete Fourier transform matrix is
The eigenvalues of P are \(\lambda _0=a+b=1\) and \(\lambda _1=ab\). By Theorem 1, the Markov matrix P is CFN embeddable if and only if \(0 < \lambda _1 \le 1\) or equivalently \(0 < ab \le 1\). This is equivalent to P satisfying \(\det (P)>0\), or equivalently \(\text {tr}(P)>1\). The result that a general \(2\times 2\) stochastic matrix is embeddable if and only if \(\det (P)>0\) or \(\text {tr}(P)>1\) goes back to Kingman (1962, Proposition 2). Hence P is CFN embeddable if and only if it is embeddable.
Example 5
Recall that the Kimura 3parameter model is the groupbased model associated to group \(\mathcal {G}=\mathbb {Z}_2 \times \mathbb {Z}_2\) and a K3P Markov matrix P has the form (2.1). The eigenvalues of P are
By Theorem 1, a Markov matrix P is K3P embeddable if and only if
In the Kimura 2parameter model \(b=c\) and \(\lambda _{(0,1)}=\lambda _{(1,0)}\). We get the conditions for the K2P embeddability by setting \(\lambda _{(0,1)}=\lambda _{(1,0)}\) in (4.4). Hence a K2P Markov matrix is K2P embeddable if and only if
In the Jukes–Cantor model \(b=c=d\) and \(\lambda _{(0,1)}=\lambda _{(1,0)}=\lambda _{(1,1)}\). A JC Markov matrix is JC embeddable if and only if
The K3P embeddability of a K3P Markov matrix with no repeated eigenvalues is equivalent to the embeddability of the matrix. Similarly, the JC embeddability of a JC Markov matrix is equivalent to the embeddability of the matrix. The same is not true for K2P Markov matrices with exactly two coinciding eigenvalues. See RocaLacostena and FernándezSánchez (2018, Section 3) for similar computations and further discussion on the model embeddability of K3P, K2P, and JC Markov matrices.
Remark 3
By Kingman (1962, Corollary on page 18), the map from rate matrices to transition matrices is locally homeomorphic except possibly when the rate matrix has a pair of eigenvalues differing by a nonzero multiple of \(2 \pi i\). Since for symmetric groupbased models rate matrices are real symmetric, then all their eigenvalues are real and hence the map from rate matrices to transition matrices is a homeomorphism. Therefore the boundaries of embeddable transition matrices of symmetric groupbased models are images of the boundaries of the rate matrices. For general Markov model, the boundaries of embeddable transition matrices are characterized in Kingman (Kingman 1962, Propositions 5 and 6).
Corollary 1
A \((\mathcal {G},L)\)embeddable transition matrix lies on the boundary of the set of \((\mathcal {G},L)\)embeddable transition matrices for a symmetric groupbased model if and only if it satisfies at least one of the inequalities in Theorem 1 with equality.
5 Hachimoji DNA
In this section, we suggest three groupbased models for a genetic system with eight building blocks recently introduced by Hoshika et al. (2019), and then characterize model embeddability for the proposed groupbased models. The genetic system is called hachimoji DNA. It has four synthetic nucleotides denoted S, B, Z, and P in addition to the standard nucleotides adenine (A), cytosine (C), guanine (G) and thymine (T). Detailed descriptions of the four additional nucleotides are given in Hoshika et al. (2019). If in the standard 4letter DNA, the purines are A and G and the pyrimidines are C and T, then in the hachimoji system, there are additionally purine analogs P and B, and pyrimidine analogs Z and S. The hydrogen bonds occur between the pairs AT, CG, SB and ZP.
This DNA genetic system with eight building blocks can reliably form matching base pairs and can be read and translated into RNA. It is mutable without damaging crystal structure which is required for molecular evolution. Hachimoji DNA has potential application in barcoding, retrievable information storage, and selfassembling nanostructures.
The underlying group we suggest for the hachimoji DNA is \(\mathbb {Z}_2\times \mathbb {Z}_2\times \mathbb {Z}_2\), since when restricted to the standard 4letter DNA it gives the group \(\mathbb {Z}_2 \times \mathbb {Z}_2\) that is the underlying group for the standard DNA models. We identify the nucleotides with the group elements of \(\mathbb {Z}_2 \times \mathbb {Z}_2 \times \mathbb {Z}_2\) as follows:
The discrete Fourier transformation matrix of the group \(\mathbb {Z}_2\times \mathbb {Z}_2\times \mathbb {Z}_2\) is
5.1 Hachimoji 7parameter model
The first model we propose is the analogue of the Kimura 3parameter model and we will call it the hachimoji 7parameter (H7P) model. In the hachimoji 7parameter model, each element of the group \(\mathbb {Z}_2\times \mathbb {Z}_2\times \mathbb {Z}_2\) maps to a distinct label. Thus the labeling function is trivially \((\mathcal {G},L)\)compatible. The H7P rate and transition matrices have the form
The eigenvalues of a H7P Markov matrix are
By Theorem 1, such a matrix is H7P embeddable if and only if all eigenvalues are positive and satisfy
5.2 Hachimoji 3parameter model
The second model we suggest specializes to the Kimura 2parameter model when restricted to the standard 4letter DNA. We will call it the hachimoji 3parameter (H3P) model. We recall that in the Kimura 2parameter model there are three distinct parameters for the rates of mutation: One parameter for a state remaining unchanged, one parameter for transversion from a purine base to a pyrimidine base or vice versa, and one parameter for transition to the other purine or to the other pyrimidine. We say that two bases are of the same type if they are both standard or synthetic bases. In the hachimoji 3parameter model, there are the following parameters:

a: the probability of a state remaining unchanged.

b: the probability of a transversion from a purine base to a pyrimidine base or vice versa.

c: the probability of a transition to another purine or pyrimidine base of the same type (same type transitions).

d: the probability of a transition to another purine or pyrimidine base of different type (different type transitions).
The H3P rate and transition matrices have the form
The labeling function of this model corresponds to the partition
which is \((\mathcal {G},L)\)compatible by Table 1.
The eigenvalues of a H3P Markov matrix are
By Theorem 1, a H3P Markov matrix P is H3P embeddable if and only if the eigenvalues of P satisfy
5.3 Hachimoji 1parameter model
The third model we suggest is the analogue of the Jukes–Cantor model and we will refer to it as hachimoji 1parameter (H1P) model. It is the simplest groupbased model associated to the group \(\mathbb {Z}_2 \times \mathbb {Z}_2 \times \mathbb {Z}_2\) and it is described by only two distinct parameters for the rates of mutation. The two parameters are for a state remaining the same and a state mutating to any other state. The corresponding labeling function is \((\mathcal {G},L)\)compatible by Lemma 6. The H1P rate and transition matrices have the form
The eigenvalues of a H1P Markov matrix are \(w:=\lambda _{(0,0,0)}=1\) and \(x:=\lambda _g=ab\) for \(g\ne 0\). By Theorem 1, such a matrix is H1P embeddable if and only if its eigenvalues satisfy
Remark 4
The same conditions as in (5.6) characterize model embeddability for the general Jukes–Cantor model as defined in Lemma 6. This is also a special instance of a more general result (Baake and Sumner 2020, Corollary 4.7) on equalinput embeddability. If the order of \(\mathcal {G}\) is even, then the notion of general embeddability is equivalent to the notion of model embeddability for the general Jukes–Cantor models by Baake and Sumner (2020, Theorem 4.6).
6 Volume
In this section we compute the relative volumes of model embeddable Markov matrices within some meaningful subsets of Markov matrices by taking advantage of the characterisation of embeddability in terms of eigenvalues. The aim of this section is to describe how large the different sets of matrices are compared to each other and provide intuition of how restrictive is the hypothesis of homogeneous continuoustime models.
We will focus on the hachimoji models and the generalization of the Jukes–Cantor model. We will use the following notation:

(i)
\(\Delta \) is the set of all Markov matrices in a model.

(ii)
\(\Delta _+\) is the subset of matrices in \(\Delta \) with only positive eigenvalues.

(iii)
\(\Delta _{dd}\) is the subset of diagonally dominant matrices in \(\Delta \), i.e. matrices in \(\Delta \) such that in each row the diagonal entry is greater or equal than the sum of all other entries in the same row.

(iv)
\(\Delta _{me}\) is the subset of model embeddable transition matrices in \(\Delta \).
Biologically, the subspace \(\Delta _{dd}\) of diagonally dominant matrices consists of matrices with probability of not mutating at least as large as the probability of mutating. If a diagonally dominant matrix is embeddable, it has an identifiable rate matrix (Cuthbert 1972, 1973), namely a unique Markov generator, which is crucial for proving the consistency of many phylogenetic reconstruction methods, such as those based on maximum likelihood methods (Casanellas et al. 2020c; Chang 1996). What is more, the set of Markov matrices with positive eigenvalues \(\Delta _{+}\) includes the multiplicative closure of the transition matrices in the continuoustime version of the model (Sumner et al. 2012). We have the inclusions \(\Delta _{me}\subseteq \Delta _{+}\subseteq \Delta \) and \(\Delta _{dd}\subseteq \Delta _+\). The volumes of these spaces are given for the Kimura 3parameter model in RocaLacostena and FernándezSánchez (2018, Theorem 4.1), for the Kimura 2parameter model in Casanellas et al. (2020b, Proposition 5.1] and for the Jukes–Cantor model in RocaLacostena and FernándezSánchez (2018, Section 4).
The subsets \(\Delta ,\Delta _+,\Delta _{dd},\) and \(\Delta _{me}\) can be described using the parameterization in terms of the entries of the Markov matrix or in terms of their eigenvalues. We parameterize the relevant subsets of Markov matrices in terms of the eigenvalues of the Markov matrices and compute the volumes using these parametrizations. If \(\varphi \) denotes the bijection from the set of entries of a Markov matrix in a particular model to the set of its eigenvalues and the matrix \(J(\varphi )\) denotes the Jacobian matrix of the map \(\varphi \), then the volume of any subset in the parametrization using entries of a Markov matrix will be \(det(J(\varphi ))\) times the volume in the parameterization using eigenvalues. Since the determinant of this Jacobian is constant for each of the three models we consider, the relative volumes of the set of model embeddable Markov matrices will not depend on the parameterization chosen.
Proposition 1
For the hachimoji 7parameter model, consider \(\Delta \), \(\Delta _+\), and \(\Delta _{dd}\) as subsets of \(\mathbb {R}^7\) parameterized by \(\lambda _{(0,0,1)},\ldots ,\lambda _{(1,1,1)}\), the eigenvalues of a H7P Markov matrix. Then: (i) \(V(\Delta )=\frac{256}{315};\) (ii) \(V(\Delta _+)=\frac{5}{144};\) (iii)\(V(\Delta _{dd})=\frac{2}{315}.\)
Proof
The entries of a H7P Markov matrix (5.2) are determined by a vector (a, b, c, d, e, f, g, h). The entries of this vector can be expressed in terms of the eigenvalues as
where K is the discrete Fourier transform matrix (5.1). In terms of the entries or the eigenvalues of a H7P Markov matrix, the relevant subsets in this model are given by:
\(\Delta _{me}\) is given by one equation and seven inequalities presented in Sect. 5.1. Expressing all conditions defining \(\Delta \), \(\Delta _+\), and \(\Delta _{dd}\) in terms of the eigenvalues \(\lambda _{(0,0,1)},\ldots ,\lambda _{(1,1,1)}\) allows us to compute volumes of these sets using Polymake (Gawrilow and Joswig 2000).
We are not able to compute the volume of the subspace of the H7P embeddable Markov matrices exactly. Instead we estimate the volume using the hitandmiss Monte Carlo integration method (Hammersley 2013) implemented in Mathematica. Table 2 summarizes the volume for various number of sample points. Table 3 gives relative volumes for the relevant sets.
Proposition 2
For the hachimoji 3parameter model, consider \(\Delta \), \(\Delta _+\), \(\Delta _{dd}\), and \(\Delta _{me}\) as subsets of \(\mathbb {R}^3\) parameterized by x, y, z, the eigenvalues of a H3P Markov matrix. Then: (i) \(V(\Delta )=\frac{4}{3}\); (ii) \(V(\Delta _+)=\frac{7}{16}\); (iii) \(V(\Delta _{dd})=\frac{1}{6}\); (iv) \(V(\Delta _{me})=\frac{1}{3}\); (v) \(V(\Delta _{me} \cap \Delta _{dd}) \approx 0.136733\).
Proof
The entries of a H3P Markov matrix as in (5.3) can be expressed in terms of the eigenvalues as
Expressing all conditions defining \(\Delta \), \(\Delta _+\), \(\Delta _{dd}\), and \(\Delta _{me}\) in terms of x, y, z allows us to use the Integrate command in Mathematica to compute the desired volumes. For \(V(\Delta _{me} \cap \Delta _{dd})\) we used the numerical integration command NIntegrate.
The sets \(\Delta _{me}\), \(\Delta _+\) and \(\Delta \) for the hachimoji 3parameter model are depicted in Fig. 1. The relative volumes of relevant sets are given in Table 4.
Finally, we discuss the generalization of the Jukes–Cantor model which includes the hachimoji 1parameter model. Let \(\mathcal {G}\) be a finite abelian group of order n and \(L:\mathcal {G}\rightarrow \{0,1\}\) a labeling function such that and \(L(0)=0\) and \(L(g)=1\) for \(g\ne 0\). In Lemma 6 we proved that L is a \(\mathcal {G}\)compatible labeling. In the general Jukes–Cantor model, the transition matrix P corresponding to this labeling is has the form
Since P is a Markov matrix, then \(a=1(n1)b\), and thus P is parameterized by b.
Proposition 3
For the general Jukes–Cantor model, consider \(\Delta \), \(\Delta _+\), \(\Delta _{dd}\), \(\Delta _{me}\) as subsets of \(\mathbb {R}\) parameterized by b, the offdiagonal element of the Markov matrix. Then: (i) \(\Delta =[0,\frac{1}{n1}]\); (ii) \(\Delta _+=[0,\frac{1}{n})\); (iii) \(\Delta _{dd}=[0,\frac{1}{2(n1)}]\); (iv) \(\Delta _{me}=[0,\frac{1}{n})\); (v) \(\Delta _{me} \cap \Delta _{dd}=[0,\frac{1}{2(n1)}]\).
Proof
The Markov matrix P has eigenvalues 1 with multiplicity 1 and \(ab=1nb\) with multiplicity \(n1\). Hence
(i) \(\Delta =\{b\in \mathbb {R}:a=1(n1)b \ge 0,b\ge 0\}= [0,\frac{1}{n1}]\).
(ii) \(\Delta _+=\{b\in \mathbb {R}:a=1(n1)b \ge 0,b\ge 0, 1nb>0\}= [0,\frac{1}{n})\).
(iii) \(\Delta _{dd}=\{b\in \mathbb {R}:a=1(n1)b \ge 0,b\ge 0, 1(n1)b\ge (n1)b\}= [0,\frac{1}{2(n1)}]\).
(iv) By Remark 4, a Markov matrix is general Jukes–Cantor embeddable if and only if the eigenvalue \(1nb\) satisfies \(1 \ge 1nb >0\). Since \(1 \ge 1nb\) necessarily holds for any Markov matrix, we have \(\Delta _{me}=\Delta _+\).
(v) Since \(\Delta _{dd} \subseteq \Delta _+ = \Delta _{me}\), then \(\Delta _{me} \cap \Delta _{dd}=\Delta _{dd}\).
The relative volumes of relevant sets for the general Jukes–Cantor model are presented in Table 5. Proposition 3 gives for the hachimoji 1parameter model (i) \(\Delta =[0,\frac{1}{7}]\); (ii) \(\Delta _+=[0,\frac{1}{8})\); (iii)\(\Delta _{dd}=[0,\frac{1}{14}]\); (iv) \(\Delta _{me}=[0,\frac{1}{8})\); (v)\(\Delta _{me} \cap \Delta _{dd}=[0,\frac{1}{14}]\).
7 Conclusion
When modelling sequence evolution we often adopt several simplifying assumptions, which make the statistical problems tractable. The commonly used Markov models depend on the assumption that sites evolve independently following a Markov process. The Markov chain is often assumed to be homogeneous continuoustime, that is the transition probabilities are independent of the time. This means that the instantaneous rates of substitution at any time are fixed and usually displayed as the entries of rate matrices. If an evolutionary process is not homogeneous, then one can multiply transition matrices of short homogeneous processes. The resulting matrix is not necessarily embeddable, but if it is, then the inhomogeneous process can be approximated by a homogeneous one.
In this paper we provide necessary and sufficient conditions for modelembeddability of \(n\times n\) symmetric groupbased substitution models, which include the well known Cavender–Farris–Neyman, Jukes–Cantor, Kimura2 and Kimura3 parameter models for DNA. We fully characterize those embeddable \(n\times n\) stochastic matrices following a symmetric groupbased model structure whose Markov generators also satisfy the constraints of the model, which we refer to as model embeddability.
A novel application of our main result is the characterization of model embeddability for three groupbased models for the hachimoji DNA, a synthetic genetic system with eight building blocks. For these models we also compute the relevant volumes of model embeddable matrices within other relevant sets of Markov matrices. These computations show how restrictive is the hypothesis of a particular hachimoji timecontinuous groupbased model.
In this article we have considered symmetric groupbased models. The importance of the symmetricity assumption is that it guarantees that the eigenvalues of rate and transition matrices of a groupbased model are real. We use this property in the proof of Theorem 1. A future research question is to explore whether this approach can be extended to groupbased models that are not symmetric.
References
Allman E, Rhodes J (2003) Phylogenetic invariants for the general Markov model of sequence mutation. Math Biosci 186:133–144
Baake M, Sumner J (2020) Notes on Markov embedding. Linear Algebra Appl 594:262–299
Baños H, Bushek N, Davidson R, Gross E, Harris PE, Krone R, Long C, Stewart A, Walker R (2016) Phylogenetic trees. arXiv:1611.05805
Carette P (1995) Characterizations of embeddable \(3 \times 3\) stochastic matrices with a negative eigenvalue. N Y J Math 1:120–129
Casanellas M, FernándezSánchez J (2007) Performance of a new invariants method on homogeneous and nonhomogeneous quartet trees. Mol Biol Evol 24(1):288–293
Casanellas M, FernándezSánchez J, RocaLacostena J (2020a) Embeddability and rate identifiability of Kimura 2parameter matrices. J Math Biol 80(4):995–1019
Casanellas M, FernàndezSrnànchez J, RocaLacostena J (2020b) The embedding problem for Markov matrices. 2005.00818
Casanellas M, Petrović S, Uhler C (2020c) Algebraic statistics in practice: applications to networks. Annu Rev Stat Appl 7:227–250
Cavender J (1978) Taxonomy with confidence. Math Biosci 40:271–280
Cavender J, Felsenstein J (1987) Invariants of phylogenies in a simple case with discrete states. Classification 4:57–71
Chang JT (1996) Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math Biosci 137(1):51–73
Chen Y, Chen J (2011) On the imbedding problem for threestate time homogeneous Markov chains with coinciding negative eigenvalues. J Theor Probab 24:928–938
Cuthbert RJ (1972) On uniqueness of the logarithm for Markov semigroups. J Lond Math Soc 2(4):623–630
Cuthbert RJ (1973) The logarithm function for finitestate Markov semigroups. J Lond Math Soc 2(3):524–532
Davies EB (2010) Embeddable Markov matrices. Electron J Probab 15:1474–1486
Elfving G (1937) Zur theorie der Markoffschen ketten. Acta Societatis Scientiarum FennicæNova Series A 2(8):17
Evans SN, Speed TP (1993) Invariants of some probability models used in phylogenetic inference. Ann Stat 21:355–377
Farris JS (1973) A probability model for inferring evolutionary trees. Syst Zool 22(3):250–256
Felsenstein J (2003) Inferring phylogenies. Sinauer Associates Inc, Publishers, Sunderland
Gawrilow E, Joswig M (2000) polymake: a framework for analyzing convex polytopes. In: Polytopes—combinatorics and computation (Oberwolfach, 1997), DMV Sem., vol 29, Birkhäuser, Basel, pp 43–73
Gross E, Long C (2018) Distinguishing phylogenetic networks. SIAM J Appl Algebra Geom 2(1):72–93
Hammersley J (2013) Monte Carlo methods. Springer, Berlin
Hoshika S, Leal NA, Kim MJ, Kim MS, Karalkar NB, Kim HJ, Bates AM, Watkins NE, SantaLucia HA, Meyer AJ et al (2019) Hachimoji DNA and RNA: a genetic system with eight building blocks. Science 363(6429):884–887
Israel RB, Rosenthal JS, Wei JZ (2001) Finding generators for Markov chains via empirical transition matrices, with applications to credit ratings. Math Finance 11(2):245–265
Jia C (2016) A solution to the reversible embedding problem for finite Markov chains. Stat Probab Lett 116:122–130
Johansen S (1974) Some results on the imbedding problem for finite Markov chains. J Lond Math Soc 2:345–351
Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Munro H (ed) Mammalian protein metabolism. Academic Press, New York, pp 21–132
Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16(2):111–120
Kimura M (1981) Estimation of evolutionary distances between homologous nucleotide sequences. Proc Natl Acad Sci USA 78(1):454–458
Kingman JFC (1962) The imbedding problem for finite Markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 1(1):14–24
Lake JA (1987) A rateindependent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol Biol Evol 4:167–191
Luong B (2009) Fourier analysis on finite Abelian groups. Springer, Berlin
Matsen FA (2008) Fourier transform inequalities for phylogenetic trees. IEEE/ACM Trans Comput Biol Bioinform 6(1):89–95
Michałek M (2011) Geometry of phylogenetic groupbased models. J Algebra 339(1):339–356
Neyman J (1971) Molecular studies of evolution: a source of novel statistical problems. In: Gupta SS, Yackel J (eds) Stat Decis Theory Relat Top. Academic Press, New York, pp 1–27
Pachter L, Sturmfels B (2004) Tropical geometry of statistical models. Proc Natl Acad Sci USA 101(46):16132–16137
Pachter L, Sturmfels B (2005) Algebraic statistics for computational biology, vol 13. Cambridge University Press, Cambridge
RocaLacostena J, FernándezSánchez J (2018) Embeddability of Kimura 3ST Markov matrices. J Theor Biol 445:128–135
Sturmfels B, Sullivant S (2005) Toric ideals of phylogenetic invariants. J Comput Biol 12(2):204–228
Sumner JG, FernándezSánchez J, Jarvis PD (2012) Lie Markov models. J Theor Biol 298:16–31
Verbyla KL, Von Bing Yap AP, Shao Y, Huttley GA (2013) The embedding problem for Markov models of nucleotide substitution. PLoS ONE 8(7):e69187
Acknowledgements
An initial version of the main result of the present manuscript appeared in the preprint arXiv:1705.09228. Following the advice of referees, we divided the preprint into two manuscripts. The present manuscript builds on the section on the embedding problem in the earlier preprint. Most of the preprint arXiv:1705.09228 appeared in Bulletin of Mathematical Biology under the title “Maximum Likelihood Estimation of Symmetric GroupBased Models via Numerical Algebraic Geometry”.
Funding
Open access funding provided by Aalto University. Dimitra Kosta was partially supported by a Royal Society Dorothy Hodgkin Research Fellowship. Kaie Kubjas was partially supported by the Academy of Finland Grant 323416.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
A Friendly labeling functions
A Friendly labeling functions
Besides \(\mathcal {G}\)compatible labeling functions, there is another class of labeling functions which has been studied in the literature. They are called friendly labeling functions and were introduced by Sturmfels and Sullivant (2005). Friendly labelings are useful in determining phylogenetic invariants for groupbased models on evolutionary trees. In particular, a friendly labeling guarantees that if a particular labeling comes from an assignment of group elements, then any choice of a group element to one particular edge which is consistent with the labeling can be extended to an assignment that is consistent with labeling on all edges of the claw tree.
Definition 3
Let \(\mathcal {G}\) be a finite abelian group and \(L:\mathcal {G}\rightarrow \mathcal {L}\) a labeling function. Let \(n\in \mathbb {N}\) and \(Z:=\{g\in \mathcal {G}^n: g_n=\sum _{i=1}^{n1}g_i\}.\) Define the map \(\tilde{L}:Z\subseteq \mathcal {G}^n\rightarrow \mathcal {L}^n\) to be the induced labeling function on \(Z\subseteq \mathcal {G}^n\). The labeling function L is said to be nfriendly if for every \(l\in \tilde{L}(Z)\) and \(i=1,2,\cdots ,n\), we have \(\pi _i(\tilde{L}^{1}(l))=L^{1}(\pi _i(l))\). Here, \(\pi _i\) denotes the projection to the ith component. Furthermore, the labeling function L is said to be friendly if it is nfriendly for all \(n\ge 3.\)
By Sturmfels and Sullivant (2005, Lemma 11), to check whether a labeling function is friendly, it is enough to check that the labeling is 3friendly.
Example 6
(Sturmfels and Sullivant 2005, Example 9) Let \(\mathcal {G}=\mathbb {Z}_4\) and \(L:\mathcal {G}\rightarrow \{0,1,2\}\) such that
Then L is not friendly labeling because \(L^{1}(\pi _3((1,1,2)))=\{2,3\}\) while \(\pi _3(\tilde{L}^{1}(1,1,2))=\pi _3((1,1,2))=\{2\}.\)
Table 6 summarizes all friendly labelings for abelian groups of order n, where \(2\le n \le 8\). In the table, two group elements receive the same label if they belong to the same subset in a partition of \(\mathcal {G}\). In the table we do not include the friendly labelings for \(\mathbb {Z}_2\times \mathbb {Z}_4\) and \(\mathbb {Z}_2\times \mathbb {Z}_2\times \mathbb {Z}_2\), since there are too many of them.
The next example shows that there are friendly labelings that are not \(\mathcal {G}\)compatible.
Example 7
Let \(\mathcal {G}\) be a finite abelian group and \(L:\mathcal {G}\rightarrow \{0\}\) the labeling function defined by \(L(g)=0\) for all \(g \in \mathcal {G}\). By Lemma 5, the labeling function L is not \(\mathcal {G}\)compatible. However, it is a friendly labeling, since \(\pi _i(\tilde{L}^{1}((0,0,0)))=\pi _i(Z)=\mathcal {G}\) and \(L^{1}(\pi _i((0,0,0)))=L^{1}(0)=\mathcal {G}\).
Computations for abelian groups of order at most eight demonstrate that every symmetric \(\mathcal {G}\)compatible labeling is a friendly labeling. It is left open, if the same is true for any finite abelian group.
Question 1
Given a finite abelian group \(\mathcal {G}\), is the set of all symmetric \(\mathcal {G}\)compatible labelings strictly contained in the set of all friendly labelings?
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ardiyansyah, M., Kosta, D. & Kubjas, K. The modelspecific Markov embedding problem for symmetric groupbased models. J. Math. Biol. 83, 33 (2021). https://doi.org/10.1007/s00285021016565
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s00285021016565