The model-specific Markov embedding problem for symmetric group-based models

We study model embeddability, a variation of the famous embedding problem in probability theory in which, apart from requiring that the Markov matrix is the matrix exponential of a rate matrix, we additionally ask that the rate matrix follows the model structure. We provide a characterisation of model embeddable Markov matrices corresponding to symmetric group-based phylogenetic models. In particular, we provide necessary and sufficient conditions in terms of the eigenvalues of symmetric group-based matrices. To showcase our main result on model embeddability, we provide an application to hachimoji models, which are eight-state models for synthetic DNA. Moreover, our main result on model embeddability enables us to compute the volume of the set of model embeddable Markov matrices relative to the volume of other relevant sets of Markov matrices within the model.


Introduction
The embedding problem for stochastic matrices, also known as Markov matrices, asks whether a given stochastic matrix M is the matrix exponential of a rate matrix Q. A rate matrix, also known as a Markov generator, has non-negative off-diagonal entries and row sums equal to zero. If a stochastic matrix can be expressed as the matrix exponential of a rate matrix, namely $M = e^{Qt}$, then M is called embeddable. Applications of the embeddability property vary from biology and nucleotide substitution models (Verbyla et al. 2013) to mathematical finance (Israel et al. 2001). For a first formulation of the embedding problem see Elfving (1937). An account of embeddable Markov matrices is provided in Davies (2010). The embedding problem for 2 × 2 matrices was solved by Kendall and first published by Kingman (1962); for 3 × 3 matrices it is fully settled in a series of papers by Carette (1995), Chen and Chen (2011), Cuthbert (1973), Israel et al. (2001) and Johansen (1974); and for 4 × 4 stochastic matrices it has recently been solved in Casanellas et al. (2020b). In general, when the size n of the stochastic matrix is greater than 4, Casanellas et al. (2020b) establish a criterion for deciding whether a generic n × n Markov matrix with distinct eigenvalues is embeddable and propose an algorithm that lists all its Markov generators. In the present paper we study a refinement of the classical embedding problem, called the model embedding problem, for a class of n × n stochastic matrices coming from phylogenetic models.
Phylogenetics is the field that aims at reconstructing the history of evolution of species. A phylogenetic model is a mathematical model used to understand the evolutionary process given genetic data sets. The most popular phylogenetic models are nucleotide substitution models which use aligned DNA sequence data to study the molecular evolution of DNA. A comprehensive treatment of phylogenetic methods is given by Felsenstein, who is considered the initiator of statistical phylogenetics, in his seminal book Felsenstein (2003). Algebraic and geometric methods have been employed with great success in the study of phylogenetic models leading to an explosion of related research work and the establishment of the field of phylogenetic algebraic geometry, also known as algebraic phylogenetics; see Allman and Rhodes (2003), Baños et al. (2016), Casanellas and Fernández-Sánchez (2007), Cavender and Felsenstein (1987), Gross and Long (2018), Evans and Speed (1993), Lake (1987), Pachter and Sturmfels (2004) and Sturmfels and Sullivant (2005) for a non-exhaustive list of publications.
To build such a phylogenetic model, we first require a phylogenetic tree T, which is a directed acyclic graph comprising vertices and edges that represent the evolutionary relationships of a group of species. The vertices with valency 1 are called the leaves of the tree. The tree is considered rooted and the direction of evolution is from the root towards the leaves. To each vertex of the tree T, we associate a random variable with k possible states, where in phylogenetics k is often taken to be 2, for the binary states {0, 1}, or 4, to represent the four types of DNA nucleotides {A, C, G, T}. We also require a transition matrix (also known as a mutation matrix) $M^{(e)}$ corresponding to each edge e of the tree, where the entries of this k × k matrix $M^{(e)}$ represent the probabilities of transition between states. In a phylogenetic tree, the leaves correspond to extant species and so the random variables at the leaves are observed, while the interior vertices correspond to possibly extinct species and so the random variables at the interior vertices are hidden.
In this paper we focus on symmetric group-based substitution models. Substitution models are a class of phylogenetic models which use a Markov process to describe the substitution of nucleotides over time in a given DNA sequence and for which the transition matrices along an edge e are stochastic matrices of the form $M^{(e)} = \exp(t_e Q^{(e)})$. Group-based models are a special class of substitution models, in which the matrices $Q^{(e)}$ can be pairwise distinct, but are all simultaneously diagonalizable by a linear change of coordinates given by the discrete Fourier transform of an abelian group, also called a commutative group. For example, the Cavender-Farris-Neyman (CFN) model (Cavender 1978; Farris 1973; Neyman 1971), as well as the Jukes-Cantor (JC) (Jukes and Cantor 1969), the Kimura-2 parameter (K2P) (Kimura 1980) and the Kimura-3 parameter (K3P) (Kimura 1981) models for DNA are all group-based phylogenetic models. In Sturmfels and Sullivant (2005), it is established that through the discrete Fourier transform group-based models correspond to toric varieties, which are geometric objects with nice combinatorial properties. In symmetric group-based substitution models, apart from being pairwise distinct and simultaneously diagonalizable, the rate matrices $Q^{(e)}$ are also symmetric square matrices. The symmetry assumption guarantees that the eigenvalues of rate and transition matrices of a group-based model are real, a property that we use in the proof of our main theorem. Symmetric models are a subset of a special class of models called time-reversible models, where the Markov process appears identical when moving forward or backward in time. Our results apply to group-based models following an ergodic time-reversible Markov process, as in this case the rate matrices Q are symmetric according to Pachter and Sturmfels (2004, Lemma 17.2).
The classical embedding problem is concerned with deciding which square Markov matrices are embeddable, namely, given a Markov matrix M, whether there exists a rate matrix Q such that M = exp(Q). A variant of the embedding problem that asks for a reversible Markov generator for a stochastic matrix is studied in Jia (2016). When we impose the assumption that the rate matrix Q follows the corresponding model conditions, we arrive at a different, refined notion of embeddability called model embeddability. The embeddability of circulant and equal-input stochastic matrices is studied in Baake and Sumner (2020). In the current paper, we focus on (G, L)-embeddability for n × n matrices corresponding to symmetric group-based substitution models. Here (G, L)-embeddability means that we require the rate matrices Q to preserve the symmetric group-based structure imposed by the abelian group G and the symmetric labeling L, which we define at the beginning of Sect. 2. Model embeddability for symmetric group-based models is relevant both for homogeneous and inhomogeneous time-continuous processes, as group-based models are Lie Markov models, and hence multiplicatively closed (Sumner et al. 2012; Verbyla et al. 2013). A study of the set of embeddable and model-embeddable matrices corresponding to the Jukes-Cantor, Kimura-2 and Kimura-3 DNA substitution models, which are all symmetric group-based models, is undertaken in Casanellas et al. (2020a) and Roca-Lacostena and Fernández-Sánchez (2018). In particular, a full characterisation of the set of embeddable 4 × 4 Kimura 2-parameter matrices is provided in Casanellas et al. (2020a), which together with the results of Roca-Lacostena and Fernández-Sánchez (2018) fully solves the embedding problem for the Kimura 3-parameter model as well.
Although model embeddability, which is a refined notion of embeddability imposed by the model structure, implies classical embeddability, the converse is generally not true (see also Roca-Lacostena and Fernández-Sánchez 2018, Example 3.1).
The main result of this paper is a characterization of (G, L)-embeddability for any abelian group G equipped with a symmetric G-compatible labeling function L in Theorem 1. We provide necessary and sufficient conditions which the eigenvalues of the stochastic matrix of the model need to satisfy for the matrix to be (G, L)-embeddable. To showcase our result, we first introduce three group-based models with the underlying group $\mathbb{Z}_2 \times \mathbb{Z}_2 \times \mathbb{Z}_2$, based on the hachimoji DNA system introduced in Hoshika et al. (2019). Hachimoji, a Japanese word meaning "eight letters", is used to describe a synthetic analog of the nucleic acid DNA, where we have the four natural nucleobases {A, C, G, T} and an additional four synthetic nucleotides {P, Z, B, S}. We then apply Theorem 1 to characterise the model embeddability for the three hachimoji DNA models. The three models are called the hachimoji 7-parameter, hachimoji 3-parameter and hachimoji 1-parameter models, which can be thought of as generalisations of the Kimura 3, Kimura 2 and Jukes-Cantor models respectively. Finally, the characterisation of model embeddability in terms of eigenvalues enables us to compute the volume of the (G, L)-embeddable Markov matrices and compare this volume with volumes of other relevant sets of Markov matrices. For the general Jukes-Cantor model, which includes the hachimoji 1-parameter model, the volumes can be derived exactly; for the hachimoji 3-parameter and hachimoji 7-parameter models, they are obtained symbolically and numerically.
The outline of the paper is the following. Section 2 gives a mathematical background covering notions such as labeling functions, group-based models and the discrete Fourier transform. Section 3 introduces symmetric G-compatible labelings, which form a class of labeling functions with particularly nice properties. Section 4 presents the main result of this paper about the model embedding problem for symmetric group-based models equipped with a certain labeling function. Then in Sect. 5 we focus on the hachimoji DNA and provide an exact characterisation of the model embeddability in terms of eigenvalues of the Markov matrix for the hachimoji 7-parameter, the hachimoji 3-parameter and the hachimoji 1-parameter models. Finally, Sect. 6 presents results on the volume of stochastic matrices that are (G, L)-embeddable for the three hachimoji group-based models. The code for the computations in this paper is available at https://github.com/ardiyam1/Model-Embeddability-for-Symmetric-Group-Based-Models.

Preliminaries
In this section, we give background on group-based models and the discrete Fourier transform.
Definition 1 Let G be a finite additive abelian group and $\mathcal{L}$ a finite set. A labeling function is any function $L : G \to \mathcal{L}$.
In the group-based model with underlying finite abelian group G, states are in bijection with the elements of the group G. Fundamental to the definition of a group-based model associated to a finite additive abelian group G and a labeling function L is that the rate of mutation from a state g to a state h depends only on L(h − g). That is, the entries of a rate matrix Q are
$$Q_{g,h} = \psi(h - g)$$
for a vector $\psi \in \mathbb{R}^G$ satisfying $\sum_{g \in G} \psi(g) = 0$, $\psi(g) \ge 0$ for all non-zero $g \in G$ and $\psi(g) = \psi(h)$ whenever $L(g) = L(h)$. We say that such a Q is a (G, L)-rate matrix. In this paper, the rate matrices in group-based models are assumed to be symmetric, i.e., $\psi(-g) = \psi(g)$ for every $g \in G$. Since the matrix exponential of a symmetric matrix is again symmetric, the entries of a transition matrix P = exp(Q) are
$$P_{g,h} = f(h - g)$$
for a vector $f \in \mathbb{R}^G$ satisfying $f(-g) = f(g)$ for every $g \in G$. As we see in Example 2, in general it is not true that $f(g) = f(h)$ whenever $L(g) = L(h)$. In Sect. 3, we introduce G-compatible labeling functions, which guarantee this property, and then we say that P is a (G, L)-Markov matrix.
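The construction of a (G, L)-rate matrix from a vector ψ can be sketched in a few lines. The following is a minimal illustration for G = Z_2 × Z_2, assuming the entry convention Q[g][h] = ψ(h − g); the numerical rates are hypothetical and chosen only so that the constraints on ψ hold.

```python
from itertools import product

# Elements of G = Z_2 x Z_2, in lexicographic order.
G = list(product(range(2), repeat=2))

def diff(h, g):
    """Return h - g in Z_2 x Z_2 (componentwise, modulo 2)."""
    return tuple((a - b) % 2 for a, b in zip(h, g))

def rate_matrix(psi):
    """Return Q as nested lists with Q[g][h] = psi(h - g)."""
    return [[psi[diff(h, g)] for h in G] for g in G]

# Hypothetical rates: (0,1) and (1,0) share a label, and the values sum to zero.
psi = {(0, 0): -4, (0, 1): 1, (1, 0): 1, (1, 1): 2}
Q = rate_matrix(psi)

for row in Q:
    print(row)
```

Every row of Q is a permutation of the values of ψ, so the zero row sums and the symmetry of Q follow directly from the constraints on ψ.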
Let $\mathbb{C}^*$ denote the multiplicative group of complex numbers without zero. A group homomorphism from G to $\mathbb{C}^*$ is called a character of G. The characters of G form a group under multiplication, called the character group of G and denoted by $\hat{G}$. Here the product of two characters $\chi_1, \chi_2$ of the group G is defined by $(\chi_1 \chi_2)(g) = \chi_1(g)\chi_2(g)$ for every $g \in G$. The character group $\hat{G}$ is isomorphic to G. Given a group isomorphism between G and $\hat{G}$, we will denote by $\hat{g} \in \hat{G}$ the image of $g \in G$. For a finite group G, the values of characters are roots of unity.
Lemma 1 (Pachter and Sturmfels 2005, Lemma 17.1) Let $g, h \in G$ and $k \in \mathbb{Z}$. Then $\hat{g}(-h) = \overline{\hat{g}(h)}$ and $\widehat{kg}(h) = \hat{g}(kh)$, where $\overline{a}$ denotes the complex conjugate of $a \in \mathbb{C}$.
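Given a function $a : G \to \mathbb{C}$, its discrete Fourier transform is the function $\check{a} : G \to \mathbb{C}$ defined by
$$\check{a}(g) = \sum_{h \in G} \hat{g}(h)\, a(h).$$

Lemma 2 (Matsen 2008, Section 2) For any real-valued function $a$ on $G$, the identity $\check{a}(-g) = \overline{\check{a}(g)}$ holds for all $g \in G$. Moreover, if $a(-g) = a(g)$ for all $g \in G$, then $\check{a}(-g) = \check{a}(g)$ for all $g \in G$ and $\check{a}$ is a real-valued function.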
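As a quick numerical illustration of Lemma 2, the sketch below computes the discrete Fourier transform on the cyclic group Z_4. It assumes the identification of g with the character h ↦ exp(2πi·gh/n), which is one common choice of isomorphism between G and its character group; the input vector is hypothetical.

```python
import cmath

def dft(a):
    """Discrete Fourier transform on Z_n: a_check(g) = sum_h g^(h) * a(h)."""
    n = len(a)
    return [sum(cmath.exp(2j * cmath.pi * g * h / n) * a[h] for h in range(n))
            for g in range(n)]

a = [0.5, 0.2, 0.1, 0.2]      # symmetric on Z_4: a(1) = a(3) = a(-1)
a_check = dft(a)

for g, value in enumerate(a_check):
    print(g, value)
```

Because a is real and symmetric, the imaginary parts of the transform vanish up to rounding and the transform is itself symmetric, as the lemma predicts.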
In the proof of Theorem 1, we will use that $\check{\psi}$ and $\check{f}$ are real-valued. For this reason, in this paper we consider only group-based models that are equipped with symmetric labeling functions, i.e. L(g) = L(−g) for all g ∈ G. In other words, a symmetric group-based model assumes that the transition matrices are real symmetric matrices.
The discrete Fourier transform is a linear endomorphism on $\mathbb{C}^G$. We will denote its matrix by K. In particular, the entries of K are $K_{g,h} = \hat{g}(h)$ for $g, h \in G$. The matrix K is symmetric for any finite abelian group (Luong 2009, Section 3.2). The inverse of the discrete Fourier transformation matrix is
$$K^{-1} = \frac{1}{|G|} K^*,$$
where $K^*$ denotes the conjugate transpose of K. The following lemma describes the relation between the functions f and ψ in terms of their discrete Fourier transforms.

Lemma 3 (Matsen 2008, Lemma 2.2) Let Q be determined by $\psi \in \mathbb{R}^G$ and P be determined by $f \in \mathbb{R}^G$ as described earlier in this section such that $P = e^Q$. Furthermore, assume that $\psi(g) = \psi(-g)$ and $f(g) = f(-g)$ for all $g \in G$. Then $\check{f}(g) = \exp(\check{\psi}(g))$ for all $g \in G$.

Lemma 4 Let Q be determined by $\psi \in \mathbb{R}^G$ and P be determined by $f \in \mathbb{R}^G$ as described earlier in this section. Furthermore, assume that $\psi(g) = \psi(-g)$ and $f(g) = f(-g)$ for all $g \in G$. Let $K_g$ denote the column of the discrete Fourier transform matrix labeled by g. The eigenpairs of Q (resp. P) are $(\check{\psi}(g), K_g)_{g \in G}$ (resp. $(\check{f}(g), K_g)_{g \in G}$).

Proof This result is stated in the proof of Matsen (2008, Lemma 2.2).
In particular, in the case of a Markov matrix, the column vector of ones is an eigenvector with eigenvalue one. In the case of a rate matrix, the column vector of ones is an eigenvector with eigenvalue zero. A direct consequence of Lemma 4 is that Q and P are diagonalizable by K, i.e. $Q = K D_1 K^{-1}$ and $P = K D_2 K^{-1}$, where $D_1$ and $D_2$ are diagonal matrices with diagonals given by the vectors $\check{\psi}$ and $\check{f}$ of $\mathbb{R}^G$ respectively.
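The eigenpair statement can be checked directly on a small case. The sketch below does so for G = Z_2 × Z_2, assuming the usual characters (−1)^(g·h) and the entry convention Q[g][h] = ψ(h − g); the rates are hypothetical and integer-valued so that the comparison is exact.

```python
from itertools import product

G = list(product(range(2), repeat=2))

def chi(g, h):
    """Character value g^(h) = (-1)^(g . h) on Z_2 x Z_2."""
    return (-1) ** sum(a * b for a, b in zip(g, h))

def add(g, h):
    """Group operation; note that h - g = h + g in Z_2 x Z_2."""
    return tuple((a + b) % 2 for a, b in zip(g, h))

psi = {(0, 0): -4, (0, 1): 1, (1, 0): 1, (1, 1): 2}     # hypothetical rates
Q = [[psi[add(g, h)] for h in G] for g in G]

# Discrete Fourier transform of psi: the candidate eigenvalues of Q.
psi_check = {g: sum(chi(g, h) * psi[h] for h in G) for g in G}

def is_eigenpair(g):
    """Check Q * K_g == psi_check(g) * K_g for the column K_g of K."""
    col = [chi(h, g) for h in G]
    image = [sum(Q[i][j] * col[j] for j in range(len(G))) for i in range(len(G))]
    return image == [psi_check[g] * c for c in col]

print(psi_check, all(is_eigenpair(g) for g in G))
```

The transform of ψ vanishes at the identity element, matching the statement that the vector of ones is an eigenvector of a rate matrix with eigenvalue zero.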

G-compatible labeling functions
In this section, we introduce a class of labeling functions with the property that the symmetries of the probability vector are preserved under the discrete Fourier transformation. This property is required for any result that is proven using the discrete Fourier transform. Notably, labeling functions for all common group-based models (CFN, K3P, K2P, and JC models) are G-compatible.
Definition 2 Let G be a finite additive abelian group, $\mathcal{L}$ a set and $L : G \to \mathcal{L}$ a labeling function. Let K be the discrete Fourier transformation matrix for G and $x_L$ be the column vector of length |G| whose g-th component is the indeterminate $x_{L(g)}$. We say that L is a G-compatible labeling function if for every $g, h \in G$ with $L(g) = L(h)$, we have that $K_{g,:} \cdot x_L = K_{h,:} \cdot x_L$ and $(K^{-1})_{g,:} \cdot x_L = (K^{-1})_{h,:} \cdot x_L$. Here $M_{a,:}$ denotes the row of M indexed by the group element a.
A labeling function that maps every group element to a different label is trivially G-compatible.
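The definition can be tested mechanically. The sketch below is a minimal checker for groups of the form Z_2^k, where K has entries (−1)^(g·h) and K^{-1} = K/|G|, so that checking the rows of K suffices. The label names are arbitrary placeholders, and the third labeling is a hypothetical non-example in which the identity shares its label with another element.

```python
from itertools import product

def row_form(g, labeling, G):
    """Coefficients of the linear form K_{g,:} . x_L, collected per label."""
    form = {}
    for h in G:
        sign = (-1) ** sum(a * b for a, b in zip(g, h))
        form[labeling[h]] = form.get(labeling[h], 0) + sign
    return form

def is_g_compatible(labeling, k):
    """True if rows of K indexed by equally labeled elements give equal forms."""
    G = list(product(range(2), repeat=k))
    forms = {g: row_form(g, labeling, G) for g in G}
    return all(forms[g] == forms[h]
               for g in G for h in G if labeling[g] == labeling[h])

jc = {(0, 0): "a", (0, 1): "b", (1, 0): "b", (1, 1): "b"}
k2p = {(0, 0): "a", (0, 1): "b", (1, 0): "b", (1, 1): "c"}
k3p = {(0, 0): "a", (0, 1): "b", (1, 0): "c", (1, 1): "d"}
bad = {(0, 0): "a", (0, 1): "b", (1, 0): "c", (1, 1): "a"}

for name, lab in [("JC", jc), ("K2P", k2p), ("K3P", k3p), ("bad", bad)]:
    print(name, is_g_compatible(lab, 2))
```

For groups with complex characters the same idea applies, but the rows of K^{-1} would have to be handled as well, or the symmetry of the labeling used to reduce to K.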

Remark 1
In the definition of a G-compatible labeling, we require that the matrices $K$ and $K^{-1}$ preserve the symmetries of the vector of labels $x_L$. For symmetric group-based models, it is enough to require that only $K$ or $K^{-1}$ preserves the symmetries of the vector of labels $x_L$. Recall that $K^{-1} = \frac{1}{|G|} K^*$, where $K^*$ is the conjugate transpose of $K$. Since $K$ is symmetric and, by Lemma 1, $\overline{K} x_L = K x_L$ for a symmetric labeling, we get $K^{-1} x_L = \frac{1}{|G|} K x_L$.

Example 3 To show G-compatibility for the three labeling functions from Example 1, it is enough to check that K preserves the symmetries of $x_L$. The labeling function of the Jukes-Cantor model is G-compatible, since the rows of K indexed by the three nonzero elements of $\mathbb{Z}_2 \times \mathbb{Z}_2$ give the same linear form when applied to $x_L$. The labeling function of the Kimura 2-parameter model is G-compatible, since the rows of K indexed by (0, 1) and (1, 0) give the same linear form when applied to $x_L$. In the literature, usually L((1, 0)) = L((1, 1)) in the Kimura 2-parameter model. However, here we assume that L((1, 0)) = L((0, 1)), which is simply due to the fact that we consider the identification A = (0, 0), T = (0, 1), C = (1, 0) and G = (1, 1). To be more precise, nucleotide bases fall into two categories depending on the molecular mechanisms of the base: purines (A or G) and pyrimidines (C or T). A transition occurs when a purine is substituted by a purine, or a pyrimidine by a pyrimidine. A change from a purine to a pyrimidine, or vice versa, is a transversion. The Kimura 2-parameter model of sequence evolution distinguishes between transitions and transversions to account for the biological fact that transitions occur at a higher rate than transversions (Kimura 1980, 1981). With rows and columns ordered A, T, C, G, the rate and transition matrix of the Kimura 2-parameter model have the form
$$\begin{pmatrix} a & b & b & c \\ b & a & c & b \\ b & c & a & b \\ c & b & b & a \end{pmatrix}.$$
The reason for choosing this identification and ordering is that we can use the discrete Fourier transform matrix in a format which better demonstrates that it is the Kronecker product of discrete Fourier transformation matrices for $\mathbb{Z}_2$. The labeling function of the Kimura 3-parameter model is G-compatible, because each group element maps to a different label.

Sturmfels and Sullivant consider a different class of labeling functions, called friendly labelings (Sturmfels and Sullivant 2005).
Group-based models with friendly labeling functions are equivalent to G-models defined by Michałek (2011, Remark 5.2). G-models are constructed using an arbitrary group G that has a normal, abelian subgroup H which acts transitively and freely on the finite set of states. The importance of G-models is that they are toric. We explore connections between friendly labelings and G-compatible labelings in "Appendix A". We conjecture that every symmetric G-compatible labeling is a friendly labeling, but not vice versa.
The following lemma provides a necessary condition for G-compatible labeling functions.
Lemma 5 Let G be a finite additive abelian group, $\mathcal{L}$ a set and $L : G \to \mathcal{L}$ a labeling function. If L is G-compatible, then $L(0) \neq L(g)$ for any $g \neq 0$.
Proof Let K be the discrete Fourier transformation matrix for G. The entries of K are $\hat{g}(h)$ for $g, h \in G$, which are roots of unity. The row $K_{0,:}$ consists of ones. On the other hand, no other row of K consists of ones only, as this would contradict the uniqueness of the identity element in the character group. In particular, every other row of K contains at least one element whose real part is strictly less than one. Thus it is impossible that $K_{0,:} \cdot x_L = K_{g,:} \cdot x_L$ for $g \neq 0$.
Table 1 summarizes up to isomorphism all possible symmetric G-compatible labeling functions for additive abelian groups of order up to eight. In the table, two group elements receive the same label if they belong to the same subset in a partition of G.
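We saw in Example 3 that the labeling function of the Jukes-Cantor model that assigns the same label to each nonzero element of the group $G = \mathbb{Z}_2 \times \mathbb{Z}_2$ is a G-compatible labeling. This example can be generalized to other groups.

Lemma 6 Let G be a finite additive abelian group of order n and $L : G \to \{0, 1\}$ the labeling function such that $L(0) = 0$ and $L(g) = 1$ for all $g \neq 0$. Then L is a symmetric G-compatible labeling function.

Proof Clearly the labeling function L is symmetric. Let K be the discrete Fourier transformation matrix for G. By Luong (2009, Corollary 3.2.1), we have
$$\sum_{h \in G} K_{g,h} = \begin{cases} n, & g = 0, \\ 0, & g \neq 0. \end{cases}$$
Hence, since $K_{g,0} = 1$ for all $g \in G$,
$$K_{g,:} \cdot x_L = x_0 + \Big(\sum_{h \neq 0} K_{g,h}\Big) x_1 = \begin{cases} x_0 + (n-1)\, x_1, & g = 0, \\ x_0 - x_1, & g \neq 0. \end{cases}$$
Hence L is a G-compatible labeling function.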
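The character-sum argument can be illustrated numerically for a group with genuinely complex characters. The sketch below takes G = Z_5, assumes the characters h ↦ exp(2πi·gh/n), and computes the coefficients of the two label indeterminates in each row of K applied to the label vector; the choice n = 5 is arbitrary.

```python
import cmath

def row_form_jc(g, n):
    """Coefficients of (x_0, x_1) in K_{g,:} . x_L on Z_n,
    for the labeling L(0) = 0 and L(h) = 1 for h != 0."""
    chars = [cmath.exp(2j * cmath.pi * g * h / n) for h in range(n)]
    return chars[0], sum(chars[1:])

n = 5
forms = [row_form_jc(g, n) for g in range(n)]

for g, (c0, c1) in enumerate(forms):
    print(g, c0, c1)
```

Up to rounding, the row of the identity gives coefficients (1, n − 1), while every other row gives (1, −1), so all nonzero elements produce the same linear form.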
We call the model in Lemma 6 the general Jukes-Cantor model. We finish this section by giving another class of labeling functions that are G-compatible for every finite abelian group G.
Lemma 7 Let G be a finite additive abelian group, $\mathcal{L}$ a finite set and $L : G \to \mathcal{L}$ a labeling function such that for any two distinct elements g, h ∈ G, L(g) = L(h) if and only if g = −h. Then L is G-compatible.
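Proof By Lemma 1, the identity $\widehat{-g}(h) = \hat{g}(-h)$ holds for all $g, h \in G$. Then
$$K_{-g,:} \cdot x_L = \sum_{h \in G} \hat{g}(-h)\, x_{L(h)} = \sum_{h \in G} \hat{g}(h)\, x_{L(-h)} = \sum_{h \in G} \hat{g}(h)\, x_{L(h)} = K_{g,:} \cdot x_L,$$
where the third equality uses $L(-h) = L(h)$. The same computation applies to $K^{-1}$. Thus, the labeling function L is G-compatible, as $x_L$ is the column vector of indeterminates $x_{L(g)}$.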
The converse of Lemma 7 is not true in general. Two examples are given by the labeling functions for the Kimura 2-parameter and the Jukes-Cantor model.

Model embeddability
The following theorem is the main result of this paper. It characterizes (G, L)-embeddable transition matrices in terms of their eigenvalues.
Proof We start by summarizing the idea of the proof. We consider the set $\Psi_{G,L}$ that consists of vectors ψ that determine (G, L)-rate matrices. Our goal is to characterize the set $\check{F}_{G,L}$ of eigenspectra of Markov matrices that are matrix exponentials of (G, L)-rate matrices determined by vectors ψ in $\Psi_{G,L}$. The first step is to consider the discrete Fourier transform of the set $\Psi_{G,L}$, which we denote by $\check{\Psi}_{G,L}$. By Lemma 4, this set is the set of eigenvalues of the (G, L)-rate matrices. The second step is to consider the image of the set $\check{\Psi}_{G,L}$ under coordinatewise exponentiation. This set is precisely $\check{F}_{G,L}$, because (G, L)-rate matrices are diagonalizable by the discrete Fourier transform matrix K by the discussion after Lemma 4, and thus if a (G, L)-rate matrix Q is determined by $\psi \in \mathbb{R}^G$ then
$$P = \exp(Q) = K \operatorname{diag}\big(e^{\check{\psi}}\big) K^{-1},$$
where $\check{\psi}$ is the vector of eigenvalues of Q and $e^{\check{\psi}}$ is the vector of eigenvalues of P.
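More specifically, let
$$\Psi_{G,L} = \Big\{\psi \in \mathbb{R}^G : \sum_{g \in G} \psi(g) = 0,\ \psi(g) \ge 0 \text{ for all nonzero } g \in G, \text{ and } \psi(g) = \psi(h) \text{ whenever } L(g) = L(h)\Big\}.$$
The vectors in the set $\Psi_{G,L}$ are in one-to-one correspondence with (G, L)-rate matrices. The image of $\Psi_{G,L}$ under the discrete Fourier transform is the set
$$\check{\Psi}_{G,L} = \big\{\check{\psi} \in \mathbb{R}^G : \check{\psi}(0) = 0,\ (K^{-1}\check{\psi})(g) \ge 0 \text{ for all nonzero } g \in G, \text{ and } \check{\psi}(g) = \check{\psi}(h) \text{ whenever } L(g) = L(h)\big\}.$$
By Lemma 4, this set is the set of eigenvalues of the (G, L)-rate matrices. The image of $\check{\Psi}_{G,L}$ under coordinatewise exponentiation is the set of eigenvalues of the (G, L)-Markov matrices, which we denote by $\check{F}_{G,L}$. We claim that $\check{F}_{G,L}$ is equal to the set
$$\big\{\check{f} \in \mathbb{R}^G : \check{f}(0) = 1,\ \check{f}(g) > 0 \text{ for all } g \in G,\ \check{f}(g) = \check{f}(h) \text{ whenever } L(g) = L(h), \text{ and (4.2) holds}\big\}, \tag{4.1}$$
where
$$\prod_{h \in G} \check{f}(h)^{(K^{-1})_{g,h}} \ge 1 \quad \text{for all nonzero } g \in G. \tag{4.2}$$
Indeed, let $\check{f} = \exp(\check{\psi})$. Then $\check{f} > 0$ because the image of the exponentiation map is positive. The inequality $a^T x \ge 0$ is equivalent to $\exp(a^T x) \ge 1$. Hence, the equation $\check{\psi}(0) = 0$ gives $\check{f}(0) = 1$ and the inequalities $(K^{-1}\check{\psi})(g) \ge 0$ give (4.2) for all nonzero $g \in G$. Hence $\check{f}$ is in the set (4.1). Conversely, let $\check{f}$ be a vector in the set (4.1). Then under coordinatewise logarithm, $\log(\check{f}) \in \check{\Psi}_{G,L}$ and $\check{f} = \exp(\log(\check{f}))$. Hence $\check{f}$ is in the image of $\check{\Psi}_{G,L}$. Thus $\check{F}_{G,L}$ is equal to the set (4.1).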
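It is left to rewrite the inequalities (4.2) as in the statement of the theorem. We have
$$(K^{-1})_{g,-h} = \overline{(K^{-1})_{g,h}}$$
for all $g, h \in G$. Here we use Lemma 1 and the definition of the discrete Fourier transformation matrix. If $-h = h$, then $(K^{-1})_{g,h} = (K^{-1})_{g,-h} = \overline{(K^{-1})_{g,h}}$, and hence $(K^{-1})_{g,h}$ is real. If $-h \neq h$, then $\check{f}(-h) = \check{f}(h)$ by Lemma 2, and
$$\check{f}(h)^{(K^{-1})_{g,h}}\, \check{f}(-h)^{(K^{-1})_{g,-h}} = \check{f}(h)^{2 \operatorname{Re}((K^{-1})_{g,h})}. \tag{4.3}$$
Since only the real parts of the exponents remain, we may replace $K^{-1}$ by $\frac{1}{|G|} K$ and take both sides of the resulting inequality to the power |G|. Finally, making the substitution $\lambda_h = \check{f}(h)$ gives the desired characterization.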
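For groups of the form Z_2^k the matrix K has entries ±1 and K^{-1} = K/|G|, so the inequalities obtained in the proof reduce to λ_0 = 1, λ_g > 0 for all g, and ∏_h λ_h^{K[g][h]} ≥ 1 for every nonzero g. The sketch below tests these conditions on the logarithmic scale. It is a reading of the proof specialised to Z_2^k rather than a verbatim implementation of the theorem, it assumes the input already is the eigenvalue vector of a (G, L)-Markov matrix, and the function name and sample spectra are hypothetical.

```python
import math
from itertools import product

def is_model_embeddable(lam, k, tol=1e-12):
    """lam maps each element of Z_2^k (a 0/1 tuple) to an eigenvalue of P."""
    G = list(product(range(2), repeat=k))
    zero = G[0]
    if abs(lam[zero] - 1.0) > tol or any(lam[g] <= 0 for g in G):
        return False
    for g in G[1:]:
        # Logarithm of prod_h lam_h ** K[g][h], with K[g][h] = (-1)^(g . h).
        s = sum((-1) ** sum(a * b for a, b in zip(g, h)) * math.log(lam[h])
                for h in G)
        if s < -tol:
            return False
    return True

# Hypothetical eigenvalue vectors on Z_2 x Z_2.
spectrum_ok = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.5}
spectrum_bad = {(0, 0): 1.0, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.1}

print(is_model_embeddable(spectrum_ok, 2), is_model_embeddable(spectrum_bad, 2))
```

Working with sums of logarithms instead of products avoids overflow and underflow when some eigenvalues are very small.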
For G cyclic, Theorem 1 has been independently proven by Baake and Sumner in the context of circulant matrices (Baake and Sumner 2020, Theorem 5.7). Moreover, they show that every embeddable circulant matrix is circulant embeddable (Baake and Sumner 2020, Corollary 5.2).
It follows from Lemma 3 that if a (G, L)-Markov matrix P is (G, L)-embeddable, then there exists a unique (G, L)-rate matrix Q such that P = exp(Q). Indeed, since Q and P both have real eigenvalues and the eigenvalues of P are exponentials of the eigenvalues of Q, the eigenvalues of Q are uniquely determined by the eigenvalues of P. Then the (G, L)-rate matrix Q is the principal logarithm of P.
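The uniqueness argument is constructive: taking logarithms of the eigenvalues and applying the inverse Fourier transform recovers the rate vector. The sketch below does this for G = Z_2^k, where K^{-1} = K/|G|; the function name and the sample spectrum are hypothetical.

```python
import math
from itertools import product

def rates_from_eigenvalues(lam, k):
    """Return psi = K^{-1} log(lam) for G = Z_2^k, using K^{-1} = K / |G|."""
    G = list(product(range(2), repeat=k))
    return {g: sum((-1) ** sum(a * b for a, b in zip(g, h)) * math.log(lam[h])
                   for h in G) / len(G)
            for g in G}

# Hypothetical eigenvalues of a transition matrix on Z_2 x Z_2.
lam = {(0, 0): 1.0, (0, 1): math.exp(-6), (1, 0): math.exp(-6), (1, 1): math.exp(-4)}
psi = rates_from_eigenvalues(lam, 2)

print(psi)
```

The recovered vector defines a rate matrix exactly when its entries at nonzero group elements are non-negative, which is the content of the inequalities in the proof above.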
The inequalities $\lambda_g > 0$ in Theorem 1 imply $\det(P) = \prod_{g \in G} \lambda_g > 0$. Hence the set of (G, L)-embeddable matrices for a symmetric group-based model is a relatively closed subset of a connected component of the complement of det(P) = 0. A relatively closed subset means here a set that can be written as the intersection of a closed subset of $\mathbb{R}^{G \times G}$ and the connected component of the complement of det(P) = 0.
In the rest of the current section and in Sect. 5, we will discuss applications of Theorem 1. We will recover known results about (G, L)-embeddability and as a novel application characterize embeddability for three group-based models of hachimoji DNA.

Example 4
The CFN model is the group-based model associated to the group $\mathbb{Z}_2$. The CFN Markov matrices have the form
$$P = \begin{pmatrix} a & b \\ b & a \end{pmatrix}.$$
The discrete Fourier transform matrix is
$$K = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.$$
The eigenvalues of P are $\lambda_0 = a + b = 1$ and $\lambda_1 = a - b$. By Theorem 1, the Markov matrix P is CFN embeddable if and only if $0 < \lambda_1 \le 1$, or equivalently $0 < a - b \le 1$. This is equivalent to P satisfying det(P) > 0, or equivalently tr(P) > 1. The result that a general 2 × 2 stochastic matrix is embeddable if and only if det(P) > 0 or tr(P) > 1 goes back to Kingman (1962, Proposition 2). Hence P is CFN embeddable if and only if it is embeddable.
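The CFN case is small enough to work out in closed form. A symmetric 2 × 2 rate matrix with off-diagonal rate r has eigenvalues 0 and −2r, so its exponential has eigenvalues 1 and exp(−2r). The sketch below inverts this relation; it assumes a + b = 1, and the function names are hypothetical.

```python
import math

def cfn_rate(a, b):
    """Return the rate r of Q = [[-r, r], [r, -r]] with exp(Q) = [[a, b], [b, a]],
    or None if the matrix is not CFN embeddable."""
    lam1 = a - b
    if not 0 < lam1 <= 1:
        return None
    return -math.log(lam1) / 2

def cfn_transition(r):
    """Entries (a, b) of exp(Q) in closed form, from the eigenvalues 1 and exp(-2r)."""
    e = math.exp(-2 * r)
    return (1 + e) / 2, (1 - e) / 2

r = cfn_rate(0.8, 0.2)
print(r, cfn_transition(r))
print(cfn_rate(0.4, 0.6))
```

The second call returns None because a − b is negative, which corresponds to det(P) < 0.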
The K3P embeddability of a K3P Markov matrix with no repeated eigenvalues is equivalent to the embeddability of the matrix. Similarly, the JC embeddability of a JC Markov matrix is equivalent to the embeddability of the matrix. The same is not true for K2P Markov matrices with exactly two coinciding eigenvalues. See Roca-Lacostena and Fernández-Sánchez (2018, Section 3) for similar computations and further discussion on the model embeddability of K3P, K2P, and JC Markov matrices.
Remark 3 By Kingman (1962, Corollary on page 18), the map from rate matrices to transition matrices is locally homeomorphic except possibly when the rate matrix has a pair of eigenvalues differing by a non-zero multiple of 2πi. Since for symmetric group-based models rate matrices are real symmetric, all their eigenvalues are real and hence the map from rate matrices to transition matrices is a homeomorphism. Therefore the boundaries of embeddable transition matrices of symmetric group-based models are images of the boundaries of the rate matrices. For a general Markov model, the boundaries of embeddable transition matrices are characterized in Kingman (1962, Propositions 5 and 6).

Corollary 1 A (G, L)-embeddable transition matrix lies on the boundary of the set of (G, L)-embeddable transition matrices for a symmetric group-based model if and only if it satisfies at least one of the inequalities in Theorem 1 with equality.

Hachimoji DNA
In this section, we suggest three group-based models for a genetic system with eight building blocks recently introduced by Hoshika et al. (2019), and then characterize model embeddability for the proposed group-based models. The genetic system is called hachimoji DNA. It has four synthetic nucleotides denoted S, B, Z, and P in addition to the standard nucleotides adenine (A), cytosine (C), guanine (G) and thymine (T). Detailed descriptions of the four additional nucleotides are given in Hoshika et al. (2019). In standard 4-letter DNA the purines are A and G and the pyrimidines are C and T; the hachimoji system additionally has the purine analogs P and B and the pyrimidine analogs Z and S. The hydrogen bonds occur between the pairs A-T, C-G, S-B and Z-P. This DNA genetic system with eight building blocks can reliably form matching base pairs and can be read and translated into RNA. It is mutable without damaging the crystal structure, which is required for molecular evolution. Hachimoji DNA has potential applications in bar-coding, retrievable information storage, and self-assembling nanostructures.
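The discrete Fourier transformation matrix of the group $\mathbb{Z}_2 \times \mathbb{Z}_2 \times \mathbb{Z}_2$ is
$$K = H \otimes H \otimes H, \qquad H = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \tag{5.1}$$
that is, $K_{g,h} = (-1)^{g \cdot h}$ for $g, h \in \mathbb{Z}_2 \times \mathbb{Z}_2 \times \mathbb{Z}_2$, with the group elements in lexicographic order.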
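The 8 × 8 matrix can be generated rather than typed out. The sketch below builds it as a threefold Kronecker product of the 2 × 2 transform for Z_2, assuming the group elements are listed in lexicographic order so that row index i corresponds to the binary digits of i.

```python
def kron(A, B):
    """Kronecker product of two matrices given as nested lists."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

H = [[1, 1], [1, -1]]          # discrete Fourier transform matrix of Z_2
K = kron(kron(H, H), H)        # discrete Fourier transform matrix of Z_2 x Z_2 x Z_2

for row in K:
    print(row)
```

The result is symmetric with entries ±1 and satisfies K·K = 8·I, which is the statement K^{-1} = K/|G| used for this group.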

Hachimoji 7-parameter model
The first model we propose is the analogue of the Kimura 3-parameter model and we will call it the hachimoji 7-parameter (H7P) model. In the hachimoji 7-parameter model, each element of the group $\mathbb{Z}_2 \times \mathbb{Z}_2 \times \mathbb{Z}_2$ maps to a distinct label. Thus the labeling function is trivially G-compatible. In the H7P rate and transition matrices, the entry in position (g, h) depends only on h − g, with a distinct parameter for each of the eight group elements.

Hachimoji 3-parameter model
The second model we suggest specializes to the Kimura 2-parameter model when restricted to the standard 4-letter DNA. We will call it the hachimoji 3-parameter (H3P) model. We recall that in the Kimura 2-parameter model there are three distinct parameters for the rates of mutation: one parameter for a state remaining unchanged, one parameter for transversion from a purine base to a pyrimidine base or vice versa, and one parameter for transition to the other purine or to the other pyrimidine. We say that two bases are of the same type if they are both standard or both synthetic bases. In the hachimoji 3-parameter model, there are the following parameters:
a: the probability of a state remaining unchanged.
b: the probability of a transversion from a purine base to a pyrimidine base or vice versa.
c: the probability of a transition to another purine or pyrimidine base of the same type (same type transitions).
d: the probability of a transition to another purine or pyrimidine base of different type (different type transitions).
The H3P rate and transition matrices have the form given in (5.3). The corresponding labeling function is G-compatible by Table 1. By Theorem 1, an H3P Markov matrix P is H3P embeddable if and only if the eigenvalues of P satisfy the conditions (5.4).

Hachimoji 1-parameter model
The third model we suggest is the analogue of the Jukes-Cantor model and we will refer to it as the hachimoji 1-parameter (H1P) model. It is the simplest group-based model associated to the group $\mathbb{Z}_2 \times \mathbb{Z}_2 \times \mathbb{Z}_2$ and it is described by only two distinct parameters for the rates of mutation: one for a state remaining the same and one for a state mutating to any other state. The corresponding labeling function is G-compatible by Lemma 6. The H1P rate and transition matrices have the form
$$\begin{pmatrix}
a & b & b & b & b & b & b & b \\
b & a & b & b & b & b & b & b \\
b & b & a & b & b & b & b & b \\
b & b & b & a & b & b & b & b \\
b & b & b & b & a & b & b & b \\
b & b & b & b & b & a & b & b \\
b & b & b & b & b & b & a & b \\
b & b & b & b & b & b & b & a
\end{pmatrix}.$$
The eigenvalues of an H1P Markov matrix are $w := \lambda_{(0,0,0)} = 1$ and $x := \lambda_g = a - b$ for $g \neq 0$. By Theorem 1, such a matrix is H1P embeddable if and only if its eigenvalues satisfy
$$w = 1 \quad \text{and} \quad 1 \ge x > 0. \tag{5.6}$$

Remark 4
The same conditions as in (5.6) characterize model embeddability for the general Jukes-Cantor model as defined in Lemma 6. This is also a special instance of a more general result (Baake and Sumner 2020, Corollary 4.7) on equal-input embeddability. If the order of G is even, then the notion of general embeddability is equivalent to the notion of model embeddability for the general Jukes-Cantor models by Baake and Sumner (2020, Theorem 4.6).

Volume
In this section we compute the relative volumes of model embeddable Markov matrices within some meaningful subsets of Markov matrices by taking advantage of the characterisation of embeddability in terms of eigenvalues. The aim of this section is to describe how large the different sets of matrices are compared to each other and to provide intuition on how restrictive the hypothesis of homogeneous continuous-time models is.
We will focus on the hachimoji models and the generalization of the Jukes-Cantor model. We will use the following notation:
(i) $\Delta$ is the set of all Markov matrices in a model.
(ii) $\Delta_{+}$ is the subset of matrices in $\Delta$ with only positive eigenvalues.
(iii) $\Delta_{dd}$ is the subset of diagonally dominant matrices in $\Delta$, i.e. matrices in $\Delta$ such that in each row the diagonal entry is greater than or equal to the sum of all other entries in the same row.
(iv) $\Delta_{me}$ is the subset of model embeddable transition matrices in $\Delta$.
Biologically, the subspace $\Delta_{dd}$ of diagonally dominant matrices consists of matrices with probability of not mutating at least as large as the probability of mutating. If a diagonally dominant matrix is embeddable, it has an identifiable rate matrix (Cuthbert 1972, 1973), namely a unique Markov generator, which is crucial for proving the consistency of many phylogenetic reconstruction methods, such as those based on maximum likelihood methods (Casanellas et al. 2020c; Chang 1996). What is more, the set $\Delta_{+}$ of Markov matrices with positive eigenvalues includes the multiplicative closure of the transition matrices in the continuous-time version of the model (Sumner et al. 2012). We have the inclusions $\Delta_{me} \subseteq \Delta_{+} \subseteq \Delta$ and $\Delta_{dd} \subseteq \Delta_{+}$. The volumes of these spaces are given for the Kimura 3-parameter model in Roca-Lacostena and Fernández-Sánchez (2018, Theorem 4.1), for the Kimura 2-parameter model in Casanellas et al. (2020b, Proposition 5.1) and for the Jukes-Cantor model in Roca-Lacostena and Fernández-Sánchez (2018, Section 4).
The subsets $\Delta$, $\Delta_{+}$, $\Delta_{dd}$, and $\Delta_{me}$ can be described using the parameterization in terms of the entries of the Markov matrix or in terms of its eigenvalues. We parameterize the relevant subsets of Markov matrices in terms of the eigenvalues of the Markov matrices and compute the volumes using these parameterizations. If $\phi$ denotes the bijection from the set of entries of a Markov matrix in a particular model to the set of its eigenvalues and $J(\phi)$ denotes the Jacobian matrix of the map $\phi$, then the volume of any subset in the parameterization using eigenvalues is $|\det(J(\phi))|$ times its volume in the parameterization using entries of the Markov matrix. Since the determinant of this Jacobian is constant for each of the three models we consider, the relative volumes of the set of model embeddable Markov matrices do not depend on the parameterization chosen.
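The invariance of relative volumes can be written out in one line. The following is a minimal sketch in which the constant $c > 0$ and the nested subsets $S \subseteq T$ are notation introduced here for illustration; it only assumes that the volumes in the two parameterizations differ by the same constant factor on every subset.

```latex
\[
\frac{V_{\mathrm{entries}}(S)}{V_{\mathrm{entries}}(T)}
  = \frac{c \, V_{\mathrm{eig}}(S)}{c \, V_{\mathrm{eig}}(T)}
  = \frac{V_{\mathrm{eig}}(S)}{V_{\mathrm{eig}}(T)} .
\]
```

For instance, in the general Jukes-Cantor model discussed at the end of this section, the only non-trivial eigenvalue is $1 - nb$, so the map $b \mapsto 1 - nb$ has constant derivative $-n$ and the factor cancels in every ratio.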
We are not able to compute the volume of the set of H7P embeddable Markov matrices exactly. Instead we estimate the volume using the hit-and-miss Monte Carlo integration method (Hammersley 2013) implemented in Mathematica. Table 2 summarizes the estimated volume for various numbers of sample points. Table 3 gives relative volumes for the relevant sets.
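To illustrate the hit-and-miss idea, here is a minimal Python sketch; it is not the Mathematica code used for the tables. Because the H7P inequalities are not restated here, the sketch uses the one-parameter general Jukes-Cantor model from the end of this section as a stand-in, with $n = 8$ states. It assumes the Markov constraint $0 \le b \le 1/(n-1)$ and the embeddability condition $1 - nb > 0$. The function name `hit_and_miss_fraction` and its parameters are hypothetical.

```python
import random


def hit_and_miss_fraction(n_states, samples, seed=0):
    """Estimate the fraction of general Jukes-Cantor Markov matrices
    that are model embeddable, by uniform sampling of the parameter b.

    Assumptions (stand-in model, not the H7P model):
      * Markov matrices correspond to 0 <= b <= 1/(n-1);
      * a matrix is model embeddable iff its eigenvalue 1 - n*b is > 0.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    b_max = 1.0 / (n_states - 1)       # largest b with a = 1-(n-1)b >= 0
    hits = 0
    for _ in range(samples):
        b = rng.uniform(0.0, b_max)    # a uniformly random Markov matrix
        if 1.0 - n_states * b > 0.0:   # "hit": the embeddability test passes
            hits += 1
    return hits / samples


if __name__ == "__main__":
    n = 8  # eight states, as in the hachimoji 1-parameter model
    exact = (n - 1) / n
    for samples in (10**3, 10**4, 10**5):
        estimate = hit_and_miss_fraction(n, samples)
        print(f"{samples:>7d} samples: estimate {estimate:.4f}, exact {exact:.4f}")
```

The same pattern carries over to a higher-dimensional model by sampling a vector of parameters from a bounding box and replacing the single inequality with the full list of conditions; the statistical error shrinks like the inverse square root of the number of sample points.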
The volumes of $\Delta_{me}$ and $\Delta_{me} \cap \Delta_{+}$ are estimated using Monte Carlo integration with $10^6$ sample points.
Proof The entries of an H3P Markov matrix as in (5.3) can be expressed in terms of the eigenvalues. Expressing all conditions defining $\Delta$, $\Delta_{+}$, $\Delta_{dd}$, and $\Delta_{me}$ in terms of $x, y, z$ allows us to use the Integrate command in Mathematica to compute the desired volumes. For $V(\Delta_{me} \cap \Delta_{dd})$ we used the numerical integration command NIntegrate.
The sets $\Delta_{me}$, $\Delta_{+}$ and $\Delta$ for the hachimoji 3-parameter model are depicted in Fig. 1. The relative volumes of relevant sets are given in Table 4.
Finally, we discuss the generalization of the Jukes-Cantor model which includes the hachimoji 1-parameter model. Let $G$ be a finite abelian group of order $n$ and $L : G \to \{0, 1\}$ a labeling function such that $L(0) = 0$ and $L(g) = 1$ for $g \neq 0$. In Lemma 6 we proved that $L$ is a $G$-compatible labeling. In the general Jukes-Cantor model, the transition matrix $P$ corresponding to this labeling has the form
\[ P = \begin{pmatrix} a & b & \cdots & b \\ b & a & \cdots & b \\ \vdots & \vdots & \ddots & \vdots \\ b & b & \cdots & a \end{pmatrix}. \]
Since $P$ is a Markov matrix, $a = 1 - (n - 1)b$, and thus $P$ is parameterized by $b$.
Proposition 3 For the general Jukes-Cantor model, consider $\Delta$, $\Delta_{+}$, $\Delta_{dd}$, $\Delta_{me}$ as subsets of $\mathbb{R}$ parameterized by $b$, the off-diagonal element of the Markov matrix. Then: (iv) By Remark 4, a Markov matrix is general Jukes-Cantor embeddable if and only if the eigenvalue $1 - nb$ satisfies $1 \ge 1 - nb > 0$. Since $1 \ge 1 - nb$ necessarily holds for any Markov matrix, we have $\Delta_{me} = \Delta_{+}$.
The relative volumes of relevant sets for the general Jukes-Cantor model are presented in Table 5. Proposition 3 applies in particular to the hachimoji 1-parameter model, for which $n = 8$.
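The one-parameter case is small enough to work out by hand. The derivation below is a reconstruction from the definitions in this section, not a quotation of Proposition 3. It assumes that volume means length in the parameter $b$, and it writes $\Delta$, $\Delta_{+}$, $\Delta_{dd}$, $\Delta_{me}$ for the four sets introduced at the start of the section (the base symbol $\Delta$ is an assumption made here).

```latex
\begin{align*}
\Delta      &:\ b \ge 0,\ a = 1-(n-1)b \ge 0
            &&\Longleftrightarrow\quad 0 \le b \le \tfrac{1}{n-1}, \\
\Delta_{+}  &:\ 1 - nb > 0
            &&\Longleftrightarrow\quad 0 \le b < \tfrac{1}{n}, \\
\Delta_{dd} &:\ a \ge (n-1)b
            &&\Longleftrightarrow\quad 0 \le b \le \tfrac{1}{2(n-1)}, \\
\Delta_{me} &= \Delta_{+}.
\end{align*}
\[
\frac{V(\Delta_{me})}{V(\Delta)} = \frac{1/n}{1/(n-1)} = \frac{n-1}{n},
\qquad
\frac{V(\Delta_{me})}{V(\Delta_{+})} = 1,
\qquad
\frac{V(\Delta_{me} \cap \Delta_{dd})}{V(\Delta_{dd})} = 1 .
\]
```

The last ratio uses $\tfrac{1}{2(n-1)} \le \tfrac{1}{n}$ for $n \ge 2$, so every diagonally dominant matrix in the model has a positive eigenvalue $1 - nb$, up to a boundary point when $n = 2$. For $n = 8$ the first ratio is $7/8$.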

Conclusion
When modelling sequence evolution we often adopt several simplifying assumptions, which make the statistical problems tractable. The commonly used Markov models depend on the assumption that sites evolve independently following a Markov process. The Markov chain is often assumed to be a homogeneous continuous-time chain, that is, its transition probabilities depend only on the elapsed time and not on the absolute time. This means that the instantaneous rates of substitution are fixed over time and are usually displayed as the entries of a rate matrix. If an evolutionary process is not homogeneous, then one can multiply transition matrices of short homogeneous processes. The resulting matrix is not necessarily embeddable, but if it is, then the inhomogeneous process can be approximated by a homogeneous one.
In this paper we provide necessary and sufficient conditions for model embeddability of n × n symmetric group-based substitution models, which include the well-known Cavender-Farris-Neyman, Jukes-Cantor, Kimura 2-parameter and Kimura 3-parameter models for DNA. We fully characterize those embeddable n × n stochastic matrices following a symmetric group-based model structure whose Markov generators also satisfy the constraints of the model, a property which we refer to as model embeddability.
A novel application of our main result is the characterization of model embeddability for three group-based models for hachimoji DNA, a synthetic genetic system with eight building blocks. For these models we also compute the relative volumes of model embeddable matrices within other relevant sets of Markov matrices. These computations show how restrictive the hypothesis of a particular hachimoji continuous-time group-based model is.
In this article we have considered symmetric group-based models. The importance of the symmetry assumption is that it guarantees that the eigenvalues of rate and transition matrices of a group-based model are real. We use this property in the proof of Theorem 1. A future research question is to explore whether this approach can be extended to group-based models that are not symmetric.

A Friendly labeling functions
Besides G-compatible labeling functions, there is another class of labeling functions which has been studied in the literature. They are called friendly labeling functions and were introduced by Sturmfels and Sullivant (2005). Friendly labelings are useful in determining phylogenetic invariants for group-based models on evolutionary trees. In particular, a friendly labeling guarantees that if a particular labeling comes from an assignment of group elements, then any choice of a group element for one particular edge which is consistent with the labeling can be extended to an assignment that is consistent with the labeling on all edges of the claw tree.
Definition 3 Let $G$ be a finite abelian group and $L : G \to \mathcal{L}$ a labeling function. Let $n \in \mathbb{N}$ and $Z := \{ g \in G^n : g_n = \sum_{i=1}^{n-1} g_i \}$. Define the map $\hat{L} : Z \subseteq G^n \to \mathcal{L}^n$ to be the induced labeling function on $Z \subseteq G^n$. The labeling function $L$ is said to be $n$-friendly if for every $l \in \hat{L}(Z)$ and $i = 1, 2, \dots, n$, we have $\pi_i(\hat{L}^{-1}(l)) = L^{-1}(\pi_i(l))$. Here, $\pi_i$ denotes the projection to the $i$-th component. Furthermore, the labeling function $L$ is said to be friendly if it is $n$-friendly for all $n \ge 3$.
By Sturmfels and Sullivant (2005, Lemma 11), to check whether a labeling function is friendly, it is enough to check that the labeling is 3-friendly. Then $L$ is not a friendly labeling because $L^{-1}(\pi_3((1, 1, 2))) = \{2, 3\}$ while $\pi_3(\hat{L}^{-1}((1, 1, 2))) = \pi_3(\{(1, 1, 2)\}) = \{2\}$. Table 6 summarizes all friendly labelings for abelian groups of order $n$, where $2 \le n \le 8$. In the table, two group elements receive the same label if they belong to the same subset in a partition of $G$. In the table we do not include the friendly labelings for $\mathbb{Z}_2 \times \mathbb{Z}_4$ and $\mathbb{Z}_2 \times \mathbb{Z}_2 \times \mathbb{Z}_2$, since there are too many of them.
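The 3-friendliness test can be checked by brute force for small cyclic groups. The Python sketch below is an illustration under stated assumptions, not code from the paper. It handles only $\mathbb{Z}_m$, represented as integers modulo $m$, and the function names are hypothetical. The labeling used in the demonstration, $L(0) = 0$, $L(1) = 1$, $L(2) = L(3) = 2$ on $\mathbb{Z}_4$, is an assumed labeling chosen to be consistent with the computation displayed above.

```python
from itertools import product


def friendliness_violations(order, label, n=3):
    """List all (l, i, projected, preimage) where the n-friendly condition
    pi_i(Lhat^{-1}(l)) = L^{-1}(pi_i(l)) fails, for the cyclic group Z_order.

    `label` maps each group element 0..order-1 to its label.
    Positions i are reported 1-based, as in the definition.
    """
    elements = range(order)
    # Group the elements of Z = {g : g_n = g_1 + ... + g_{n-1}} by label vector.
    fibres = {}
    for head in product(elements, repeat=n - 1):
        g = head + (sum(head) % order,)
        l = tuple(label[x] for x in g)
        fibres.setdefault(l, set()).add(g)
    violations = []
    for l, fibre in fibres.items():
        for i in range(n):
            projected = {g[i] for g in fibre}                      # pi_i(Lhat^{-1}(l))
            preimage = {x for x in elements if label[x] == l[i]}   # L^{-1}(pi_i(l))
            if projected != preimage:
                violations.append((l, i + 1, projected, preimage))
    return violations


def is_n_friendly(order, label, n=3):
    """True if the labeling satisfies the n-friendly condition on Z_order."""
    return not friendliness_violations(order, label, n)


if __name__ == "__main__":
    # Assumed labeling on Z_4, consistent with the displayed computation.
    assumed = {0: 0, 1: 1, 2: 2, 3: 2}
    print(is_n_friendly(4, assumed))
    for violation in friendliness_violations(4, assumed):
        print(violation)
```

Since 3-friendliness suffices by the lemma cited above, the default `n=3` decides friendliness; extending the sketch to products of cyclic groups would require replacing the modular sum with componentwise addition.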
The next example shows that there are friendly labelings that are not G-compatible.