A new algebraic approach to genome rearrangement models

We present a unified framework for modelling genomes and their rearrangements in a genome algebra, as elements that simultaneously incorporate all physical symmetries. Building on previous work utilising the group algebra of the symmetric group, we explicitly construct the genome algebra for the case of unsigned circular genomes with dihedral symmetry and show that the maximum likelihood estimate (MLE) of genome rearrangement distance can be validly and more efficiently performed in this setting. We then construct the genome algebra for a more general case, that is, for genomes that may be represented by elements of an arbitrary group and symmetry group, and show that the MLE computations can be performed entirely within this framework. There is no prescribed model in this framework; that is, it allows any choice of rearrangements that preserve the set of regions, along with arbitrary weights. Further, since the likelihood function is built from path probabilities—a generalisation of path counts—the framework may be utilised for any distance measure that is based on path probabilities.


Introduction
In the eight decades since Dobzhansky and Sturtevant observed that differences in fruit fly genomes could be explained by a sequence of reversals of genome segments (Dobzhansky and Sturtevant 1938), the study of evolution via genome rearrangement has developed into a rich and active field, with diverse applications (Chen et al. 2018;Darmon and Leach 2014;Oesper et al. 2017). Much work focuses on the calculation of evolutionary distances under rearrangement models, with the distances subsequently used to reconstruct phylogenetic trees. For example, minimal rearrangement distances between genomes-and other distance estimates based on these-have been studied extensively and, under various model restrictions, can be calculated efficiently (Bader et al. 2001;Wang et al. 2006;Bader and Ohlebusch 2006;Oliveira et al. 2019). There are, however, good arguments for applying stochastic methods that estimate genomic distance, via rearrangement, as evolutionary time elapsed (Serdoz et al. 2017), particularly when such an approach allows various rearrangement models to be considered (Terauds and Sumner 2019).
The maximum likelihood approach detailed in Serdoz et al. (2017) utilised the theory of the symmetric and dihedral groups to model circular genomes and region set-conserving rearrangements, motivated by earlier group-theoretical approaches to rearrangement models (Francis 2014; Egri-Nagy et al. 2014). The combinatorial problem of calculating the maximum likelihood estimate (MLE) of evolutionary distance was then converted into a numerical one in Sumner et al. (2017) via the representation theory of the symmetric group algebra. In Terauds and Sumner (2019), the consideration of symmetry was extended to include symmetry of rearrangement models, the role of this symmetry in simplifying calculations was explored, and the concrete implementation of the technique for a general model was described. Whilst the representation theory approach reduces the complexity of the MLE computations, the complexity is still of factorial order, meaning that computations for large numbers of regions remain, for the moment, out of reach.
In this work, we suggest that the appropriate theoretical setting for such MLE computations is in fact not the symmetric group algebra but a lower-dimensional algebra. In the symmetric group algebra, the basis elements for computations are individual permutations, each representing a rearrangement or a genome in a fixed orientation, and symmetry is incorporated as an extra step in the calculations. To simplify this, we construct an algebra that incorporates the inherent symmetry into each element. Here, the basis elements are permutation clouds. These correspond to genomes, by simultaneously including all physical orientations; due to the corresponding symmetries of the rearrangement model, they also represent rearrangements in a natural way.
This approach explains and removes the redundancy in the MLE computations that was observed in Terauds and Sumner (2019). In developing the approach, we firstly focus on the simple concrete case of uni-chromosomal circular genomes modelled with unoriented regions and no distinguished positions, building on previous work (Serdoz et al. 2017;Sumner et al. 2017;Terauds and Sumner 2019). Subsequently, we demonstrate that our results may be applied more generally, for example to genomic models that include region orientation and/or origin and terminus of replication. Further, although our focus is on calculation of MLEs, our approach can be applied to calculate other measures of genomic distance under rearrangement; in particular, any that utilise path counts or weighted path counts, such as minimum distance. Whilst the framework does not specify a rearrangement model-indeed, one may choose the allowed rearrangements and their weights-we note that the group-based approach limits us to rearrangements that conserve the set of genomic regions, and thus cannot accommodate insertions, deletions or duplications. We are currently working on expanding the framework to a semigroup-based approach that could incorporate at least some of these rearrangement types. Some of the algebra easily extends to the semigroup case (see Remark 4.7, for example), however there is much yet to be done and this is outside the scope of the present paper.
In the next section, we outline the details of the symmetric group algebra approach to calculating MLEs (Terauds and Sumner 2019) for pairs of unsigned, circular genomes, which forms the foundation for the current work. Following this, in Sect. 3, we construct the genome algebra, based on permutation clouds, for this case and show that it provides a coherent framework for modelling such genomes and region set-conserving rearrangements and for calculating MLEs. Section 4 outlines the extension of our results and techniques from permutations with dihedral symmetry to an arbitrary group and symmetry group. This verifies that, as well as incorporating flexibility in the rearrangement model, the framework is not specific to one particular genomic model. The paper concludes with a brief discussion section.

Background: the permutation approach
In this section we set out the theoretical framework for rearrangement models based on permutations, and recall the key elements of the technique for calculating the maximum likelihood estimate of evolutionary distance. Full derivations and details may be found in Terauds and Sumner (2019) and the earlier papers (Serdoz et al. 2017). For the specific case study in this and the next section, we model the evolution of single-strand, circular genomes; we do not consider the regions to be oriented and do not distinguish any positions. Genomes that are to be compared share $N$ identified regions of interest and we consider only rearrangements that conserve the set of regions. Accordingly, we use unsigned permutations, that is, elements of the symmetric group $S_N$, to represent both genomes and rearrangements. Explicitly: the regions and positions are each labelled by the integers $\{1, 2, \ldots, N\}$, and a given genome is represented by a permutation $\sigma \in S_N$, where $\sigma(i) = j$ means that region $i$ is in position $j$. Note that while the region labels are chosen once and are immutable, the position labelling reflects a choice of reference frame (starting position and direction of numbering) that changes when we move the genome in space. Since we do not distinguish any positions, there are $2N$ possible choices of reference frame and thus $2N$ distinct permutations that represent any given genome; we denote these by
$$[\sigma] := \{ d\sigma : d \in D_N \}. \tag{1}$$
Here, $D_N$ is the dihedral group, and the genome has dihedral symmetry.
Since D N is a subgroup of S N , the sets [σ ] are cosets, that is, each [σ ] = D N σ is an equivalence class of S N . Since a given genome exists independently of its orientation in space, we may identify it with the entire coset (Serdoz et al. 2017;Egri-Nagy et al. 2014). However, in this initial formulation, we choose any one of the permutations from the coset (1), say σ , to specify the genome, work with this single permutation at first, and incorporate all permutations in the set [σ ] (all symmetries of the genome) into the likelihood calculations in due course.
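The coset structure described above can be checked by brute force for small $N$. The following is a minimal sketch, not taken from the cited papers: it assumes positions are labelled $0, \ldots, N-1$ rather than $1, \ldots, N$, stores a permutation as a tuple of images, and uses the illustrative helper names `compose`, `dihedral_group` and `genome_class`.

```python
from itertools import permutations

def compose(a, b):
    """Product ab of permutations stored as tuples: apply b first, then a."""
    return tuple(a[b[i]] for i in range(len(b)))

def dihedral_group(n):
    """The 2n rotations and reflections of positions 0..n-1 (needs n >= 3)."""
    rotations = {tuple((i + k) % n for i in range(n)) for k in range(n)}
    reflections = {tuple((k - i) % n for i in range(n)) for k in range(n)}
    return rotations | reflections

def genome_class(sigma, d_n):
    """The coset [sigma] = D_N sigma: all permutations representing one genome."""
    return frozenset(compose(d, sigma) for d in d_n)

n = 5
d_n = dihedral_group(n)
classes = {genome_class(sigma, d_n) for sigma in permutations(range(n))}
print(len(d_n), len(classes))  # prints "10 12": 2N symmetries, N!/(2N) genomes
```

For $N = 5$ every class contains $2N = 10$ permutations, so the $5! = 120$ permutations split into 12 distinct genomes.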
We model evolution as a sequence of discrete rearrangement events occurring in continuous time. In this section, as in previous work, we consider a rearrangement to be a single permutation acting on a single permutation; in the next section we shall develop this into the notion of permutation clouds acting on permutation clouds. For now, however, for a genome represented by σ ∈ S N , a rearrangement event is represented by a permutation, a ∈ S N , acting on σ (on the left): σ → a σ . We refer to permutations a acting in this way as "rearrangements". The full biological model for evolution is given by (M, w, dist), where M ⊆ S N is the set of allowed rearrangements, w : M → (0, 1] is the probability distribution on this set, and dist is the probability distribution of the independent rearrangement events in time. One may have biological evidence for including particular types and sizes of rearrangements in the model with differing relative probabilities, or may wish to compare distances computed under differing models (see Terauds et al. 2021 for some specific examples of models and distance comparisons). The distribution dist may similarly be chosen according to evidence or preference; in this treatment, we use the Poisson distribution.
We shall emphasise at this stage that we make minimal further restrictions on the set of allowed rearrangements M. Without loss of generality, we assume that M generates the group S N ; this means that any permutation in S N may be obtained from any other by applying a sequence of elements from M. (Note that the case of M not generating S N is simpler: in this case M generates a subgroup, H ⊆ S N , and the problem reduces to considering this smaller group, since any pair of elements would simply be unrelated under the model or both be elements of a coset H σ .) Elements of D N are, formally, allowed in the set of rearrangements, although their action does not actually alter the genome. This allows for full generality; for example, if one wishes to include 'all inversions' in the model, then the inversion of a region of size N − 1 is the same as flipping the genome over in space.
The first model condition simply states that the model should naturally possess the same symmetry as the genome (in the current case, dihedral symmetry). Suppose, for example, that $(1, 2) \in M$, meaning that the regions in positions 1 and 2 may swap places. Then, since the position labelling is arbitrary, we should have $(\ell, \ell + 1) \in M$ for all $\ell = 1, 2, \ldots, N - 1$, and $(N, 1) \in M$, meaning that any two regions in adjacent positions may swap places; further, these rearrangements should all be equally probable. We refer to this property as dihedral symmetry of the model and, mathematically, express the condition as
$$\text{for each } a \in M \text{ and } d \in D_N, \quad dad^{-1} \in M \text{ and } w(dad^{-1}) = w(a). \tag{M1}$$
The second model condition ensures that the modelling is agnostic to the temporal direction of evolution. More precisely, the condition states that for any rearrangement that is allowed, its inverse is also allowed, with the same probability. That is,
$$\text{for each } a \in M, \quad a^{-1} \in M \text{ and } w(a^{-1}) = w(a). \tag{M2}$$
We refer to this property as rearrangement reversibility of the model, or simply model reversibility. This condition is natural in the current group-based setting, where the typical rearrangements are reversals (which are self-inverse) and translocations (whose inverses are translocations). It is not essential for most of the construction, but it does have some nice implications. For example, when we interpret our model of evolution as a Markov process, in Sect. 3.1, we will show that (M2) is equivalent to the time reversibility of the Markov process.
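Conditions (M1) and (M2) are finite checks and can be tested directly for a candidate model. The sketch below is illustrative only: it assumes 0-based position labels, represents a model as a dictionary from permutations to weights, and the function names `satisfies_m1` and `satisfies_m2` are hypothetical. It compares the adjacent-swap model from the example above with a model containing a single swap.

```python
from fractions import Fraction

def compose(a, b):
    """Product ab of permutations stored as tuples: apply b first, then a."""
    return tuple(a[b[i]] for i in range(len(b)))

def inverse(a):
    inv = [0] * len(a)
    for i, image in enumerate(a):
        inv[image] = i
    return tuple(inv)

def dihedral_group(n):
    """The 2n rotations and reflections of positions 0..n-1 (needs n >= 3)."""
    rotations = {tuple((i + k) % n for i in range(n)) for k in range(n)}
    reflections = {tuple((k - i) % n for i in range(n)) for k in range(n)}
    return rotations | reflections

def transposition(n, i, j):
    p = list(range(n))
    p[i], p[j] = p[j], p[i]
    return tuple(p)

def satisfies_m1(model, d_n):
    """(M1): every conjugate d a d^-1 is in the model with the same weight."""
    return all(model.get(compose(compose(d, a), inverse(d))) == w
               for a, w in model.items() for d in d_n)

def satisfies_m2(model):
    """(M2): every inverse a^-1 is in the model with the same weight."""
    return all(model.get(inverse(a)) == w for a, w in model.items())

n = 6
d_n = dihedral_group(n)
adjacent_swaps = {transposition(n, i, (i + 1) % n): Fraction(1, n) for i in range(n)}
single_swap = {transposition(n, 0, 1): Fraction(1)}
print(satisfies_m1(adjacent_swaps, d_n), satisfies_m2(adjacent_swaps))  # True True
print(satisfies_m1(single_swap, d_n), satisfies_m2(single_swap))        # False True
```

The single-swap model is reversible but not dihedrally symmetric, since rotating the reference frame moves the swap to a different pair of positions.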
The evolutionary distance measure we consider in this paper is the maximum likelihood estimate of time elapsed (MLE). This is the time $T$ at which the likelihood function attains its maximum, where the likelihood function gives the probability, for any given time $T$, that the reference genome has evolved into the target genome in this amount of time. To be precise, for reference genome represented by the identity permutation $e \in S_N$ and target genome represented by $\sigma \in S_N$, the MLE is the maximiser of the function $L(T|\sigma)$, where
$$L(T|\sigma) := \sum_{k=0}^{\infty} P(e \to [\sigma] \text{ via } k \text{ events}) \, P(k \text{ events in time } T). \tag{2}$$
Of course, the likelihood function need not have a maximum; this simply means that no evidence of an evolutionary relationship between the reference and the target under the given model can be discerned. This scenario, familiar from DNA sequence alignment paradigms such as the Jukes-Cantor correction (Felsenstein 2004), was discussed in the context of genome rearrangement models in Serdoz et al. (2017) and Terauds and Sumner (2019), where it was observed to occur in a substantial proportion of cases (independently of the chosen biological model).
For each $k$, the factor $P(k \text{ events in time } T)$ in the likelihood expression is determined by the distribution dist. The first factor, $P(e \to [\sigma] \text{ via } k \text{ events})$, is the genome path probability, which we shall denote by $\alpha_k(\sigma)$. Since the target genome may be represented by any permutation from the set $[\sigma]$, '$e \to [\sigma]$' is shorthand for "the permutation $e$ is transformed into any permutation from the set $[\sigma]$" and we calculate the genome path probability as a sum of permutation path probabilities, denoted by $\beta_k(\sigma)$. That is,
$$\alpha_k(\sigma) := \sum_{d \in D_N} \beta_k(d\sigma).$$
Given a permutation $\sigma \in S_N$ and a model $(M, w, \text{dist})$, we specify each permutation path probability $\beta_k(\sigma)$ by considering the set $P_k(\sigma)$ of all $k$-length sequences of permutations, chosen from $M$, that transform $e$ into $\sigma$. Since we assume rearrangement events to be independent, the permutation path probability is then the sum of the probabilities of all such sequences, that is,
$$\beta_k(\sigma) := \sum_{(a_1, \ldots, a_k) \in P_k(\sigma)} w(a_1) w(a_2) \cdots w(a_k).$$
We note that permutation path probabilities vary for different elements of
Theorem 2.1 Let $(M, w, \text{dist})$ be a full biological model for evolution. For all $k \in \mathbb{N}_0$ and $\sigma \in S_N$, the following hold.
(iii) If the model has the dihedral symmetry property (M1) and the reversibility property (M2), then $\alpha_k(\sigma_1) = \alpha_k(\sigma_2)$ for all $\sigma_1, \sigma_2$ in the set
It was shown in Sumner et al. (2017) that the combinatorial problem of calculating path probabilities may be converted into a linear algebra problem via the representation theory of the symmetric group algebra, $\mathbb{C}[S_N]$. For full details in the general model setting, we refer the reader to Terauds and Sumner (2019). We recall the essential steps in the derivation here, since we shall undertake a similar procedure in a lower dimensional algebra in the next section.
Here, we use the term algebra to mean a vector space equipped with a bilinear product. In particular, we require the group algebra $\mathbb{C}[S_N]$ consisting of all formal linear combinations of elements of the group $S_N$; this algebra has natural basis $S_N$, and thus dimension $N!$. For detailed background on the symmetric group algebra, and algebras more generally, we refer the reader to Sagan (2001) and Etingof et al. (2011) respectively. The following group algebra elements are key to our calculations: the model element $s$ and the symmetry element $z$, given by
$$s := \sum_{a \in M} w(a)\, a, \qquad z := \frac{1}{2N} \sum_{d \in D_N} d.$$
To reformulate the path probabilities, we firstly observe that
$$s^k = \sum_{\sigma \in S_N} \beta_k(\sigma)\, \sigma. \tag{3}$$
Then, for $\sigma \in S_N$, we multiply (3) on the left by $\sigma^{-1}$ to see that $\beta_k(\sigma)$ is the coefficient of $e$ in the expansion of $\sigma^{-1} s^k$. The representation theory of the symmetric group algebra tells us that this is exactly ($\frac{1}{N!}$ times) the trace of the regular representation of $\sigma^{-1} s^k$. That is,
$$\beta_k(\sigma) = \frac{1}{N!}\, \chi_{\mathrm{reg}}(\sigma^{-1} s^k). \tag{4}$$
Thus, for $\sigma \in S_N$, the $k$th genome path probability is
$$\alpha_k(\sigma) = \sum_{d \in D_N} \beta_k(d\sigma) = \frac{2N}{N!}\, \chi_{\mathrm{reg}}(\sigma^{-1} z s^k) = \frac{2N}{N!} \sum_{p \vdash N} D_p\, \chi_p(\sigma^{-1} z s^k), \tag{5}$$
where we have used the linearity of the characters to incorporate the symmetry element $z$. The final equality is gained by decomposing the regular representation of $\mathbb{C}[S_N]$ into irreducible representations. Recall that the irreducible representations of $\mathbb{C}[S_N]$ correspond to the integer partitions of $N$ (Sagan 2001, Prop. 1.10.1); here, we denote a partition of $N$ by $p \vdash N$ and index the representations and related objects accordingly. Specifically, for each partition $p \vdash N$, $\rho_p$ is the irreducible representation corresponding to $p$, $D_p$ is its multiplicity (and dimension), and $\chi_p$ is the character of this representation. The above derivation of the permutation path probabilities follows that of Sumner et al. (2017); in that paper it was also noted that an alternative derivation is possible via the theory of the Fourier transform on $S_N$. That is, one may extend the probability distribution $w$ on $M$ to a function on the whole of $S_N$ (taking the value zero outside $M$), notice that the Fourier transform of this function with respect to an irreducible representation $\rho_p$ is equal to $\rho_p(s)$ and that its $k$-fold convolution with itself is exactly the function $\beta_k$ on $S_N$, and then apply the Fourier inversion formula to obtain (4).
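The identification of $\beta_k(\sigma)$ with the coefficient of $\sigma$ in $s^k$ can be confirmed numerically for a small case. The following sketch is an illustration under stated assumptions rather than the authors' implementation: it assumes 0-based labels, the adjacent-swap model with uniform weights on $N = 5$ regions, and group algebra elements stored as dictionaries; all function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def compose(a, b):
    """Product ab of permutations stored as tuples: apply b first, then a."""
    return tuple(a[b[i]] for i in range(len(b)))

def dihedral_group(n):
    rotations = {tuple((i + k) % n for i in range(n)) for k in range(n)}
    reflections = {tuple((k - i) % n for i in range(n)) for k in range(n)}
    return rotations | reflections

def transposition(n, i, j):
    p = list(range(n))
    p[i], p[j] = p[j], p[i]
    return tuple(p)

def multiply(x, y):
    """Product of group algebra elements stored as {permutation: coefficient}."""
    out = defaultdict(Fraction)
    for a, xa in x.items():
        for b, yb in y.items():
            out[compose(a, b)] += xa * yb
    return dict(out)

def model_power(s, k, n):
    """s^k; the coefficient of sigma is the path probability beta_k(sigma)."""
    result = {tuple(range(n)): Fraction(1)}
    for _ in range(k):
        result = multiply(s, result)
    return result

def beta_brute_force(s, k, sigma, n):
    """Sum the weights of all k-step sequences from M that turn e into sigma."""
    total = Fraction(0)
    for steps in product(s.items(), repeat=k):
        perm, weight = tuple(range(n)), Fraction(1)
        for a, w in steps:
            perm, weight = compose(a, perm), weight * w  # left action: sigma -> a sigma
        if perm == sigma:
            total += weight
    return total

n, k = 5, 3
d_n = dihedral_group(n)
s = {transposition(n, i, (i + 1) % n): Fraction(1, n) for i in range(n)}
sigma = transposition(n, 0, 2)
s_k = model_power(s, k, n)
beta = s_k.get(sigma, Fraction(0))
alpha = sum(s_k.get(compose(d, sigma), Fraction(0)) for d in d_n)
print(beta == beta_brute_force(s, k, sigma, n), sum(s_k.values()) == 1)  # True True
```

Here `alpha` sums the coefficients of $s^k$ over the coset $D_N \sigma$, which is the genome path probability $\alpha_k(\sigma)$ for this toy model.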
Now, for a model with rearrangement reversibility (M2), the irreducible representations of the model element $s$ are diagonalisable (Terauds and Sumner 2019) and we obtain
$$\alpha_k(\sigma) = \frac{2N}{N!} \sum_{p \vdash N} D_p \sum_{i=1}^{r_p} \lambda_{p,i}^k \, \mathrm{tr}\!\left(\rho_p(\sigma^{-1} z) E_{p,i}\right),$$
where for each $p$, the eigenvalues of $\rho_p(s)$ are $\{\lambda_{p,i} : i = 1, \ldots, r_p\}$ and, for each $p$ and $i$, $E_{p,i}$ is the projection onto the eigenspace of $\lambda_{p,i}$. Substituting this into the likelihood expression (2) and setting the distribution of events in time to be dist = Poisson(1), we obtain
$$L(T|\sigma) = \frac{2N}{N!}\, e^{-T} \sum_{p \vdash N} D_p \sum_{i=1}^{r_p} \mathrm{tr}\!\left(\rho_p(\sigma^{-1} z) E_{p,i}\right) e^{\lambda_{p,i} T}, \tag{6}$$
where we have observed that the expression is in fact a power series and, accordingly, have been able to eliminate the infinite sum from the expression. We note that, for a given model, one need only calculate the eigenvalues of each $\rho_p(s)$ once. Thus the bulk of the calculation burden is now in calculating the partial traces, that is, for any given genome, the set $\{\mathrm{tr}(\rho_p(\sigma^{-1} z) E_{p,i}) : p \vdash N,\ i = 1, \ldots, r_p\}$ of coefficients that correspond to the distinct eigenvalues in the likelihood equation.
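The elimination of the infinite sum rests on the power series identity $\sum_{k} \lambda^k e^{-T} T^k / k! = e^{(\lambda - 1)T}$, applied to each eigenvalue in turn. A minimal numerical sketch of this step follows; the eigenvalues used are hypothetical test values, not eigenvalues of any particular model, and the truncation at 80 terms is an assumption that is adequate only for moderate $T$.

```python
import math

def poisson_weight(k, t):
    """P(k events in time t) for a Poisson(1) process."""
    return math.exp(-t) * t**k / math.factorial(k)

def truncated_series(lam, t, terms=80):
    """Partial sum of lam^k * P(k events in time t) over k = 0..terms-1."""
    return sum(lam**k * poisson_weight(k, t) for k in range(terms))

def closed_form(lam, t):
    """Value of the full series: exp((lam - 1) * t)."""
    return math.exp((lam - 1.0) * t)

for lam in (1.0, 0.5, -0.3):      # hypothetical eigenvalues, for illustration only
    for t in (0.5, 2.5):
        gap = abs(truncated_series(lam, t) - closed_form(lam, t))
        print(lam, t, gap < 1e-12)
```

Each eigenvalue therefore contributes a single exponential term to the likelihood, weighted by its partial trace.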
In implementing the likelihood calculations using the expression (6), we observed that for all genomes, most of these partial trace coefficients were zero (Terauds and Sumner 2019). That is, most of our calculations ended up not contributing to the final likelihood function. In the next section, we explain the occurrence of these zeroes and show that the redundancy can be removed from the computations.

The circular genome algebra
The calculations outlined above are performed in the group algebra $\mathbb{C}[S_N]$, where each permutation in $S_N$ is a distinct basis element. However, we (and, indeed, the computations) do not distinguish between different permutations that represent the same genome, that is, between elements of each equivalence class $[\sigma]$, for $\sigma \in S_N$. We now construct a lower-dimensional algebra by combining these equivalent permutations together to form basis elements, called permutation clouds, that correspond to circular genomes. Until otherwise stated, assume that we have fixed a number of regions $N$ and a biological model for evolution $(M, w, \text{dist})$ that has dihedral symmetry and is reversible, that is, satisfies (M1) and (M2).
Definition 3.1 For symmetry element $z \in \mathbb{C}[S_N]$, the circular genome algebra for $N$ regions is
$$A := z\,\mathbb{C}[S_N].$$
Any element of $A$ of the form $z\sigma$, where $\sigma \in S_N$, is called a permutation cloud.
One easily verifies that $A$ is a subalgebra of $\mathbb{C}[S_N]$. To see that $A$ has a natural basis that is in correspondence with the set of genomes, firstly observe that any element of $A$ can be written as a linear combination of permutation clouds $z\sigma$, for $\sigma \in S_N$. Thus there exists a basis for $A$ of the form $\{z\sigma_1, \ldots, z\sigma_K\}$, for $\sigma_i \in S_N$. Now,
$$z\sigma_i = \frac{1}{2N} \sum_{d \in D_N} d\sigma_i,$$
so that each basis element is a weighted sum of elements from a set $[\sigma_i]$, representing a particular genome. Since the sets are equivalence classes, for any $\sigma_1, \sigma_2 \in S_N$ we have
$$z\sigma_1 = z\sigma_2 \iff [\sigma_1] = [\sigma_2].$$
This means that the set of distinct permutation clouds corresponds to the set of distinct genomes, and these form a basis for $A$. Finally, noting that for all $\sigma \in S_N$, $|[\sigma]| = 2N$, we see that
$$\dim(A) = K := \frac{N!}{2N}.$$
For the remainder of this section, we fix a basis for $A$,
$$B := \{z\sigma_1, z\sigma_2, \ldots, z\sigma_K\}, \tag{8}$$
where we have chosen a representative $\sigma_i \in S_N$ of each equivalence class $[\sigma_i]$ for notational convenience. We set the first basis element to correspond to $[e] = D_N$, so that $z\sigma_1 = z$. Having used the symmetry of the genomes to construct the algebra, we now incorporate the symmetry of the model to extract some useful properties.

Proposition 3.2
The model and symmetry elements, s, z ∈ C[S N ], have the following properties.
(i) z is idempotent; (ii) s and z commute.
(ii) Now we use the dihedral symmetry (M1) of the model to rewrite the model as m base rearrangements, a 1 , . . . , a m ∈ S N , along with their symmetries. That is, Then, using the same idea as in (i), The above properties translate immediately into properties of the representations of z and s.
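Both properties in Proposition 3.2 can be verified by direct multiplication in the group algebra for a small example. The sketch below is a numerical check under stated assumptions, not a substitute for the proof: it assumes 0-based labels, $N = 5$, and the uniform adjacent-swap model, with group algebra elements stored as dictionaries. It also shows that commutativity fails for a model without dihedral symmetry.

```python
from collections import defaultdict
from fractions import Fraction

def compose(a, b):
    """Product ab of permutations stored as tuples: apply b first, then a."""
    return tuple(a[b[i]] for i in range(len(b)))

def dihedral_group(n):
    rotations = {tuple((i + k) % n for i in range(n)) for k in range(n)}
    reflections = {tuple((k - i) % n for i in range(n)) for k in range(n)}
    return rotations | reflections

def transposition(n, i, j):
    p = list(range(n))
    p[i], p[j] = p[j], p[i]
    return tuple(p)

def multiply(x, y):
    """Product of group algebra elements stored as {permutation: coefficient}."""
    out = defaultdict(Fraction)
    for a, xa in x.items():
        for b, yb in y.items():
            out[compose(a, b)] += xa * yb
    return dict(out)

n = 5
d_n = dihedral_group(n)
z = {d: Fraction(1, 2 * n) for d in d_n}                                 # symmetry element
s = {transposition(n, i, (i + 1) % n): Fraction(1, n) for i in range(n)}  # satisfies (M1)
s_bad = {transposition(n, 0, 1): Fraction(1)}                            # violates (M1)

print(multiply(z, z) == z)                      # True: z is idempotent
print(multiply(s, z) == multiply(z, s))         # True: s and z commute
print(multiply(s_bad, z) == multiply(z, s_bad)) # False without dihedral symmetry
```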

Corollary 3.3 Let $p \vdash N$ and let $\rho_p : \mathbb{C}[S_N] \to M_{D_p}(\mathbb{C})$ denote the corresponding irreducible representation of the symmetric group algebra. Then
(i) the only eigenvalues of $\rho_p(z)$ are 0 and 1; (ii) $\rho_p(z)$ and $\rho_p(s)$ are simultaneously diagonalisable, with real eigenvectors.
Proof Claim (i) is immediate since $z$, and thus $\rho_p(z)$, is idempotent. To show (ii), we firstly choose the representation $\rho_p$ to be orthogonal on $S_N$ (Sagan 2001). Then the rearrangement reversibility of the model ensures that $\rho_p(s)$ is symmetric (Terauds and Sumner 2019) and, similarly, one may verify directly that $\rho_p(z)^T = \rho_p(z)$. Thus, since $z$ and $s$ commute, the representation matrices commute and are simultaneously diagonalisable. In particular, since these matrices are real symmetric, the orthonormal set of simultaneous eigenvectors may be chosen to be real.
Now fix $p \vdash N$ and choose a set of orthonormal vectors $\{v_1, v_2, \ldots, v_{D_p}\} \subseteq \mathbb{R}^{D_p}$ that are eigenvectors for both $\rho_p(z)$ and $\rho_p(s)$, ordered so that the first $k_p$ of them are eigenvectors for the eigenvalue 1 of $\rho_p(z)$. Take an eigenvalue $\lambda_{p,i}$ of $\rho_p(s)$ and let $J_i \subseteq \{1, 2, \ldots, D_p\}$ be such that $\{v_j : j \in J_i\}$ are the eigenvectors for $\lambda_{p,i}$. Then, for $\sigma \in S_N$, the partial trace for $\lambda_{p,i}$ may be written as
$$\mathrm{tr}\!\left(\rho_p(\sigma^{-1} z) E_{p,i}\right) = \sum_{j \in J_i} v_j^T \rho_p(\sigma^{-1}) \rho_p(z) v_j, \tag{10}$$
where for each $j$,
$$\rho_p(z) v_j = \begin{cases} v_j, & j \le k_p, \\ 0, & j > k_p. \end{cases} \tag{11}$$
We see then that $\rho_p(z)$ "knocks out" parts of the partial traces; in particular, it does this independently of the genome. We shall establish shortly that, in total, $\frac{2N-1}{2N}$ of the partial traces are knocked out in this way, thus explaining the observation in Terauds and Sumner (2019) that most of the calculated partial traces were zero.
The key to performing MLE computations in the symmetric group algebra is the relationship between the character of the regular representation and the identity element, e ∈ C[S N ]: the character χ reg (τ ) counts occurrences of the identity in a generic element τ ∈ C[S N ]. Since z is idempotent, it is a left identity (but not a right identity) in the algebra A. We now construct the regular representation, ρ A reg , of A and show that its character, the regular character χ A reg , functions in exactly this way for the left identity z ∈ A.
We construct the regular representation of $A$ via the left action of elements of $A$ on the basis $B = \{z\sigma_1, \ldots, z\sigma_K\}$ fixed above (8). We need only consider the representation of a generic basis element, $z\sigma$ for $\sigma \in S_N$, since one may extend linearly to all of $A$. For arbitrary $\sigma \in S_N$, the $ij$th entry of the matrix $\rho^A_{\mathrm{reg}}(z\sigma)$ is the coefficient of $z\sigma_i$ in the expansion of $(z\sigma)(z\sigma_j)$, that is,
$$\left[\rho^A_{\mathrm{reg}}(z\sigma)\right]_{ij} = \frac{1}{2N} \left| \{ d \in D_N : z\sigma d \sigma_j = z\sigma_i \} \right|.$$
One readily verifies that $\rho^A_{\mathrm{reg}}(z)$ is the $K \times K$ identity matrix and that $\rho^A_{\mathrm{reg}}(z\sigma)^T = \rho^A_{\mathrm{reg}}(z\sigma^{-1})$. The regular character $\chi^A_{\mathrm{reg}}$ is the trace of the regular representation matrix. For a generic basis element $z\sigma \in A$,
$$\chi^A_{\mathrm{reg}}(z\sigma) = \begin{cases} K, & \sigma \in D_N, \\ 0, & \text{otherwise}. \end{cases} \tag{13}$$
Since $z\sigma = z$ if and only if $\sigma \in D_N$, this shows that we can use the character of the regular representation of the algebra $A$ to track coefficients of the left identity $z$, just as we do for the identity $e$ in $\mathbb{C}[S_N]$. Further, we can express the regular character of $A$ as a sum over the irreducible characters of $\mathbb{C}[S_N]$, and thus see that the regular characters of $A$ and $\mathbb{C}[S_N]$ coincide on $A$.
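The regular representation of $A$ is small enough to build explicitly for $N = 5$, where $K = 12$. The following sketch constructs the matrix of left multiplication by a permutation cloud on the cloud basis, using the fact that $(z\sigma)(z\sigma_j) = \frac{1}{2N}\sum_{d \in D_N} z(\sigma d \sigma_j)$, and checks the stated properties. It assumes 0-based labels and an arbitrary choice of coset representatives; the names `cloud_basis` and `regular_matrix` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def compose(a, b):
    """Product ab of permutations stored as tuples: apply b first, then a."""
    return tuple(a[b[i]] for i in range(len(b)))

def dihedral_group(n):
    rotations = {tuple((i + k) % n for i in range(n)) for k in range(n)}
    reflections = {tuple((k - i) % n for i in range(n)) for k in range(n)}
    return rotations | reflections

def cloud_basis(n, d_n):
    """One representative per coset D_N sigma (identity coset first), plus a lookup."""
    reps, index = [], {}
    for sigma in permutations(range(n)):  # the identity is generated first
        if sigma not in index:
            for d in d_n:
                index[compose(d, sigma)] = len(reps)
            reps.append(sigma)
    return reps, index

def regular_matrix(sigma, reps, index, d_n):
    """Matrix of left multiplication by the cloud z*sigma on the cloud basis."""
    size = len(reps)
    mat = [[Fraction(0)] * size for _ in range(size)]
    for j, rep in enumerate(reps):
        for d in d_n:
            i = index[compose(sigma, compose(d, rep))]  # cloud containing sigma d rep
            mat[i][j] += Fraction(1, len(d_n))
    return mat

def trace(mat):
    return sum(mat[i][i] for i in range(len(mat)))

n = 5
d_n = dihedral_group(n)
reps, index = cloud_basis(n, d_n)
K = len(reps)
identity = [[Fraction(int(i == j)) for j in range(K)] for i in range(K)]
counts_z = all(trace(regular_matrix(sig, reps, index, d_n)) == (K if sig in d_n else 0)
               for sig in permutations(range(n)))
print(K, regular_matrix(reps[0], reps, index, d_n) == identity, counts_z)  # 12 True True
```

The trace equals $K$ exactly when the cloud is $z$ itself and vanishes otherwise, which is the "identity counting" behaviour described above.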

Proposition 3.4 For arbitrary τ ∈ A,
Proof (i) Given τ ∈ A and the basis B from (8), there exist c 1 , . . . , (ii) It suffices to consider a generic basis element zσ ∈ A, since the characters are linear. We shall apply the dual orthogonality relations on the irreducible characters of S N (see for example (James and Liebeck 2001, Thm. 16.4)), given for σ, τ ∈ S N by where cent S N (σ ) := {γ ∈ S N : γ σ = σ γ } is the centraliser of σ and the map δ : Recall that for p N and τ ∈ S N , by (13), recalling that K = N ! 2N .
An immediate consequence of the above is an expression for the dimension of $A$ in terms of the characters,
$$\dim(A) = K = \sum_{p \vdash N} D_p\, \chi_p(z) = \sum_{p \vdash N} D_p\, k_p, \tag{16}$$
where, for each $p \vdash N$, $k_p$ is the multiplicity of the eigenvalue 1 of $\rho_p(z)$. The first part of this can also be seen directly from the dual orthogonality relations.
To perform the MLE calculations in the algebra A efficiently, we will need a decomposition of the regular character in terms of irreducible characters of A. Firstly, we'll observe that irreducible submodules of A = zC[S N ] can be produced by acting with z on the irreducible submodules of C[S N ]. This is straightforward and, in fact, true in a more general context (see, for example, Steinberg 2016, Lemma 4.15), however we include the details here since we'll use them in our subsequent constructions. We denote the irreducible submodules of Theorem 3.5 The non-trivial modules gained by acting with the symmetry element z on the irreducible submodules of C[S N ] are irreducible modules of the genome algebra A.
Proof Let p N . As above, we may take a set {v 1 , v 2 , . . . , v D p } of (real, orthonormal, linearly independent) eigenvectors for both ρ p (s) and ρ p (z), ordered such that the first k p of them correspond to the eigenvalue 1 of ρ p (z).
It is clear that W p is an A-module; we need show that it is either {0} or irreducible. Suppose that there exists U p ⊆ W p such that U p is an A-module, and 0 = u ∈ U p . Then It was shown in Terauds and Sumner (2019) that there exist p N for all N > 3 such that χ p (z) = k p = 0; that is, there are always some C[S N ]-modules that are projected down to zero in A. We note that this is not true in the more general case considered in Sect. 4 (for example if z is constructed from a different symmetry group). Now, the dimension expression (16) suggests that we will not be able to decompose the algebra A into a direct sum of irreducible submodules as we can for C[S N ] (17). By (Etingof et al. 2011, Thm. 3.5.8), if this were possible with the irreducible submodules W p from above, then the dimension of A would be p N k 2 p . We have not yet verified here that these W p comprise all irreducible submodules of A, nor that they are all distinct (not isomorphic to one another), but this is indeed the case (Steinberg 2016, Thm. 4.23). The difference between the dimension of A and that gained from the irreducible modules here is signalling that not all of the information about A can be represented by the action of A-in this case, the left action of A on its irreducible modules, and on itself, is not injective.
To see this, let $W$ be an irreducible module of $A$. Then, since $z$ is a left identity in $A$, $z$ must act as the identity on $W$. But then, for any $z\sigma, z\sigma' \in A$ such that $z\sigma z = z\sigma' z$,
$$(z\sigma) \cdot w = (z\sigma z) \cdot w = (z\sigma' z) \cdot w = (z\sigma') \cdot w \tag{19}$$
for all $w \in W$. From Theorem 2.1(ii), such $z\sigma \neq z\sigma' \in A$ correspond to physically distinct genomes that share the same path probabilities and likelihood functions: we have In the language of algebras, $A$ has a non-trivial radical (Etingof et al. 2011, Def. 3.5.1), since (for $N > 3$) there are non-zero elements $z\sigma - z\sigma' \in A$ that annihilate all irreducible modules of $A$. As a concrete example, consider the following.
Example 3.6 Let $\sigma = (1, 2)$, $\sigma' = (2, 3) \in S_N$. Setting $r = (1, 2, \ldots, N) \in D_N$, we observe that $\sigma' = r \sigma r^{-1}$, so that $z\sigma' z = z\sigma z$. But there exists no $d \in D_N$ such that $\sigma' = d\sigma$, and thus $z\sigma \neq z\sigma'$. ♦
For our practical purposes, this is perfect: the algebra sees genomes as distinct entities, but their representations do not distinguish between genomes corresponding to an equivalence class $[\sigma]_D$, whose likelihood functions are the same. Further, whilst $z\sigma$ and $z\sigma'$ correspond to distinct genomes, if we consider them as rearrangements, they are not distinct, since they have the same action. We shall return to this presently, when we define models in the genome algebra.
Proof For the reverse implication, we argue as above, replacing w in (19) by each basis element zσ i to verify that the matrices are the same. Conversely, if the regular representations coincide, then we immediately have that (zσ )z = (zσ )z.
We now explicitly consider the irreducible representations of $A$ on the irreducible submodules and use these to rewrite the regular character of $A$ in terms of irreducible characters of $A$. Let $p \vdash N$ such that $k_p > 0$ and consider the module $W_p$ of $A$ which, as in the proof of Theorem 3.5, has a basis $\{v_1, \ldots, v_{k_p}\} \subseteq \mathbb{R}^{D_p}$ of orthonormal eigenvectors. Now, the action of $A$ on $W_p = z \cdot V_p$ is inherited from the action of $\mathbb{C}[S_N]$ on $V_p$, so for arbitrary $\tau \in A$, we define the $(k_p \times k_p)$ representation matrix $\rho^A_p(\tau)$ on $W_p$ via the action of $\rho_p(\tau)$ on the basis vectors $v_j$:
$$\left[\rho^A_p(\tau)\right]_{ij} := v_i^T \rho_p(\tau) v_j, \qquad 1 \le i, j \le k_p.$$
More concisely, setting $Q_p$ to be the $(D_p \times k_p)$ matrix with $\{v_1, \ldots, v_{k_p}\}$ as columns, we have
$$\rho^A_p(\tau) = Q_p^T \rho_p(\tau) Q_p. \tag{20}$$
Clearly (also c.f. (19)), $\rho^A_p(z)$ is the $(k_p \times k_p)$ identity matrix for each such $p \vdash N$. For each $p \vdash N$ such that $k_p = 0$, we formally define $\rho^A_p$ to be the zero representation. Now we may calculate the irreducible characters $\chi^A_p$ of $A$ and see that they coincide with the irreducible characters $\chi_p$ of $\mathbb{C}[S_N]$ restricted to $A$.

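Proposition 3.8 For each $p \vdash N$ and $\tau \in A$, $\chi^A_p(\tau) = \chi_p(\tau)$.
Proof By linearity, we need only verify the claim on a generic basis element, $z\sigma \in A$. Again utilising the orthonormal eigenvectors $\{v_1, \ldots, v_{D_p}\}$ of $\rho_p(z)$, where those for $i \le k_p$ correspond to the eigenvalue 1 and the remainder to the eigenvalue 0, we have
$$\chi^A_p(z\sigma) = \sum_{i=1}^{k_p} v_i^T \rho_p(z\sigma) v_i = \sum_{i=1}^{D_p} v_i^T \rho_p(z\sigma) \rho_p(z) v_i = \mathrm{tr}\!\left(\rho_p(z\sigma z)\right) = \chi_p(z\sigma),$$
where in the final step we have used the cyclicity of the trace and the idempotency of $z$ (Proposition 3.2).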
Combining Propositions 3.4 and 3.8 gives the desired character decomposition.
Corollary 3.9 For arbitrary $\tau \in A$,
$$\chi^A_{\mathrm{reg}}(\tau) = \sum_{p \vdash N} D_p\, \chi^A_p(\tau).$$
Having defined and decomposed the regular character of $A$, we are ready to return to the likelihood calculations. Using the equivalence of the characters of $A$ and $\mathbb{C}[S_N]$ on the algebra $A$, along with the interplay between the genome and model symmetry, we now verify that we may work entirely in $A$ to calculate the genome path probabilities and thus the likelihood functions, as defined in the previous section (2).
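Theorem 3.10 Let $\sigma \in S_N$ and $k \in \mathbb{N}_0$. Then
$$\alpha_k(\sigma) = \frac{1}{K}\, \chi^A_{\mathrm{reg}}(z\sigma^{-1} z s^k) = \frac{1}{K} \sum_{p \vdash N} D_p\, \chi^A_p(z\sigma^{-1} z s^k). \tag{21}$$
Proof From (5),
$$\alpha_k(\sigma) = \frac{2N}{N!}\, \chi_{\mathrm{reg}}(\sigma^{-1} z s^k) = \frac{1}{K}\, \chi_{\mathrm{reg}}(z\sigma^{-1} z s^k),$$
since $z$ and $s$ commute, $z$ is idempotent and the trace is cyclic. The first equality is then clear from Proposition 3.4 and the second from Corollary 3.9.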
We have mentioned the importance of the 'identity counting' property of the regular character, that is, Proposition 3.4 (i), but this combinatorial component is somewhat hidden in the proof of Theorem 3.10. To highlight it, one may begin with the identity (3) stated in the previous section and, for any given genome zσ (σ ∈ S N ), multiply by zσ −1 z to obtain By observing that there are exactly 2N values of τ ∈ S N for which zσ −1 τ = z, one thus sees directly that the coefficient of z in the expansion of zσ −1 zs k is α k (σ ).
Note that we could have simplified the above character expression (21) a little, that is, However, as we did in the algebra C[S N ], we want to diagonalise the matrices representing the model element, namely the matrices ρ A p (zs). So we keep the middle z and write For each p N , as in the proof of Corollary 3.3, we can choose ρ p (s) to be symmetric; thus by the definition (20) each matrix ρ A p (zs) is symmetric and thus diagonalisable.
Then we obtain where E A p,i is the projection onto the eigenspace of the ith eigenvalue, λ p,i , of ρ A p (zs). Now, finally substituting the path probabilities (22) into the theoretical likelihood expression (2), we obtain It is clear from Theorem 3.10 that the likelihood expression (23), involving only elements of the genome algebra A, is equal to that (6) gained via the group algebra C[S N ]. We now show that the above is, really, a simplified version of (6): that is, by working in the smaller algebra we have eliminated the many eigenvalue terms that occur with zero coefficients.

Proposition 3.11
For each p ⊢ N such that W p ≠ {0}, the eigenvalues of the matrix ρ A p (zs) are exactly the eigenvalues of ρ p (s) that occur with non-zero coefficient in the likelihood expression (6).

Proof
Let p ⊢ N such that W p ≠ {0}. As above, take the set {v 1 , v 2 , . . . , v D p } ⊆ R D p of orthonormal eigenvectors for both ρ p (z) and ρ p (s), with the first k p corresponding to the eigenvalue 1 of ρ p (z), and form the matrix Q p with the first k p vectors as columns. Then, as in (20), where each λ i is clearly an eigenvalue of both ρ A p (zs) and ρ p (s) (and the λ i are not necessarily distinct). Suppose λ is an eigenvalue of ρ p (s) that does not appear in the matrix of (24). Then λ has corresponding eigenvector(s) {v j : j ∈ J }, where J ⊆ {k p + 1, k p + 2, . . . , D p }. But then, for any σ ∈ S N , the coefficient of the λ term in the likelihood expression is the partial trace by (10) and (11).
Note that, although in the proof of Proposition 3.11 we construct each ρ A p (zs) as a diagonal matrix (in which case the projections onto the eigenspaces would be diagonal matrices of 1s and 0s), we do this only to verify that the representation has the required properties, and we utilise the eigenvectors of the representation ρ p (s). In practice, the whole point is to not calculate the much bigger representations ρ p (s). That is, when implementing calculations, we would expect to construct a basis for each irreducible module W p directly, hence the general form of the projections in (23).
We note that the equivalence of the path probabilities and thus likelihoods on the classes [σ ] D and [σ ] D R stated in Theorem 2.1 can alternatively be obtained by working directly in the genome algebra A. We omit the proof here since we shall prove a more general version of the result in Sect. 4.
Note that this does not imply that k p = D p /(2N) for each (or any) p ⊢ N , rather that on average, and asymptotically, the dimension of each irreducible submodule W p of A is Given that the dimension of the algebra A is still N !/(2N), this does not significantly reduce the computational complexity. However, since the multiplicity of the irreducible submodules in A is the same as in the group algebra (25), the reduction in the dimension of the irreducible submodules is (relatively) much larger than the reduction in total dimension. Example 3.12 Consider N = 6. There are N ! = 720 permutations in S 6 , so the dimension of the regular representation of C[S 6 ] is 720. The dimensions of the irreducible modules V p of C[S 6 ] (given as a list rather than a set as they are not all distinct) are [D p : p ⊢ 6] = [1, 5, 9, 10, 5, 16, 10, 5, 9, 5, 1].
Thus, for any rearrangement model, each likelihood expression will be a sum of at most eight terms, corresponding to at most eight distinct eigenvalues. ♦ We note that such dimension reductions are less striking for larger N . In any case, we see a significant theoretical gain here: the genome algebra A incorporates the symmetry of the genomes and models into a unified framework, within which the problem can be formulated and the computations performed. To highlight this, we next consider the regular representations of s in the algebra C[S N ] and of zs in the algebra A as Markov matrices; then we conclude this section by re-formulating the model in the genome algebra framework.
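The dimension counts quoted in Example 3.12 can be checked with a few lines of arithmetic. The following is a minimal sketch (the variable names are illustrative, not from the original computations): the squared dimensions of the irreducible modules of C[S 6 ] must sum to 6! = 720, and the genome algebra has one basis element per genome, so its dimension is N !/(2N).

```python
from math import factorial

N = 6
# Dimensions D_p of the irreducible modules of C[S_6], as listed in Example 3.12.
irreducible_dims = [1, 5, 9, 10, 5, 16, 10, 5, 9, 5, 1]

# In the regular representation each V_p occurs D_p times, so the squared
# dimensions must add up to |S_6| = 6!.
group_algebra_dim = sum(d * d for d in irreducible_dims)

# Dimension of the genome algebra A = zC[S_N]: one basis element per genome.
genome_algebra_dim = factorial(N) // (2 * N)

print(group_algebra_dim, genome_algebra_dim)
```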

The Markov interpretation
In the group algebra, C[S N ], the rows and columns of the regular representation are determined by the N ! permutations σ i ∈ S N . In particular, where for each rearrangement permutation a ∈ M, the (i, j)th entry of ρ reg (a) is 1 if aσ j = σ i and 0 otherwise, so that the ρ reg (a) matrices have exactly one '1' in each row and column. Then ρ reg (s), as a convex sum of Markov matrices, is itself a Markov matrix. The jth column of ρ reg (s) contains |M| non-zero entries, each equal to a unique w(a), since for the distinct permutations a ∈ M, the permutations aσ j are all distinct.
Thus ρ reg (s) is the transition matrix of a discrete Markov chain where the states are the N ! permutations in S N and the (i, j)th entry is the probability of permutation σ j transitioning into permutation σ i via one rearrangement chosen from the model M. That is, It is clear from this formulation that the matrix ρ reg (s) is symmetric if and only if the model has the rearrangement reversibility property (M2). Thus, since the stationary distribution on the Markov chain is the uniform distribution on the states, the reversibility property (M2) of the model is equivalent to reversibility of the Markov model. Now, in the algebra A, the corresponding matrix representing the model element is As above, the matrices on the right hand side represent basis elements of the algebra, here za for a ∈ M. Although the basis elements here do not form a group (so their regular representations are not, in general, zero-one matrices), each of the ρ A reg (za) is again a Markov matrix: for a given a ∈ M, the (i, j)th entry of ρ A reg (za) is the coefficient of zσ i in the expansion of (za)(zσ j ) and, since z = (1/2N) Σ_{d ∈ D N} d, the expansion is a convex sum. Thus the entries in each column of each ρ A reg (za) sum to one and ρ A reg (zs), as a convex sum of Markov matrices, is indeed a Markov matrix.
Each basis element zσ i corresponds to a genome, so ρ A reg (zs) is the transition matrix of a Markov chain where the states are genomes. The (i, j)th entry, which we calculated as the proportion of the expansion of (zs)(zσ j ) that is equal to zσ i , is of course the probability of the genome zσ j transitioning into the genome zσ i in one step, via the model.
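The construction of the genome-level Markov matrix can be illustrated on a small case. The sketch below is an assumption-laden toy, not the authors' implementation: permutations are 0-based tuples, a genome zσ is stored as the set of its 2N instances dσ, and the model consists of a single self-inverse rearrangement (swapping the first two positions) with weight 1. The helper names `compose`, `dihedral`, `genome_of` and `transition_matrix` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def compose(p, q):
    """Return the permutation p∘q (apply q first, then p), as a tuple."""
    return tuple(p[q[i]] for i in range(len(q)))

def dihedral(n):
    """The 2n rotations and reflections of n circular positions (n >= 3)."""
    rotations = [tuple((i + k) % n for i in range(n)) for k in range(n)]
    reflections = [tuple((k - i) % n for i in range(n)) for k in range(n)]
    return rotations + reflections

def genome_of(sigma, D):
    """The set of instances {d∘sigma : d in D} that make up the cloud z·sigma."""
    return frozenset(compose(d, sigma) for d in D)

def transition_matrix(n, model):
    """model maps each rearrangement a (a tuple) to its weight w(a).

    Entry P[i][j] is the coefficient of genome i in the expansion of
    (zs)(z sigma_j): each of the 2n instances of genome j is hit by a with
    probability w(a)/(2n), and the result is binned by genome.
    """
    D = dihedral(n)
    genomes = sorted({genome_of(s, D) for s in permutations(range(n))}, key=min)
    index = {g: i for i, g in enumerate(genomes)}
    P = [[Fraction(0)] * len(genomes) for _ in genomes]
    for j, g in enumerate(genomes):
        for a, w in model.items():
            for instance in g:
                i = index[genome_of(compose(a, instance), D)]
                P[i][j] += Fraction(w) / len(D)
    return genomes, P

N = 5
swap_first_two = (1, 0, 2, 3, 4)  # illustrative model: one inversion of two regions
genomes, P = transition_matrix(N, {swap_first_two: 1})
K = len(genomes)
column_sums = [sum(P[i][j] for i in range(K)) for j in range(K)]
is_symmetric = all(P[i][j] == P[j][i] for i in range(K) for j in range(K))
print(K, set(column_sums), is_symmetric)
```

Exact fractions are used so that the column sums and the symmetry of the matrix (expected here because the single rearrangement is its own inverse) can be compared without rounding.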

Permutation clouds: a unifying concept
In the permutation approach detailed in Sect. 2, we considered rearrangement events to be individual permutations acting on individual permutations. In the setting of the genome algebra, we represent both genomes and rearrangements by permutation clouds, each of which is a sum of permutations weighted by their probabilities (zσ = (1/2N) Σ_{d ∈ D N} dσ ). A single rearrangement event is here modelled by a permutation cloud za acting on a permutation cloud zσ . Mathematically, this event results in a convex combination of permutation clouds Σ_i c i zσ i ; biologically, it results in one of the genomes zσ i , according to the probability distribution given by the coefficients c i .
The permutation cloud view of circular genomes seems to us quite natural. To observe a genome, we fix an orientation and a reference frame, and assign to it a single permutation (any one, from the appropriate equivalence class [σ ], with probability 1/(2N)). We refer to this as an instance of the genome. Theoretically, however, the genome exists simultaneously as all of its possible physical orientations in space; it is the cloud, zσ .
What about rearrangements? For a rearrangement permutation a ∈ S N and d ∈ D N , the result of the action da on σ ∈ S N is d(aσ ) ∈ [aσ ], that is, it results in the same genome as a acting on σ . So we can think of za acting on zσ as encompassing (all orientations of (a acting on (all orientations of σ ))).
Of course, the action of za on zσ also incorporates the dihedral symmetries of a as an action, that is, dad −1 for d ∈ D N . For a biological model (M, w, dist) for evolution of genomes as permutations, under the assumption of dihedral symmetry, we wrote (9) where for each a k and all d ∈ D N , w(da k d −1 ) = w(a k ). Since da k d −1 ∈ [a k ] D , the action of z(da k d −1 ) on zσ is the same as the action of za k on zσ (see Proposition 3.7) and thus each ρ A reg (z(da k d −1 )) = ρ A reg (za k ). Having shown that the MLE computations can be performed in the genome algebra A, and discussed the representation of both genomes and rearrangements as permutation clouds in this algebra, it remains to reformulate the model within this framework. Given the model (27)  Since the dihedral symmetry of the genomes is built into the algebra A, specifying the model to consist of elements of A in this way makes the dihedral symmetry requirement (M1) redundant. Model reversibility in this setting is formulated as This condition is sufficient to ensure that the irreducible representations of zs are diagonalisable, which is convenient for computations. Although the algebra A has a left identity, it does not contain inverses, so za −1 is not (in general) an inverse of za. However, as we shall see in the next section, model reversibility is, further, equivalent to the reversibility of the Markov model. We conclude this section with an example to illustrate some of these key concepts.
Example 3.13 Suppose we wish to consider a model consisting only of "small inversions", which we will take to be inversions of two or three regions. In the permutation framework, we would define this model to be Here there are N , rather than 2N , distinct instances of each rearrangement type, since for inversions, 4 each flip coincides with a rotation. For the rearrangement probabilities, one could choose the uniform distribution, w(a) = 1/(2N) for all a ∈ M, or one may consider the larger inversions to be less likely and set, for all a ∈ M, In the genome algebra, the model is simpler to express; we take the rearrangement instances (1, 2) and (1, 3) and the model is The weight functions corresponding to the above would then be w A (z(1, 2) (1, 3)) = 1/3 . One may recall from Example 3.6 that z(1, 2) ≠ z(2, 3); however, one need not (and indeed should not) include both of these in the rearrangement model since they have the same action: z(1, 2) · zσ = z(2, 3) · zσ for any genome zσ ∈ A, since z(1, 2)z = z(2, 3)z. Further, one must be aware of complementary rearrangements.
For example, in the case N = 5, the actions of z(1, 2) and z(1, 3) coincide: inverting a two-region segment is, under dihedral symmetry, the same rearrangement as inverting the complementary three region segment (correspondingly, when N = 5, z(1, 2)z = z(1, 3)z). ♦ One can easily eliminate the possibility of such 'rearrangement redundancies' by reformulating the model in the genome algebra as a set of elements of the form zaz; we do this in the next section (30). More on such considerations, along with explicit examples of rearrangement models in the oriented region case, may be found in Terauds et al. (2021); a deeper algebraic consideration of rearrangements is given in Stevenson et al. (2022).
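The redundancies discussed in Example 3.13 can be checked by brute force for small N. The sketch below rests on one observation that is an assumption of this illustration rather than a statement from the text: za = zb exactly when a and b lie in the same coset D N a, and zaz = zbz exactly when they lie in the same double coset D N a D N . Positions are 0-based, so the inversion written (1, 2) in the text is the swap of positions 0 and 1; all function names are hypothetical.

```python
def compose(p, q):
    """Return the permutation p∘q (apply q first, then p), as a tuple."""
    return tuple(p[q[i]] for i in range(len(q)))

def dihedral(n):
    """The 2n rotations and reflections of n circular positions (n >= 3)."""
    rotations = [tuple((i + k) % n for i in range(n)) for k in range(n)]
    reflections = [tuple((k - i) % n for i in range(n)) for k in range(n)]
    return rotations + reflections

def swap(n, i, j):
    """The permutation of range(n) exchanging positions i and j."""
    p = list(range(n))
    p[i], p[j] = p[j], p[i]
    return tuple(p)

def cloud(a, D):
    """Support of z·a: the coset {d∘a : d in D}."""
    return frozenset(compose(d, a) for d in D)

def action_class(a, D):
    """Support of z·a·z: the double coset {d∘a∘e : d, e in D}."""
    return frozenset(compose(compose(d, a), e) for d in D for e in D)

# N = 6: (1,2) and (2,3) are different clouds but act identically;
# (1,2) and (1,3) are genuinely different rearrangements.
D6 = dihedral(6)
a12, a23, a13 = swap(6, 0, 1), swap(6, 1, 2), swap(6, 0, 2)
same_cloud_6 = cloud(a12, D6) == cloud(a23, D6)
same_action_6 = action_class(a12, D6) == action_class(a23, D6)
distinct_sizes_6 = action_class(a12, D6) != action_class(a13, D6)

# N = 5: inverting two regions coincides with inverting the complementary three.
D5 = dihedral(5)
complementary_5 = action_class(swap(5, 0, 1), D5) == action_class(swap(5, 0, 2), D5)

print(same_cloud_6, same_action_6, distinct_sizes_6, complementary_5)
```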

More general models of genomes
The construction of the genome algebra A = zC[S N ] in Sect. 3 was determined by assumptions we made about how to model the genomes. In particular, following on from previous work (Serdoz et al. 2017;Sumner et al. 2017;Terauds and Sumner 2019), we chose to model circular genomes, without considering orientation of regions, which meant an instance of the genome could be represented by a permutation σ ∈ S N . We modelled the genomes without a distinguished position, which meant the genome symmetries corresponded to the dihedral group D N . In this section, we outline how the constructions and techniques presented for this specific case can be generalised to cover different genomic models: any for which genome (and rearrangement) instances can be represented as elements of a group G.
Suppose, for example, that one wanted to vary the above model to include an origin of replication in the circular genomes. We would model this as a distinguished position and the genomes would then have no rotational symmetry, only reflectional. The symmetry group would thus be Z N = {e, f }, the symmetry element z = (1/2)(e + f ), and each genome an element zσ = (1/2)(σ + f σ ) ∈ zC[S N ]. The model would naturally reflect this symmetry, with rearrangements taking the form za, a ∈ S N . In particular, this case allows for rearrangements at different positions on the genome, relative to the origin of replication, to be assigned different probabilities. With appropriate choices of rearrangements, this framework could also be used to represent linear genomes.
To include orientation of genes, one would use a different underlying group, for example the hyperoctahedral group H N of signed permutations (as outlined in Egri-Nagy et al. 2014), and a symmetry group of choice (for example, a copy of the dihedral group in the case of a circular genome with no distinguished positions). An explicit consideration of the genome algebra for the signed region case, including some detailed examples, may be found in Terauds et al. (2021).
To construct the general genome algebra, we begin with a group G, whose elements represent instances of the genomes of interest, and a subgroup Z ⊆ G that represents the physical symmetries of these genomes. 5 We consider rearrangements such that a single rearrangement event for an instance g ∈ G of a genome can be modelled via the left action of a particular element a ∈ G on g. The terms in the following definition reflect our applications of the objects, but obviously the subsequent results concerning the algebras hold whether or not one applies them to genomes. We call A the genome algebra of G with Z , A 0 the class algebra of G with Z and z the symmetry element of A and A 0 .
Rather than proceeding as in Sects. 2 and 3 , where we first defined the rearrangement model, path probabilities and likelihoods for genome instances (group elements) and then showed that the calculations could be performed in the genome algebra, we will here formulate these concepts (and then perform the computations) entirely in the genome algebra A. We include the class algebra A 0 for completeness. Following the observations in the previous section, it seems a natural next step to consider the algebra formed by combining together the elements of A that act indistinguishably. However, we shall see that this lower dimensional algebra is not the appropriate setting for our calculations. Proof Since the sets [g] and [g] D for g ∈ G are respectively right cosets and double cosets of G with respect to the subgroup Z , it is clear that they are equivalence classes.
We use the label 'D' for the classes defined in (ii) above to refer to the double coset structure of the sets [g] D (noting that this conveniently coincides with the original usage (Terauds and Sumner 2019) of the label, which referred to the dihedral symmetry in the circular genome case). The following statements can be derived directly from the subgroup properties of Z and Lemma 4.2 (c.f. the corresponding results in Sect. 3). For the remainder of this section, we fix a basis for each of A and A 0 , as defined in (ii) and (iii) above. Remark 4.4 Each equivalence class [g] D can be viewed as an orbit of G under an action of the group Z × Z and thus, from (iii) above, the dimension L of A 0 may be calculated via Burnside's lemma (James and Liebeck 2001, Prop. 29.4). By combining this with the dual orthogonality relations on the group G, one can directly obtain the dimension result stated below in Theorem 4.5 (i).
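The counting argument of Remark 4.4 can be tried out on a small example. The sketch below takes G = S 5 and Z = D 5 (a choice made only for illustration) and counts the classes [g] and [g] D twice: once by listing cosets and double cosets directly, and once via Burnside's lemma for the action (a, b) · g = a g b⁻¹ of Z × Z on G. The function names are hypothetical.

```python
from itertools import permutations

def compose(p, q):
    """Return the permutation p∘q (apply q first, then p), as a tuple."""
    return tuple(p[q[i]] for i in range(len(q)))

def inverse(p):
    """Return the inverse permutation of p."""
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

def dihedral(n):
    """The 2n rotations and reflections of n circular positions (n >= 3)."""
    rotations = [tuple((i + k) % n for i in range(n)) for k in range(n)]
    reflections = [tuple((k - i) % n for i in range(n)) for k in range(n)]
    return rotations + reflections

def count_classes(G, Z):
    """Return (number of cosets Zg, number of double cosets ZgZ, Burnside count)."""
    cosets = {frozenset(compose(a, g) for a in Z) for g in G}
    double_cosets = {
        frozenset(compose(compose(a, g), b) for a in Z for b in Z) for g in G
    }
    # Burnside: number of orbits = average number of fixed points of (a, b).
    fixed_points = sum(
        1 for a in Z for b in Z for g in G
        if compose(compose(a, g), inverse(b)) == g
    )
    return len(cosets), len(double_cosets), fixed_points // (len(Z) ** 2)

N = 5
G = list(permutations(range(N)))
Z = dihedral(N)
K, L_direct, L_burnside = count_classes(G, Z)
print(K, L_direct, L_burnside)
```

Here K is the number of distinct genomes (the dimension of A) and L is the number of double cosets (the dimension of A 0 ); the direct enumeration becomes expensive quickly as N grows, which is the practical appeal of the Burnside count.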

Proposition 4.3 Let G be a finite group with subgroup Z ⊆ G. Let A and
We note that working 'entirely' in the genome algebra does not mean that we forget about the group G. In practice, one would observe a genome with a particular orientation and reference frame, thus as an instance g ∈ G, and then identify the genome as the cloud zg ∈ A for the purposes of computation. There are K = |G|/|Z | distinct genomes, corresponding to the distinct basis elements zg of the genome algebra A. Similarly, one would conceive a rearrangement initially as an instance a ∈ G and then lift to za in the genome algebra. Considering all orientations of a rearrangement instance on all orientations of a genome corresponds to a left action of the algebra A on itself. Distinct elements of A that correspond to the same element of A 0 act indistinguishably, since Thus there are L = dim(A 0 ) distinct rearrangement actions. 6 The (left) regular representations ρ A reg of the genome algebra A and ρ 0 reg of the class algebra A 0 can be constructed in the usual way (c.f. (12)) via the bases fixed above. As in Sect. 3, one readily verifies that ρ A reg (z) is the K × K identity matrix and that ρ A reg (zg) T = ρ A reg (zg −1 ). Since z is an identity in A 0 , ρ 0 reg (z) is the L × L identity matrix. In this case, the equivalence classes [g] D need not be the same size and thus, in general, ρ 0 reg (zgz) T ≠ ρ 0 reg (zg −1 z). 7 We denote the regular characters of A and A 0 by χ A reg and χ 0 reg respectively and note that these take real values on any algebra element that is a real linear combination of basis elements.
Recall that, by Maschke's theorem (Etingof et al. 2011, Thm. 4.1.1), the group algebra C[G] of any finite group G can be written as a direct sum over its irreducible modules.

Theorem 4.5 Let G be a finite group with subgroup Z
and denote the corresponding irreducible representations and characters of C[G] by ρ i and χ i respectively. Then the following hold.

6 We note that most of these mathematically possible rearrangements would not correspond to biologically plausible ones, so would not appear in rearrangement models in practice. For a deeper consideration of biologically plausible rearrangements from an algebraic perspective, see Stevenson et al. (2022).

7 One can verify via a simple counting argument that ρ 0

(i) For each i, W i := z · V i is either {0} or an irreducible A 0 -module and
Denoting the corresponding characters of A and A 0 by χ A i and χ 0 i respectively, and defining for all g ∈ G and all 1 ≤ i ≤ M. Thus the characters where χ reg and χ A reg denote the regular characters of C[G] and A respectively.
Proof For (i), we use (Steinberg 2016, Prop. 4.18, Thm. 4.23). For the remaining results, we use the observation (28) and proceed just as for the corresponding results in Sect. 3. Note that in this general setting we cannot assume that the irreducible representations ρ i are orthogonal on G, but we can choose them to be unitary (Etingof et al. 2011, Thm. 4.6.2). This means that the corresponding irreducible representations of z in C[G] are self adjoint, and thus each ρ i (z) is unitarily diagonalisable, so that its eigenvectors form an orthonormal basis for C D i ∼ = V i . Since the eigenvectors need not be real, the only difference in the proofs is that we need the conjugate transposes, not just transposes, of these vectors.
The above results imply that A 0 ∼ = A/Rad(A) (Etingof et al. 2011, Thm. 3.5.4), which formalises the relationship between the genome algebra and the class algebra: A 0 is obtained from A by factoring out the elements of A that act trivially. We have previously expressed this as A 0 combining together the elements of A that act indistinguishably. Another aspect of this is the following.
where {zg 1 , . . . , zg K } is our fixed basis for A and each p i is the proportion of the expansion of zazg that is equal to zg i or, equivalently, the probability that the rearrangement za acting on the genome zg will result in the genome zg i . Thus where we have rearranged and collected terms in the final step so that, for each i, Σ_{j=1}^{q} w(za j z) p j,i is the total probability that the genome z will be transformed into the genome zg i via some (single) rearrangement chosen from the model. Thus and, by repeatedly applying s , one sees that Now, for g ∈ G an instance of the genome of interest, multiply (31) on the right by g −1 to obtain Since zg i g −1 = z if and only if zg i = zg, we see that α k (zg) is the coefficient of z in the expansion of s k zg −1 , and thus by Theorem 4.5 (v). Theorem 4.5 allows us to decompose the regular character in (32) into irreducible characters; however, we will also need to diagonalise the irreducible representation matrices.
Lemma 4.8 Let G be a finite group with subgroup Z ⊆ G. Let (M, w, dist) be a biological model for evolution of genomes represented by elements zg ∈ A of the genome algebra (where g ∈ G) and let s ∈ A be the corresponding model element. If the model is reversible, then the following hold.

(i) The irreducible representation matrices of the model element s in A are diagonalisable. (ii) The regular representation of s in A is symmetric.
Proof (i) Suppose that the model is reversible and let 1 ≤ i ≤ q. We have (c.f. (20)) where Q is the D i × k i matrix of orthonormal eigenvectors for ρ i (z). Since G is a finite group, we may choose the irreducible representation ρ i on G to be unitary. Then, writing ρ i ( s z) as a sum of matrices of the form w(zaz) (ρ i (zaz) + ρ i (za −1 z)) (omitting the second term if a = a −1 ), each of which is self-adjoint, we see that ρ i ( s z) is self-adjoint and thus so is ρ A i ( s ). For claim (ii), we proceed similarly, using the observation that ρ A reg (za) T = ρ A reg (za −1 ).

(i) For any k ∈ N 0 , the probability that the reference genome z is transformed into the genome zg via k rearrangements chosen from the model is where for each i, E A i, j is the projection onto the eigenspace of the jth eigenvalue λ i, j of ρ A i ( s ).
(ii) If the distribution of rearrangement events in time is dist = Poisson(1), then the probability that the reference genome z is transformed into the genome zg via the given model in time T is given by the likelihood function
(iii) For any genome zh ∈ A with an instance h ∈ [g] D ∪ [g −1 ] D , the path probabilities and likelihood functions of zg and zh coincide.
Proof The first expression for the path probability α k (zg) was gained above (32). To gain the second, we use the decomposition of the regular character from Theorem 4.5 (iv), and then for each i, use the cyclicity of the trace to write Then, from Lemma 4.8, we may diagonalise ρ A i ( s ) to gain the second expression. Analogously to the definition in Sect. 2, but with genomes instead of elements of G, we define the likelihood function as Then substituting in the expression from (i) and simplifying the power series gives (ii).
(iii) Let h ∈ G such that zhz = zgz or zhz = zg −1 z. We show that α k (zg) = α k (zh) for all k ∈ N 0 , which implies (iii). Let k ∈ N 0 . Since the trace is cyclic, we have where the second equality was obtained by taking the transpose of the argument and applying Lemma 4.8. Then from (i) it is clear that α k (zg) = α k (zg −1 ) and these coincide with α k (zh) by Corollary 4.6.
Since the regular representation of the model element is symmetric, by Lemma 4.8, and the equilibrium distribution is the uniform distribution on the set of genomes, reversibility of the model M is equivalent to time reversibility of the underlying Markov process. As in Sect. 3.1, the regular representation of the model element in A is the transition matrix for a Markov chain with states being genomes, with the probability that genome zg j transitions into genome zg i via k rearrangement steps from the model given by Reversibility then means that for any genomes zg, zh ∈ A, the probability of zg transforming into zh in k steps via the given model is the same as that of zh transforming into zg in k steps. In terms of path probabilities, which is just a special case of Theorem 4.9 (iii). Model reversibility thus implies that the MLE distance is 'directionless', or symmetric, as is any evolutionary distance measure based on path probabilities calculated in this framework.
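The Markov-chain reading of path probabilities and the 'directionless' property can be illustrated numerically. The sketch below is a toy under stated assumptions, not the authors' method: it takes G = S 5 with dihedral symmetry, a model with one self-inverse rearrangement of weight 1, computes k-step probabilities by repeated matrix-vector products rather than by the eigenvalue decomposition, and assumes that with dist = Poisson(1) the likelihood is the Poisson mixture Σ_k e^{−T} T^k /k! · α k , truncated at a finite number of steps. All names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import exp

def compose(p, q):
    """Return the permutation p∘q (apply q first, then p), as a tuple."""
    return tuple(p[q[i]] for i in range(len(q)))

def inverse(p):
    """Return the inverse permutation of p."""
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)

def dihedral(n):
    """The 2n rotations and reflections of n circular positions (n >= 3)."""
    rotations = [tuple((i + k) % n for i in range(n)) for k in range(n)]
    reflections = [tuple((k - i) % n for i in range(n)) for k in range(n)]
    return rotations + reflections

def genome_of(sigma, D):
    """The set of instances {d∘sigma : d in D} that make up the cloud z·sigma."""
    return frozenset(compose(d, sigma) for d in D)

def build_chain(n, model):
    """Genome-level transition matrix; columns are sources, rows are targets."""
    D = dihedral(n)
    genomes = sorted({genome_of(s, D) for s in permutations(range(n))}, key=min)
    index = {g: i for i, g in enumerate(genomes)}
    P = [[Fraction(0)] * len(genomes) for _ in genomes]
    for j, g in enumerate(genomes):
        for a, w in model.items():
            for instance in g:
                i = index[genome_of(compose(a, instance), D)]
                P[i][j] += Fraction(w) / len(D)
    return D, index, P

def path_probabilities(P, start, target, kmax):
    """[alpha_0, ..., alpha_kmax]: probability of start -> target in exactly k steps."""
    v = [Fraction(int(i == start)) for i in range(len(P))]
    alphas = [v[target]]
    for _ in range(kmax):
        v = [sum(row[j] * v[j] for j in range(len(v))) for row in P]
        alphas.append(v[target])
    return alphas

def likelihood(alphas, T):
    """Truncated Poisson mixture (assumed form): sum_k e^{-T} T^k / k! * alpha_k."""
    total, weight = 0.0, exp(-T)
    for k, alpha in enumerate(alphas):
        total += weight * float(alpha)
        weight *= T / (k + 1)
    return total

N = 5
D, index, P = build_chain(N, {(1, 0, 2, 3, 4): 1})
g = (1, 2, 0, 3, 4)  # an arbitrary genome instance, chosen for illustration
ref = index[genome_of(tuple(range(N)), D)]
tgt = index[genome_of(g, D)]
tgt_inv = index[genome_of(inverse(g), D)]

forward = path_probabilities(P, ref, tgt, 30)       # z -> zg
backward = path_probabilities(P, tgt, ref, 30)      # zg -> z
via_inverse = path_probabilities(P, ref, tgt_inv, 30)  # z -> zg^{-1}
print(forward == backward, forward == via_inverse, likelihood(forward, T=2.0))
```

For larger N the matrix-power route becomes impractical, which is where the decomposition into irreducible representations described in Theorem 4.9 is intended to help.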
To conclude this section, we return briefly to the class algebra A 0 . By constructing simple examples, one can verify that, in general, the regular representation matrices of non-identity basis elements in A 0 have non-zero entries on the diagonal, and thus see that we do not have an analogue of Theorem 4.5 (v) for A 0 . That is, the regular character of A 0 is not counting occurrences of the identity in elements of this algebra, and thus cannot be used to calculate path probabilities as in Theorem 4.9.
Consider the underlying Markov model here, with transition matrix given by the regular representation of the model element analogue ρ 0 reg ( s z). Now the states are the basis elements zg i z, each corresponding to an equivalence class [g i ] D . Since each equivalence class [g] D is the disjoint union of |[g] D |/|Z | equivalence classes of the form [gz] for some z ∈ Z , each basis element zgz is the average of |[g] D |/|Z | distinct genomes of the form zgz for some z ∈ Z . Thus an arbitrary element of the matrix gives us the average probability of a genome from a certain class transitioning into a genome from another class (and a diagonal element gives the probability of a transition within a class). This is not refined enough for our purposes, since, given one genome zg and two more genomes zg′ and zg″ that are in the same class (g′ ∈ [g″] D ), the probability of transitioning between zg and zg′ need not be the same as the probability of transitioning between zg and zg″.
The information is not entirely lost, however; one can use the first column of this Markov matrix to calculate path probabilities. Given an observed instance g ∈ G of a genome, we find the appropriate basis element zg′z of A 0 such that g ∈ [g′] D , and then Of course, the fact that this path probability information exists in the regular representation does not mean that it is easy to obtain, in particular since the size of the regular representation in A 0 is likely to be rapidly increasing with the number of genomic regions (for example for G = S N and Z = D N , dim(A 0 ) is proportional to (N − 2)!) and, being unable to retain the 'first column' information through diagonalisation (as one can for the trace), one would need to calculate the kth power of the matrix for each desired path probability. We further note that calculating the equivalence classes themselves, and checking for membership of an equivalence class, is a non-trivial exercise and simply not feasible for large numbers of regions.
In any case, the class algebra A 0 is nicer in some ways than the genome algebra A, in particular in that it is decomposable (that is, isomorphic to a direct sum of its irreducible modules). This property is formally known as semisimplicity. Then, since the irreducible modules of A 0 are identical to those of the algebra A and the irreducible representations of the two algebras not only have the same dimension but coincide on the objects of interest, one may in fact choose to implement the calculations of the irreducible representations in A or A 0 and then, either way, combine the results together according to the decomposition given in Theorem 4.9.

Conclusion
We have presented a coherent algebraic framework for modelling some classes of genomes and rearrangements in an algebra that incorporates the inherent physical symmetries into each element. Algebraic frameworks for modelling genome rearrangement have been studied previously (Meidanis and Dias 2000; Moulton and Steel 2012;Francis 2014), and the importance of including genome symmetry in rearrangement distance calculations has been recognised (Egri-Nagy et al. 2014;Serdoz et al. 2017), however our unified approach, incorporating symmetry into the position paradigm framework (Bhatia et al. 2018), is new.
Beginning with the specific case of circular genomes modelled with unoriented regions and dihedral symmetry, we explicitly constructed the genome algebra from the symmetric group algebra, and showed that the MLE computations can be performed entirely within this algebra. By identifying genomes and rearrangements with single elements-permutation clouds-in the genome algebra, we have advanced previous work that identified genomes with cosets of permutations (where each element of a given coset represents an instance of a genome in a fixed physical orientation) but used the permutations as the basis elements for computation (Egri-Nagy et al. 2014;Serdoz et al. 2017;Sumner et al. 2017;Terauds and Sumner 2019). We have both explained and removed the redundancy that we identified (Terauds and Sumner 2019) in the implementation of the calculations in the symmetric group algebra.
In Terauds and Sumner (2019), we also signalled a desire to extend our technique for calculating the MLE to other settings, for example to include oriented regions or genomes with non-dihedral symmetry. We have not recorded the results of any explicit computations here, however we have algebraically verified that the technique can indeed be extended to a much more general case. For genomes where a single physical orientation can be represented by elements of a group G, and their physical symmetries by the subgroup Z ⊆ G, we defined the genome algebra of G with Z ; here, as in the special case described above, genomes and rearrangements correspond to basis elements (clouds). We showed that the path probabilities and thus the MLE can be formulated and the computations performed entirely in this genome algebra. An application of the framework to modelling signed circular genomes, using the hyperoctahedral group and two possible symmetry groups, is presented in Terauds et al. (2021), along with the results of some sample computations that illustrate how the framework may be applied to compare different models and distance measures.
Although the genome algebra has lower dimension than the group algebra (by a factor of 1/(2N) in the D N case and 1/|Z | in the general case), this does not significantly reduce the computational complexity of calculating the MLE. We have performed distance calculations, in reasonable time, for genomes with up to twelve unoriented regions (unpublished) and up to six oriented regions (Terauds et al. 2021). Work to extend our initial experimental calculations to implementation of the framework for larger numbers of regions is ongoing. In particular, we are exploring the use of simulations and intend to apply numerical approximations to make distance calculations tractable for genomes with larger numbers of regions.
Whilst the framework does not specify a particular rearrangement model (and indeed, allows choice both in the type of rearrangements allowed and their relative probabilities of occurring), we cannot currently model insertions, deletions, or duplications, since the underlying group structure means we are restricted to rearrangements that do not alter the set of regions. This is a clear limitation of the current approach. To address this, we are currently working on extending the framework to a semigroup-based approach, with the aim of accommodating insertions and deletions. Furthermore, whilst one can apply different probabilities to different rearrangements (and, depending on the genome's symmetry, rearrangements at different genomic positions), the current approach does not incorporate intergenic regions or explicitly consider breakpoints. Whether a group- or semigroup-based genome algebra approach can be devised that incorporates these biological realities, and others such as multiple chromosomes, is another question for future research.
Finally, we note that the applications of this algebraic framework are not limited to calculating MLEs. The likelihood function is built from path probabilities; since our fundamental results hold for these 'building blocks', other rearrangement distance measures that are based on path probabilities may be calculated via the genome algebra. We have shown that, via the regular representation of the genome algebra, the general genome rearrangement model can be viewed as a discrete (or, with the addition of the stochastic component, continuous time) Markov chain, and thus represented as a connected graph, generalising the Cayley graph approach (Moulton and Steel 2012;Clark et al. 2019). This facilitates the calculation of further distance measures, for example mean first passage time, as demonstrated in Terauds et al. (2021).
Funding Open Access funding enabled and organized by CAUL and its Member Institutions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.