An algebraic model for inversion and deletion in bacterial genome rearrangement

Inversions, also sometimes called reversals, are a major contributor to variation among bacterial genomes, with studies suggesting that those involving small numbers of regions are more likely than larger inversions. Deletions may arise in bacterial genomes through the same biological mechanism as inversions, and hence a model that incorporates both is desirable. However, while inversion distances between genomes have been well studied, there has yet to be a model which accounts for the combination of both deletions and inversions. To account for both of these operations, we introduce an algebraic model that utilises partial permutations. This leads to an algorithm for calculating the minimum distance to the most recent common ancestor of two bacterial genomes evolving by inversions (of adjacent regions) and deletions. The algebraic model makes the existing short inversion models more complete and realistic by including deletions, and also introduces new algebraic tools into evolutionary distance problems.


Introduction
Methods for computing the evolutionary distance between bacterial genomes are important for phylogenetic reconstruction, especially by way of contrast with organisms that have morphological characteristics and better defined species boundaries. Approaches to distances based on large-scale rearrangements have been widely studied in bacteria because they are often relatively quick to compute and can be used to complement, or even improve, trees based on other methods such as sequence comparisons (Bochkareva et al., 2018).
The bacterial genomes that we will consider have a single circular chromosome. During the evolution of bacterial genomes a frequent rearrangement event is the inversion, where the clockwise order of a contiguous block of conserved regions is reversed (Eisen et al., 2000). If the orientation of regions is taken into account, these events also reverse the orientations of regions in this block. While most early mathematical models assumed the probability of all inversions to be equal, evidence to the contrary has emerged which suggests shorter inversions are more likely (Seoighe et al., 2000;Dalevi et al., 2002;Lefebvre et al., 2003;Darling et al., 2008). With this in mind, throughout this paper we will be concerned with inversions of length two. Many other large scale changes to bacterial DNA have been observed and investigated, notably insertion of novel DNA (horizontal gene transfer), deletion of segments, translocation of segments to different locations on the genome, and duplication of segments (Saier, 2008). Deletions are special in the context of inversions however, because they can occur by the same mechanism, namely site-specific recombination (Plasterk et al., 1983). This means that inversions and deletion are related biologically in a way that other combinations of rearrangement operations are not.
Site-specific recombination acts on the circular genome by forming a synaptic complex around two copies of a specific sequence on the genome, which might be far apart on the sequence but close together in a three-dimensional sense in the cell. The recombinase then cuts the DNA at both sites and rejoins across the two, in effect locally replacing a trivial 2-braid with a braid generator (as an algebraic topologist might describe it). This event can result in the inversion of a segment of the genome relative to the rest of the genome, but can also result in the deletion of a segment, as shown in Figure 1.
Figure 1. Site-specific recombination giving rise to deletion, with the area of recombinase action shown shaded on the left. The result is in fact a pair of linked components (topologically, a "Hopf link"), but over time any component without the essential genes from the original genome (such as origin and terminus of replication) would degrade and the result would be a genome without the genetic material from that component (that is, a deletion). If the figure on the left had an even number of twists the result would be an inversion (see for example Francis (2014)).
Most rearrangement models, with few exceptions (see Alexandrino et al. (2021a) for instance), assume that the genomes in question have the same sets of regions. While inversion models need both genomes to have the same gene content (or ignore gene content that is not shared), a model incorporating both inversions and deletions can model an evolutionary history of two genomes with differing gene content under the assumption that they both evolved from a common ancestor with the union of their sets of genes. Incorporating deletions thus enables a wider class of genomes to be compared more completely, especially since in some instances (see Raeside et al. (2014)) deletions are the most frequently observed recombination event.
By thinking of bacterial genomes as sequences of region labels or integers (see Bhatia et al. (2018) for a review of these conventions), a pair of genomes σ 1 and σ 2 can be represented by signed or unsigned permutations, assuming all regions are distinct. The minimum length sequence of operations t 1 , · · · , t k such that σ 1 t 1 · · · t k = σ 2 consequently provides an estimate of the evolutionary distance between these genomes. These distances may then be used to reconstruct phylogenetic trees using methods such as neighbor-joining (Saitou and Nei, 1987).
Although finding the unsigned inversion distance between two genomes is NP-hard (Caprara, 1997), the signed inversion distance can be found in polynomial time when all inversions (of any length) are assumed to be equally likely (Hannenhalli and Pevzner, 1999). For unsigned inversions, an upper bound on the inversion distance between genomes was first provided by Watterson et al. (1982), with polynomial time algorithms for inversions of adjacent regions later established from a combinatorial perspective by Jerrum (1985) and from an algebraic perspective by Egri-Nagy et al. (2014). Polynomial time algorithms also exist for signed inversion distances (Galvao et al., 2017; Oliveira et al., 2018) (using terms such as "super short reversal").
When a polynomial time algorithm for a rearrangement distance exists, it is often possible to incorporate both deletions and insertions into the model. Polynomial time algorithms exist for calculating the minimal genomic distance under exclusively insertions and deletions (Marron et al., 2004), with insertions, deletions and signed inversions (El-Mabrouk, 2000), and with inversions, transpositions, insertions and deletions (Alexandrino et al., 2021b). Insertions and deletions have also been incorporated into other models such as double cut and join (Braga et al., 2010;Shao and Lin, 2012).
When insertions and deletions are both allowed, the minimum distance between any pair of genomes G 1 and G 2 with region labels R 1 and R 2 respectively always exists. Furthermore, this distance is symmetric in the sense that the distance from G 1 to G 2 is the same as the distance from G 2 to G 1 , because the deletion of a region can be "undone" by inserting the deleted region back into the genome and vice versa. There is, however, little work that considers the addition of deletions without also considering insertions. When considering deletions without insertions, unless we make the assumption that R 1 ⊆ R 2 or R 2 ⊆ R 1 or both (as in El-Mabrouk (2000)), there will not necessarily be an inversion/deletion sequence that transforms one genome into the other. To deal with this, we will provide a model for directly reconstructing the most recent common ancestor of G 1 and G 2 . This model will make use of partial analogues of the symmetric group, namely the symmetric inverse monoid and the symmetric inverse category, which will be discussed in Section 3. To the best of the authors' knowledge they have not yet been explicitly used in any distance-based methods. When working with these structures, it is advantageous to adopt the convention of writing maps on the right and composing from left to right. That is, we write (x)f instead of f(x), and fg instead of g ∘ f.
Hereafter we will use the term "inversion" to mean an inversion of precisely two adjacent regions. The paper proceeds as follows. In Section 2 we provide an algebraic framework for describing bacterial genomes. After introducing a number of key algebraic structures in Section 3, these structures are then used to establish an algebraic model of the inversion/deletion process in Section 4. This allows us to define a problem called the region alignment problem, where it will be shown that solving this problem over all pairs of orientations of G 1 and G 2 allows for the reconstruction of a parsimonious most recent common ancestor with respect to inversion and deletions. An exact algorithm for calculating this distance is provided in Section 5. The paper ends with a Discussion in Section 6 that describes some of the important limitations of the models here, and also some of the opportunities for further development. In particular, it is to be hoped that the introduction of the semigroup models here will lead to further work by algebraists to improve applicability and utility of genome rearrangement models.

An Algebraic Model of Bacterial Genomes
For a circular genome G with a set R of n distinct regions, different rotations and reflections of G represent different ways of viewing the genome in three dimensional space. These symmetries are accounted for by an action of the dihedral group D n , which consists of permutations in the symmetric group S n (the group of permutations of n = {1, . . . , n}) representing the rotations and reflections of an n-gon. Beginning with a set X R containing the n! words of length n whose distinct letters are from R, consider the action · of D n on X R in which each σ ∈ D n permutes the positions of the letters of a word, so that the orbit of a word consists of its rotations and reflections. The equivalence relation ∼ on X R induced by this action (where words u, v ∈ X R are related if and only if there exists σ ∈ D n such that u = σ · v) allows for the following algebraic definition of a circular genome.
Definition 2.1. A genome G with region set R is an equivalence class in the quotient set X R / ∼.
For u ∈ X R the equivalence class of u is denoted by [u], elements of each equivalence class (words in X R ) are called the reference frames of G, and for two genomes G 1 and G 2 a reference pair is an element of the Cartesian product G 1 × G 2 .
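As a rough illustration of Definition 2.1, the sketch below enumerates the reference frames of a genome. It assumes a reference frame is stored as a Python string of distinct single-character region labels and that the dihedral action amounts to rotating or reversing the word, which matches the frames listed in Figure 2; the function name `reference_frames` is hypothetical.

```python
def reference_frames(word):
    """Return all rotations and reflections of a word (its orbit under the dihedral action)."""
    n = len(word)
    frames = set()
    # The word and its reversal cover the reflections; slicing covers the rotations.
    for w in (word, word[::-1]):
        for k in range(n):
            frames.add(w[k:] + w[:k])
    return frames


frames = reference_frames("abcdefgh")
print(len(frames))                                   # 16
print("cdefghab" in frames, "hgfedcba" in frames)    # True True
```

Two words then represent the same genome exactly when one lies in the set of frames of the other.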
To visualise the reference frames of a genome, begin with the unit circle centered at (0, 0) in R 2 and specify a distinguished point at (0, 1). Subdivide the circle into n arcs of equal length proceeding clockwise from (0, 1), where the arc immediately clockwise from (0, 1) is considered to be position 1, the next arc clockwise is considered to be position 2 and so on until we reach position n (which will be the arc directly anti-clockwise from (0, 1)). If x 1 · · · x n is a reference frame of G then its diagram is obtained by labelling position i by x i ∈ R via a bijection λ : R → n from regions to positions (see Figure 2). With this in mind, these bijections may also be used to represent reference frames rather than elements of X R .
Figure 2. Given a set of regions R = {a, b, c, d, e, f, g, h}, the reference frames g 1 = abcdefgh, g 2 = cdefghab and g 3 = hgfedcba of the genome [abcdefgh] represent different ways of viewing the same circular genome in three dimensional space. The reference frame g 2 is obtained by rotating g 1 two positions anticlockwise and g 3 is obtained by reflecting g 1 in the vertical axis.
We will proceed under the assumption that each genome has arisen via the minimum possible number of inversions and deletions, which is commonly known as the parsimony criterion. This approach allows genome rearrangement problems to be viewed as combinatorial optimisation problems whose minimised solutions represent evolutionary distances in accordance with this criterion (Fertin et al., 2009). With this assumption in mind the most recent common ancestor of genomes G 1 and G 2 with region sets R 1 and R 2 respectively will have region set R 1 ∪ R 2 , noting that it must certainly contain the union of the two sets of regions, but could possibly contain more (in which case a greater number of deletions would be required to yield G 1 and G 2 , contradicting the parsimony criterion). Figure 3 illustrates an example of how reference frames g 1 and g 2 of genomes G 1 and G 2 respectively may arise via inversions and deletions from a (not necessarily most recent) common ancestor A.
Figure 3. An example of frames of reference g 1 and g 2 of G 1 and G 2 arising from an ancestor A with the deletions occurring first, followed by inversions. After the deletions but prior to inversions there are intermediate reference frames g ′ 1 and g ′ 2 of genomes G ′ 1 and G ′ 2 .

The Symmetric Group, the Symmetric Inverse Monoid and their Generalisations
To model the inversion/deletion process and formalise the notion of a distance between genomes we use the machinery of the symmetric group, the symmetric inverse monoid and their generalisations. Throughout we let n = {1, . . . , n} for all positive integers n (where 0 = ∅), let N = {0, 1, . . . } and N + = N \ {0}, and let the restriction of a map f to a subset X of its domain be denoted by f | X .
For a monoid M the inverse of m ∈ M is the unique m −1 ∈ M such that mm −1 m = m and m −1 mm −1 = m −1 . If all elements of M have an inverse in this sense, then M is an inverse monoid. The set of partial permutations from n to itself, which is denoted I n , is an inverse monoid called the symmetric inverse monoid whose identity is the identity map. We will also consider the set I m,n of partial permutations from the set m to the set n for all m, n ∈ N, where if m = n we write I n = I n,n . These partial permutations will be used to represent the relative positions of conserved regions that appear in two circular bacterial genomes, and to represent inversion/deletion operations.
The symmetric inverse category, denoted I, is the (small) category whose objects are the natural numbers and where the set of arrows from m to n is I m,n . For partial permutations f ∈ I m,n and g ∈ I n,p , their composition fg ∈ I m,p is such that, for all i ∈ dom(f) with (i)f ∈ dom(g), we have (i)fg = ((i)f)g, and fg is undefined elsewhere. The diagram of f ∈ I m,n is formed by arranging m vertices labelled by elements of {1, . . . , m} above n vertices labelled by elements of {1, . . . , n} forming two parallel rows of vertices. If (i)f = j then there is an edge connecting i in the upper row with j in the lower row of the diagram (as in Figure 4). Using the diagrams of f and g it is often helpful to view their composition diagrammatically by first associating the vertices in the lower row of the diagram of f with those in the upper row of the diagram of g, forming a graph called the product graph (see Figure 5). If there is a path from i in the upper row of the product graph to j in the lower row then (i)fg = j.
Figure 5. Calculating the product of partial permutations f ∈ I 5,5 and g ∈ I 5,4 .
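The composition of partial permutations can be sketched in a few lines. The example below assumes a partial permutation is stored as a Python dict from points to images, with undefined points simply absent; the maps `f` and `g` are hypothetical and are not those of Figures 4 and 5.

```python
def compose(f, g):
    """Left-to-right product fg: apply f first, then g, wherever both are defined."""
    return {i: g[j] for i, j in f.items() if j in g}


def inverse(f):
    """Inverse of a partial permutation, obtained by reversing every edge of its diagram."""
    return {j: i for i, j in f.items()}


f = {1: 2, 2: 4, 4: 5}   # hypothetical element of I_{5,5}
g = {2: 1, 3: 2, 5: 4}   # hypothetical element of I_{5,4}

print(compose(f, g))     # {1: 1, 4: 4}
```

Point 2 drops out of the product because its image under `f` is 4, which is not in the domain of `g`; in the product graph there is no path starting at 2 that reaches the lower row.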
A partial permutation f ∈ I m,n is said to be order preserving if, for all i, j ∈ dom(f ), we have i < j if and only if (i)f < (j)f . Instances where i < j but (i)f > (j)f are called crossings. The set of order preserving elements of I m,n is denoted by POI m,n . A partial permutation f ∈ I m,n with dom(f ) = {x 1 , . . . , x k }, where x 1 < · · · < x k , is said to be orientation preserving (cf. McAlister (1998); Catarino and Higgins (1999)) if the sequence ((x 1 )f, . . . , (x k )f ) is cyclic, that is, if there is at most one index i such that (x i )f > (x i+1 )f , where x k+1 = x 1 . The set of orientation preserving elements of I m,n is denoted POPI m,n . Order preserving partial permutations will arise when regions common to two genomes appear in the same order reading from position 1 to position n, while orientation preserving partial permutations will arise when these regions appear in the same (clockwise) cyclic order in both genomes.
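The two properties can be tested directly. This is a minimal sketch that again stores partial permutations as dicts and assumes the usual convention from the cited literature that a map is orientation preserving when the sequence of images, read in increasing order of the domain and wrapped around cyclically, has at most one descent. The function names are hypothetical.

```python
def crossings(f):
    """Pairs (i, j) in the domain with i < j but (i)f > (j)f."""
    pts = sorted(f)
    return [(i, j) for a, i in enumerate(pts) for j in pts[a + 1:] if f[i] > f[j]]


def is_order_preserving(f):
    return not crossings(f)


def is_orientation_preserving(f):
    """True when the cyclic sequence of images has at most one descent (assumed definition)."""
    images = [f[i] for i in sorted(f)]
    k = len(images)
    descents = sum(images[a] > images[(a + 1) % k] for a in range(k))
    return descents <= 1


rotation = {1: 3, 2: 4, 3: 1, 4: 2}     # same cyclic order, shifted by two positions
reflection = {1: 1, 2: 4, 3: 3, 4: 2}   # cyclic order reversed

print(is_order_preserving(rotation), is_orientation_preserving(rotation))   # False True
print(is_orientation_preserving(reflection))                                # False
```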

An Algebraic Model of Inversions and Deletions
Given a reference frame of a genome G specified by a bijection λ : R → n, inversions and deletions acting on G are modelled by composing on the right of λ by certain elements of the symmetric inverse category I. For all n ∈ N + let s i;n be the adjacent transposition (i, i + 1) in the symmetric group S n for all 1 ≤ i ≤ n − 1 and, to account for the circular nature of G, we also consider the 2-cycle s n;n = (1, n) since positions 1 and n are adjacent in G. Letting T n = {s i;n : 1 ≤ i ≤ n}, composing on the right of λ by elements of T n will represent an inversion interchanging two adjacent regions in G. Note that the term "inversion" is used to refer to elements of T n as well as the evolutionary operations they represent.
To model deletions, suppose n ≥ 2 and let d i;n be the unique order preserving map in POI n,n−1 with dom(d i;n ) = n \ {i} and im(d i;n ) = {1, . . . , n − 1} (see Figure 6 for an example).
Figure 6. The partial permutation diagram of d 2;5 ∈ I 5,4 . Note that this is still an injective map between the regions that are preserved, with the vertex corresponding to the position of the deleted region having degree 0.
Letting D n = {d i;n : 1 ≤ i ≤ n}, composing on the right of λ by d i;n ∈ D n will represent deleting the region appearing in position i. Composing by a deletion yields a partial permutation from R to n, where a region x is not in the domain if it has been deleted. Note that after we compose on the right by d i;n , for all j > i the region that appeared in position j now appears in position j − 1. For all j < i the region appearing in position j remains in that position. Figure 7 illustrates the corresponding compositions of deletions and inversions yielding the reference frame g 2 from the genome A in Figure 3.
Figure 7. Given the reference frame of A from Figure 3, which is represented by the bijection where a → 1, b → 2 and so on, the deletion of the regions at positions 3, 4, 7 and 12 is given by composing on the right by the (non-unique) term d 12;12 d 7;11 d 4;10 d 3;9 . The subsequent inversion of positions 2 and 3 yielding g 2 is represented by composing on the right by s 2;8 . After these operations a is in position 1, e is in position 2 and so on.
Let G 1 and G 2 be arbitrary genomes and suppose that G 2 can be obtained from G 1 by inversions and deletions (note that we are not considering the most recent common ancestor of G 1 and G 2 here). For a fixed reference pair (λ G 1 , λ G 2 ) ∈ G 1 × G 2 a parsimonious inversion/deletion sequence transforming λ G 1 into λ G 2 , when it exists, corresponds to a minimum length well-defined product u of elements in the set X of all inversions and deletions (the union of the sets T n and D n over all meaningful values of n) such that λ G 1 u = λ G 2 . Given a reference frame λ G of any genome G and a well-defined product u of elements in X, we let the length of u be denoted by ℓ(u). For a fixed reference frame λ G 1 of G 1 the quantity d(λ G 1 , G 2 ) = min{ℓ(u) : λ G 1 u ∈ G 2 } represents the length of a parsimonious inversion/deletion sequence transforming G 1 into G 2 beginning with the reference frame λ G 1 , while the quantity d(G 1 , G 2 ) = min{d(λ G 1 , G 2 ) : λ G 1 ∈ G 1 } is the minimum of these lengths over all reference frames of G 1 .
We now work towards establishing Lemma 4.1 from which it follows, for a fixed reference frame λ G 1 of G 1 , that there exists a reference frame λ G 2 of G 2 and a minimum length inversion/deletion sequence transforming λ G 1 into λ G 2 where the deletions occur first.
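The inversion and deletion maps can be made concrete with a short sketch that replays the operations described in the caption of Figure 7. It assumes partial permutations are stored as dicts and that a reference frame is a dict from region labels to positions; the helper names `s`, `d`, `compose` and `frame_to_word` are hypothetical.

```python
def s(i, n):
    """Inversion s_{i;n}: swap positions i and i+1, or positions n and 1 when i == n."""
    j = i + 1 if i < n else 1
    perm = {p: p for p in range(1, n + 1)}
    perm[i], perm[j] = j, i
    return perm


def d(i, n):
    """Deletion d_{i;n}: position i is dropped and later positions shift down by one."""
    return {p: (p if p < i else p - 1) for p in range(1, n + 1) if p != i}


def compose(f, g):
    """Left-to-right product fg of partial maps stored as dicts."""
    return {x: g[y] for x, y in f.items() if y in g}


def frame_to_word(lam):
    """Read the surviving region labels in order of position."""
    return "".join(sorted(lam, key=lam.get))


# Reference frame of the ancestor: a -> 1, b -> 2, ..., l -> 12.
lam = {r: p for p, r in enumerate("abcdefghijkl", start=1)}
for op in (d(12, 12), d(7, 11), d(4, 10), d(3, 9), s(2, 8)):
    lam = compose(lam, op)

print(frame_to_word(lam))   # aebfhijk
```

After the five operations region a sits in position 1 and region e in position 2, in agreement with the caption.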
We proceed by first defining a digraph ∆ whose paths represent the possible sequences of inversions, deletions, rotations and reflections of a genome that can occur. The digraph ∆ (see Figure 8) has
• vertex set N;
• a directed edge from n to n for each element of T n representing inversions for all n ∈ N + ;
• a directed edge from n + 1 to n for each element of D n+1 representing deletions for all n ∈ N + .
The digraph ∆ also has a directed edge from n to n for all n ∈ N + labelled by c n representing the n-cycle rotation (1, . . . , n) in S n , along with an edge labelled by α n representing a reflection, where α n = (1, n)(2, n − 1) · · · (k, k + 1) if n = 2k and α n = (1, n)(2, n − 1) · · · (k, k + 2) if n = 2k + 1.
Note that the dihedral group D n is generated by {c n , α n }. The free category ∆ * on ∆ contains all words over the alphabet corresponding to paths in ∆ (note that edges may be traversed more than once if possible) that represent sequences of inversions, deletions, rotations and reflections. It can be verified (with the aid of diagrams as in Figure 9 or using the presentation of the symmetric inverse category by East (2020)) that the following relations (R1)-(R14) are satisfied by the corresponding partial permutations in I for all meaningful values of n, subject to stated constraints:
Lemma 4.1. Let G 1 be a circular genome with region set R 1 of size m and let G 2 be a circular genome with region set R 2 of size n where R 2 ⊂ R 1 . Given a fixed reference frame λ 1 : R 1 → m of G 1 suppose that p is a minimum length product corresponding to a path in ∆ * such that λ 1 p ∈ G 2 . There exists a reference frame λ 2 : R 2 → n of G 2 , a product x consisting solely of deletions and a product y consisting solely of inversions (both corresponding to paths in ∆ * ) such that ℓ(xy) = ℓ(p) and λ 1 xy = λ 2 .
Proof. Let p be a minimum length product in I corresponding to a word in ∆ * (which, by abuse of notation, we also denote by p) consisting of inversions and deletions such that λ 1 p ∈ G 2 . Suppose also that p contains at least one deletion. Using the relations (R1)-(R14) it is clear that p is related to a word of the form xyr where x consists solely of deletions, y consists solely of inversions and r consists solely of dihedral symmetries. Since no application of these relations increases word length, it follows that ℓ(xy) ≤ ℓ(xyr) ≤ ℓ(p). Now, if λ 1 p is in G 2 then so too is λ 1 xyr since p and xyr evaluate to the same partial permutation in I. As r consists only of rotations and reflections, it then follows that λ 1 xyrr −1 = λ 1 xy is also in G 2 as the partial permutation corresponding to r −1 is a dihedral group element. The minimality of ℓ(p) together with the fact that ℓ(xy) ≤ ℓ(p) implies that ℓ(xy) = ℓ(p), which completes the proof.
Theorem 4.2. For a fixed reference frame λ G 1 of G 1 there exists a product u of elements in X minimising d(λ G 1 , G 2 ) where the deletions occur first.
Proof. This follows immediately from Lemma 4.1.

4.1. Reconstructing the Most Recent Common Ancestor. Given genomes G 1 and G 2 , candidates for their most recent common ancestor (under the parsimony criterion) are genomes A with region set R 1 ∪ R 2 minimising the sum d(A, G 1 ) + d(A, G 2 ). While it could be the case that there are distinct reference frames λ A 1 and λ A 2 of A such that d(A, G 1 ) = d(λ A 1 , G 1 ) and d(A, G 2 ) = d(λ A 2 , G 2 ) where d(A, G 1 ) + d(A, G 2 ) is minimal, the following theorem establishes the fact that minimum length inversion/deletion sequences yielding G 1 and G 2 can always be thought of as beginning with a fixed reference frame of A.
Theorem 4.3. Let G 1 and G 2 be genomes with region sets R 1 and R 2 respectively and suppose, among genomes with region set R 1 ∪ R 2 , that the genome A has the property that d(A, G 1 ) + d(A, G 2 ) is minimal. There exists a reference frame λ A of A such that d(A, G 1 ) = d(λ A , G 1 ) and d(A, G 2 ) = d(λ A , G 2 ).
Proof. Suppose that λ A 1 y = λ G 1 and λ A 2 z = λ G 2 where λ A 1 and λ A 2 are reference frames of A and where y and z are minimum length sequences of inversions and deletions. Suppose also that w ∈ {α n , c n } * (that is, the set of all words whose letters are in {α n , c n }) is such that λ A 1 w = λ A 2 . Since both α n and c n are dihedral group elements there exist α −1 n and c −1 n such that α n α −1 n and c n c −1 n are the identity map at n ∈ N. As such, there exists a word u ∈ {α n , c n } * such that wu corresponds to the identity map at n and so
λ A 2 uy = λ A 1 wuy = λ A 1 y = λ G 1 . (1)
Using the relations (R8)-(R14) there exists a word v ∈ {α a , c a } * for some a ∈ N + and a word y ′ of inversions and deletions such that uy ∼ y ′ v where ℓ(y ′ ) ≤ ℓ(y), in which case
λ A 2 y ′ v = λ G 1 (2)
by Equation (1). Consider the fact that λ A 2 y ′ vv −1 = λ A 2 y ′ by Equation (2). Since multiplying on the right by v −1 ∈ {α a , c a } * is equivalent to changing the reference frame of λ A 2 y ′ v = λ G 1 , there is thus a sequence of inversions and deletions of length ℓ(y ′ ) such that λ A 2 y ′ ∈ G 1 , which completes the proof since ℓ(y ′ ) ≤ ℓ(y) and y is minimal.
To find the minimal distance d(A, G 1 ) + d(A, G 2 ) given G 1 and G 2 we define a problem which we will refer to as the region alignment problem, and show that if the solution to the region alignment problem is k ∈ N over all reference pairs in G 1 × G 2 then these genomes have arisen minimally in d(A, G 1 ) + d(A, G 2 ) = k + |R 1 ⊖ R 2 | inversions and deletions (where ⊖ denotes the symmetric difference of sets).
To define this problem, begin with G 1 and G 2 where |R 1 | = m and |R 2 | = n. For a reference pair (g 1 , g 2 ) ∈ G 1 × G 2 where g 1 = x 1 · · · x m and g 2 = y 1 · · · y n construct a partial permutation σ g 1 ,g 2 ∈ I m,n where (i)σ g 1 ,g 2 = j if and only if x i = y j . Figure 10 illustrates an example of how σ g 1 ,g 2 is formed.
Figure 10. Forming the partial permutation σ g 1 ,g 2 ∈ I 8,6 given two reference frames g 1 = abcdefgh and g 2 = eibach of G 1 and G 2 with R 1 = {a, b, c, d, e, f, g, h} and R 2 = {a, b, c, e, h, i}. The fact that a appears at position 1 in g 1 and at position 4 in g 2 means (1)σ g 1 ,g 2 = 4.
The elements in m \ dom(σ g 1 ,g 2 ) and n \ im(σ g 1 ,g 2 ) represent the regions in the symmetric difference R 1 ⊖ R 2 that do not appear in both genomes. The crossings of σ g 1 ,g 2 represent the disorder of the labels in R 1 ∩ R 2 , in the sense that if (i, j) is a crossing in σ g 1 ,g 2 then the label x j ∈ R 1 ∩ R 2 appears before x i ∈ R 1 ∩ R 2 in the word g 2 while x j appears after x i in g 1 . If σ g 1 ,g 2 is order preserving then regions in R 1 ∩ R 2 appear in the same order reading from position 1 to position n around the genome.
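The construction of σ g 1 ,g 2 and the quantities read off from it can be sketched as follows, using the reference frames from Figure 10. Reference frames are assumed to be Python strings of distinct single-character labels, and the function names are hypothetical.

```python
def sigma(g1, g2):
    """Partial permutation with (i)sigma = j exactly when g1[i] and g2[j] are the same region."""
    pos2 = {r: j for j, r in enumerate(g2, start=1)}
    return {i: pos2[r] for i, r in enumerate(g1, start=1) if r in pos2}


def crossings(f):
    """Pairs (i, j) in the domain with i < j but (i)f > (j)f."""
    pts = sorted(f)
    return [(i, j) for a, i in enumerate(pts) for j in pts[a + 1:] if f[i] > f[j]]


g1, g2 = "abcdefgh", "eibach"
sig = sigma(g1, g2)

# Positions outside the domain and image correspond to regions in the symmetric difference.
unmatched_1 = [i for i in range(1, len(g1) + 1) if i not in sig]
unmatched_2 = [j for j in range(1, len(g2) + 1) if j not in sig.values()]

print(sig)                        # {1: 4, 2: 3, 3: 5, 5: 1, 8: 6}
print(crossings(sig))             # [(1, 2), (1, 5), (2, 5), (3, 5)]
print(unmatched_1, unmatched_2)   # [4, 6, 7] [2]
```

The crossing (1, 2), for instance, records that b appears before a in g 2 although a appears before b in g 1 .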
Once σ g 1 ,g 2 has been constructed multiplying on the left of σ g 1 ,g 2 by an element of T m represents an inversion acting on the reference frame g 1 while multiplying on the right by an element of T n represents an inversion acting on g 2 . In the region alignment problem we are given a reference pair (g 1 , g 2 ) ∈ G 1 × G 2 and ask for the minimum number of inversions acting on either g 1 or g 2 (or both) to place the regions in R 1 ∩ R 2 in the same (clockwise) cyclic order in both genomes. The region alignment problem is stated mathematically as follows.
Problem 4.4. Let G 1 and G 2 be genomes with region sets R 1 and R 2 respectively where |R 1 | = m and |R 2 | = n. For a reference pair (g 1 , g 2 ) ∈ G 1 ×G 2 , find a sequence t m of elements in T m and a sequence t n of elements in T n minimising ℓ(t m ) + ℓ(t n ) such that t m σ g 1 ,g 2 t n ∈ POPI m,n .
This problem is a generalisation of the problem considered by Egri-Nagy et al. (2014) regarding the minimum inversion distance between two genomes with the same region set, which for a permutation σ ∈ S n , asks for the minimum length sequence t n of elements in T n such that σt n is the identity.
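For very small genomes, Problem 4.4 can be explored by brute force. The sketch below is not the Cayley graph algorithm of Section 5; it is a plain breadth-first search over partial permutations, where each step applies one inversion on the left (acting on g 1 ) or on the right (acting on g 2 ). It assumes the "at most one cyclic descent" characterisation of orientation preserving maps, and all function names are hypothetical.

```python
from collections import deque


def sigma(g1, g2):
    pos2 = {r: j for j, r in enumerate(g2, start=1)}
    return {i: pos2[r] for i, r in enumerate(g1, start=1) if r in pos2}


def is_orientation_preserving(f):
    images = [f[i] for i in sorted(f)]
    k = len(images)
    return sum(images[a] > images[(a + 1) % k] for a in range(k)) <= 1


def neighbours(f, m, n):
    """Partial permutations obtained from f by a single inversion on either side."""
    for i in range(1, m + 1):            # left multiplication by s_{i;m}
        j = i + 1 if i < m else 1
        g = {x: y for x, y in f.items() if x not in (i, j)}
        if i in f:
            g[j] = f[i]
        if j in f:
            g[i] = f[j]
        yield g
    for i in range(1, n + 1):            # right multiplication by s_{i;n}
        j = i + 1 if i < n else 1
        swap = {i: j, j: i}
        yield {x: swap.get(y, y) for x, y in f.items()}


def mu(g1, g2):
    """Minimum number of inversions making sigma orientation preserving (exhaustive search)."""
    m, n = len(g1), len(g2)
    start = sigma(g1, g2)
    seen = {frozenset(start.items())}
    queue = deque([(start, 0)])
    while queue:
        f, dist = queue.popleft()
        if is_orientation_preserving(f):
            return dist
        for g in neighbours(f, m, n):
            key = frozenset(g.items())
            if key not in seen:
                seen.add(key)
                queue.append((g, dist + 1))


print(mu("abcd", "cdab"))   # 0
print(mu("abcd", "adcb"))   # 2
```

The search space grows very quickly with m and n, so this is only useful for checking small cases by hand.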
Theorem 4.5. If G 1 and G 2 are genomes with region sets R 1 and R 2 respectively and µ(g 1 , g 2 ) is the minimum length solution to Problem 4.4 for a fixed reference pair (g 1 , g 2 ) then, under the parsimony criterion, G 1 and G 2 have descended from their most recent common ancestor in ℓ(g 1 , g 2 ) = |R 1 ⊖ R 2 | + min{µ(g 1 , g 2 ) : (g 1 , g 2 ) ∈ G 1 × G 2 } inversions and deletions.
Proof. Let k = min{µ(g 1 , g 2 ) : (g 1 , g 2 ) ∈ G 1 × G 2 }. We begin by showing that d(A, G 1 )+d(A, G 2 ) is bounded below by k +|R 1 ⊖R 2 | over all genomes A with region set R 1 ∪R 2 . To do this, suppose with the aim of obtaining a contradiction that there exists a reference pair (g 1 , g 2 ) ∈ G 1 × G 2 and products t m and t n of elements in T m and T n respectively with t m σ g 1 ,g 2 t n ∈ POPI m,n (i.e. ℓ(t m ) + ℓ(t n ) = k), but where G 1 and G 2 have descended from their most recent common ancestor in strictly less than k + |R 1 ⊖ R 2 | inversions and deletions.
By Theorem 4.2 there exists a genome A with region set R 1 ∪ R 2 minimising d(A, G 1 ) + d(A, G 2 ) where |R 1 ⊖ R 2 | deletions occur first. Further, by Theorem 4.3 there exists a fixed reference frame λ A of A where d(A, G 1 ) = d(λ A , G 1 ) and d(A, G 2 ) = d(λ A , G 2 ) in a minimal sum d(A, G 1 ) + d(A, G 2 ). With these facts in mind and using Figure 3 as a guide, there exists a parsimonious inversion/deletion sequence yielding G 1 that proceeds by first deleting the regions in R 2 \ R 1 from a reference frame of A. This gives rise to a reference frame of an intermediate genome G ′ 1 . Likewise for G 2 , the regions in R 1 \ R 2 are deleted first from A to yield a reference frame of an intermediate genome G ′ 2 . Since G ′ 1 and G ′ 2 have been obtained via deletions from the same reference frame of A, for all reference pairs (g ′ 1 , g ′ 2 ) ∈ G ′ 1 × G ′ 2 the regions in R 1 ∩ R 2 appear in the same clockwise cyclic order in both genomes. Thus, the partial permutation σ g ′ 1 ,g ′ 2 is orientation preserving (that is, σ g ′ 1 ,g ′ 2 ∈ POPI m,n ). Equivalently, there exists (g ′ 1 , g ′ 2 ) ∈ G ′ 1 × G ′ 2 (possibly after rotating one of the genomes) such that σ g ′ 1 ,g ′ 2 is order preserving (that is, σ g ′ 1 ,g ′ 2 ∈ POI m,n ). If G 1 and G 2 subsequently arise by sequences of inversions p and q in T m and T n acting on g ′ 1 and g ′ 2 respectively, then there exists (g 1 , g 2 ) ∈ G 1 × G 2 such that pσ g ′ 1 ,g ′ 2 q = σ g 1 ,g 2 . However, it would then follow that p −1 σ g 1 ,g 2 q −1 is orientation preserving where ℓ(p −1 ) + ℓ(q −1 ) = ℓ(p) + ℓ(q). Since we have σ g ′ 1 ,g ′ 2 = p −1 σ g 1 ,g 2 q −1 with ℓ(p) + ℓ(q) < k, the assumption that k = min{µ(g 1 , g 2 ) : (g 1 , g 2 ) ∈ G 1 × G 2 } is contradicted.
To complete the proof, we show that if k = min{µ(g 1 , g 2 ) : (g 1 , g 2 ) ∈ G 1 × G 2 } then there exists a genome A with region set R 1 ∪ R 2 such that d(A, G 1 ) + d(A, G 2 ) = k + |R 1 ⊖ R 2 |.
Beginning with a reference pair (g 1 , g 2 ) ∈ G 1 × G 2 with µ(g 1 , g 2 ) = k, suppose there exist sequences t m and t n of inversions from T m and T n respectively such that t m σ g 1 ,g 2 t n ∈ POPI m,n (where ℓ(t m ) + ℓ(t n ) = k). Further, suppose that reference frames g ′ 1 and g ′ 2 of G ′ 1 and G ′ 2 are the result of these sequences of inversions acting on g 1 and g 2 respectively. By the circularity of the genomes (performing a rotation if necessary), it may be assumed without loss of generality that the regions in R 1 ∩ R 2 = {r 1 , . . . , r h } appear in the same order reading from 1 to n in both g ′ 1 and g ′ 2 . Let U i be the set of regions appearing between r i and r i+1 in g ′ 2 reading from 1 to n for all 1 ≤ i ≤ h − 1, let U h be the set of regions appearing after r h (up to and including position n) in g ′ 2 and let U 0 be the set of regions appearing before r 1 (from position 1 onward) in g ′ 2 . Beginning with g ′ 1 , form a genome A with region set R 1 ∪ R 2 by first inserting the regions in U 0 before r 1 in g ′ 1 where the minimum element of U 0 is at position 1 in A. If U 0 is empty, then r 1 is in position 1 in A. Next, for all 1 ≤ i ≤ h − 1 insert the regions from U i into g ′ 1 between r i and r i+1 in any way that ensures the regions in U i appear in the same order that they do in g ′ 2 reading from position 1 to n. Finally, insert the regions in U h after r h in any way that ensures their appropriate order reading from 1 to n, where the maximal element of U h is at position n in the resulting genome A. If U h is empty, then r h appears in position n in A. Figure 11 illustrates an example of these insertions.
Given the construction of A, it is easily verified that deleting the regions in R 2 \R 1 from A yields g ′ 1 and that deleting the regions in R 1 \ R 2 from A yields g ′ 2 . The inverses of the sequences t m and t n of inversions then act on g ′ 1 and g ′ 2 respectively to yield g 1 and g 2 in a total of k + |R 1 ⊖R 2 | inversions and deletions, as required.
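The construction of A in the proof can be sketched as a merge of two words. The code below assumes that the intermediate frames are strings of distinct labels whose shared regions already appear in the same left-to-right order, and it makes one particular choice among the allowed insertions: U 0 goes at the very start, each U i goes immediately after r i , and U h goes at the very end. The function name and the example frames are hypothetical.

```python
def build_ancestor(g1p, g2p):
    """Merge two frames whose shared regions appear in the same order into an ancestor frame."""
    set1, set2 = set(g1p), set(g2p)
    shared = [r for r in g1p if r in set2]
    if shared != [r for r in g2p if r in set1]:
        raise ValueError("shared regions must appear in the same order in both frames")
    h = len(shared)

    # Split g2p into the blocks U_0, ..., U_h of regions that are absent from g1p.
    blocks, current = [], []
    for r in g2p:
        if r in set1:
            blocks.append(current)
            current = []
        else:
            current.append(r)
    blocks.append(current)

    ancestor = list(blocks[0])            # U_0 is placed first
    k = 0
    for r in g1p:
        ancestor.append(r)
        if r in set2:
            k += 1
            if k < h:
                ancestor.extend(blocks[k])   # U_k directly after r_k
    if h > 0:
        ancestor.extend(blocks[h])        # U_h is placed last
    return "".join(ancestor)


A = build_ancestor("adbec", "xaybcz")
print(A)   # xaydbecz
```

Deleting x, y and z from the result recovers the first frame, and deleting d and e recovers the second, as the proof requires.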

Exact Algorithm and Complexity
We continue to assume that G 1 and G 2 are genomes with region sets R 1 and R 2 , respectively with |R 1 | = m and |R 2 | = n. In this section we provide an exact algorithm for computing sequences t m in T m and t n in T n from Problem 4.4 such that ℓ(t m ) + ℓ(t n ) is minimised and t m σ g 1 ,g 2 t n ∈ POPI m,n . As previously, given σ g 1 ,g 2 this minimised value is denoted by µ(σ g 1 ,g 2 ). Additionally, we describe the asymptotic time and space complexity of the algorithm, and the limits of its practical applicability on currently available computer hardware.
Figure 11. Given g ′ 1 and g ′ 2 , we form the genome A by inserting regions from R 2 \R 1 into g ′ 1 in their appropriate positions (with respect to elements of R 1 ∩ R 2 ) and appropriate order from 1 to n.
We denote the identity partial permutation on the set X by id X . For the purposes of this section, a graph Γ is a triple (V, E, X) where V is a set whose elements are called the vertices of Γ; X is the set of edge labels of Γ; and E ⊆ V × X × V is the set of edges of Γ.
If S is a semigroup and X is a subset of S, then we define the left Cayley graph of S with respect to X to be the graph with nodes S and edges (s, x, xs) ∈ S × X × S for all s ∈ S and for all x ∈ X; we denote this by Γ L (S, X). The right Cayley graph is defined dually, and is denoted Γ R (S, X). If Γ L (S 1 , X 1 ) and Γ R (S 2 , X 2 ) are the left and right Cayley graphs (respectively) of semigroups S 1 and S 2 with respect to subsets X 1 and X 2 then, given S 1 ⊆ S 2 , we define the union of these graphs to be the graph with nodes S 2 and all of the edges belonging to Γ L (S 1 , X 1 ) and Γ R (S 2 , X 2 ). If Γ = (V, E, X) is any graph and A is a subset of the vertices V of Γ, then the subgraph induced by A is the graph (A, E ∩ (A × X × A), X).
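As a small illustration of these definitions, the following Python sketch builds the left and right Cayley graphs of the symmetric inverse monoid on two points. The encoding is an assumption made for the example: a partial permutation is a tuple whose i-th entry is the image of i, with 0 standing for "undefined", and maps act on the right so that (i)(αβ) = ((i)α)β. All function names are hypothetical.

```python
from itertools import product


def compose(a, b):
    """Return ab, i.e. apply a first and then b; 0 encodes 'undefined'."""
    return tuple(b[x - 1] if x else 0 for x in a)


def left_cayley(S, X):
    """Edges (s, x, xs) for all s in S and x in X."""
    return {(s, x, compose(x, s)) for s in S for x in X}


def right_cayley(S, X):
    """Edges (s, x, sx) for all s in S and x in X."""
    return {(s, x, compose(s, x)) for s in S for x in X}


def induced(edges, A):
    """Edges of the subgraph induced by the vertex subset A."""
    return {(u, x, v) for (u, x, v) in edges if u in A and v in A}


# All partial permutations of {1, 2}: nonzero entries must be distinct.
S = [s for s in product(range(3), repeat=2)
     if len({y for y in s if y}) == sum(1 for y in s if y)]
# Generators: the transposition (1 2) and the partial identity on {1}.
X = [(2, 1), (1, 0)]

left = left_cayley(S, X)
right = right_cayley(S, X)
union = left | right
rank_one = {s for s in S if sum(1 for y in s if y) == 1}
delta = induced(union, rank_one)
```

Here both graphs are taken over the same monoid for brevity; in the text the left graph is built over the smaller semigroup and the union is taken inside the larger one.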
Let σ g 1 ,g 2 ∈ I m,n and suppose without loss of generality that m ≤ n. We consider I m,m and POPI m,n to be embedded in I n,n via an embedding f where (i)σ = j for σ in I m,m or POPI m,n if and only if (σ)f in I n,n maps i to j. The algorithm for determining µ(σ g 1 ,g 2 ) has the following steps:
(i) suppose that σ g 1 ,g 2 ∈ I m,n where | dom(σ g 1 ,g 2 )| = r and m ≤ n;
(ii) let X j denote the generating set for I j,j consisting of T j and id {1,...,j−1} ;
(iii) compute the left Cayley graph Γ L (I m,m , X m ) of I m,m and the right Cayley graph Γ R (I n,n , X n ) of I n,n with respect to the sets X m and X n respectively;
(iv) compute the union Γ m,n of Γ L (I m,m , X m ) and Γ R (I n,n , X n );
(v) compute the set D r = {α ∈ I n,n : | dom(α)| = r} in Γ m,n .
Given that the relation D on I n,n where α D β if and only if | dom(α)| = | dom(β)| is an equivalence relation (called Green's D-relation), the subgraph ∆ n,r induced by D r is strongly connected (in the sense that there is a path in both directions between all pairs of vertices). Paths in this strongly connected component will traverse edges from Γ L (I m,m , X m ) representing inversions from T m acting on the genome G 1 with m regions, and edges from Γ R (I n,n , X n ) representing inversions from T n acting on the genome G 2 with n regions.
(vi) compute the subgraph ∆ n,r of Γ m,n induced by D r ;
(vii) µ(σ g 1 ,g 2 ) is then the minimum distance in ∆ n,r between σ g 1 ,g 2 and any element of POPI m,n in D r .
Note that steps (i) to (vi) need only be computed once for each m,n and r, and the resulting value of ∆ n,r can be memoised.
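The search in step (vii) can be sketched in Python as follows. This is not the precomputed-graph procedure described above: instead of building and memoising ∆ n,r, the sketch explores the same edges on the fly with a breadth-first search. It rests on several assumptions that are not fixed by this section: T j is taken to be the adjacent transpositions (i, i+1) together with the wrap-around transposition (j, 1); membership of POPI is tested by requiring that the images of the domain elements, read in increasing order of the domain, have at most one cyclic descent; and a partial permutation is encoded as a tuple with 0 for "undefined". All names are hypothetical.

```python
from collections import deque


def adjacent_inversions(n):
    """Assumed T_n: transpositions (i, i+1) and the wrap-around (n, 1)."""
    gens = set()
    for i in range(n):
        t = list(range(1, n + 1))
        j = (i + 1) % n
        t[i], t[j] = t[j], t[i]
        gens.add(tuple(t))
    return sorted(gens)


def is_popi(sigma):
    """Assumed test: the image sequence has at most one cyclic descent."""
    seq = [y for y in sigma if y]
    r = len(seq)
    descents = sum(seq[i] > seq[(i + 1) % r] for i in range(r))
    return descents <= 1


def mu(sigma, n):
    """Breadth-first search from sigma to the nearest element passing is_popi.

    sigma is a tuple of length m with entries in 0..n (0 = undefined).
    """
    m = len(sigma)
    t_m, t_n = adjacent_inversions(m), adjacent_inversions(n)
    start = tuple(sigma)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        s, d = queue.popleft()
        if is_popi(s):
            return d
        # Left multiplication: apply the inversion first, then s.
        neighbours = [tuple(s[t[i] - 1] for i in range(m)) for t in t_m]
        # Right multiplication: apply s first, then the inversion.
        neighbours += [tuple(t[y - 1] if y else 0 for y in s) for t in t_n]
        for v in neighbours:
            if v not in seen:
                seen.add(v)
                queue.append((v, d + 1))
    return None
```

Because left and right multiplication by permutations preserves the size of the domain, the search never leaves D r, so no explicit rank filter is needed. The factorial growth discussed in the text applies equally to this on-the-fly variant.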
Steps (i) and (ii) have combined time complexity O(n); step (iii) has time and space complexity $O(|X_n||I_{n,n}|) = O\left(n \sum_{r=0}^{n} \binom{n}{r}^{2} r!\right)$ (using the Froidure-Pin Algorithm described by Froidure and Pin (1997), for example). Hence the time and space complexity for this step is at best O(n!). Steps (iv) and (v) also have time complexity O(|X n ||I n,n |) since the number of vertices and edges in Γ L (I m,m , X m ) and Γ R (I n,n , X n ) is O(|X n ||I n,n |). Hence steps (i) to (vi) overall have time and space complexity at best O(n!).
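To give a feel for this growth, the following short Python sketch evaluates the size of the symmetric inverse monoid on n points, which is the sum over r of the squared binomial coefficient of n and r multiplied by r!. The function name is hypothetical.

```python
from math import comb, factorial


def size_symmetric_inverse_monoid(n):
    """Number of partial permutations of an n-element set."""
    return sum(comb(n, r) ** 2 * factorial(r) for r in range(n + 1))


# Sizes for n = 0, ..., 8, alongside n! for comparison.
sizes = [(n, size_symmetric_inverse_monoid(n), factorial(n)) for n in range(9)]
```

The term with r = n alone contributes n!, so the vertex count of the Cayley graphs is bounded below by n!.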
For step (vii), the distance between any two vertices in a graph can be found in a number of ways. One approach would be to apply the Floyd-Warshall Algorithm to compute the shortest path between every pair of vertices in ∆ n,r ; the time complexity of Floyd-Warshall is O(v 3 ) where v is the number of vertices in the graph. Another approach is to perform a depth or breadth first search. The version implemented by Beule et al. (2022) uses a breadth first search that also utilises the automorphism group of the graph to avoid visiting multiple identical branches. The automorphism groups of the graphs ∆ n,r are non-trivial when r ≠ 0 and this approach seems to offer the best performance; see Table 1. Due to its high complexity the exact algorithm given above is only applicable for relatively small values of n; see Table 2.

Table 2. Time for the various steps in the exact algorithm when applied to every pair of genomes with n and k regions where n ≥ k.
Time for (i) to (vi) (s)      3.880 × 10^−8   1.228 × 10^−6   7.227 × 10^−3   8.556 × 10^−2   2.808 × 10^0   2.595 × 10^2
Time for (vii), total (s)     4.083 × 10^−4   7.594 × 10^−3   1.938 × 10^−1   7.402 × 10^0    ∼ 6 minutes    ∼ 7 hours
Time for (vii), mean (s)      2.669 × 10^−6   2.809 × 10^−6   2.800 × 10^−6   2.956 × 10^−6   ?              ?
To the best of the authors' knowledge, it is not known whether there exists a polynomial time algorithm for Problem 4.4. The problem may well be computationally hard, and so from a practical perspective approximation-based approaches, or variations of the problem, appear to offer the most promise moving forward.
To highlight this, we finish this section by showing that a variation of Problem 4.4, namely whether two genomes of equal size are an equivalent inversion/deletion distance from their most recent common ancestor, is NP-complete. Since genomes of equal size arise from their common ancestor via the same number of deletions, by Theorem 4.5 this problem is equivalent to a problem called balancedsort. Balancedsort takes a partial permutation σ ∈ I m,n and k ∈ N, and asks whether there exist sequences t m and t n of inversions in T m and T n respectively with ℓ(t m ) + ℓ(t n ) ≤ k such that t m σt n ∈ POI m,n and ℓ(t m ) = ℓ(t n ). Note that we may consider POI m,n instead of POPI m,n here as, if there exists (g 1 , g 2 ) ∈ G 1 × G 2 such that t m σ g 1 ,g 2 t n ∈ POPI m,n , then there exists a reference pair (h 1 , h 2 ) ∈ G 1 × G 2 obtained by rotating at least one of the genomes such that t m σ h 1 ,h 2 t n ∈ POI m,n .
Theorem 5.1. Determining whether two bacterial genomes of equal size are an equivalent inversion/deletion distance from their most recent common ancestor is NP-complete.
Proof. We proceed by showing that balancedsort, which is clearly in NP, is NP-complete. Consider an instance of the well known NP-complete problem partition, which consists of a multiset $A = \{a_1, \ldots, a_n\}$ of positive integers and asks if there exists a partition of A into disjoint multisets X and Y such that $\sum_{x \in X} x = \sum_{y \in Y} y$. Construct an instance of balancedsort from an instance of partition by letting $m = n + \sum_{i=1}^{n} a_i$ and by defining a partial permutation $\sigma \in I_m$ to be such that
• $(1)\sigma = a_1 + 1$ and $(a_1 + 1)\sigma = 1$,
• … $a_i$ for all $2 \le j \le n - 1$ and …
The value of k is the sum of all elements in A. Figure 12 illustrates an example of this reduction, which is easily seen to run in polynomial time.
We first show that if there exists a partition of A into X and Y such that $\sum_{x \in X} x = \sum_{y \in Y} y$ then there exist sequences p and q of inversions in $T_m$ with $\ell(p) + \ell(q) \le k$ such that $p\sigma q \in \mathrm{POI}_m$ where $\ell(p) = \ell(q)$. Define, without loss of generality, sequences $C_{a_j}$ of inversions (represented by 2-cycles) for all $a_j \in A$ where … $a_i$ and note that the sequence $C_{a_j}$ removes the crossing consisting of domain elements $j + \sum_{i=1}^{j-1} a_i$ and $j + \sum_{i=1}^{j} a_i$ in a minimal way by left or right multiplication (but not both) without creating additional crossings. Given a partition of A into $X = \{a_{x_1}, \ldots, a_{x_b}\}$ and $Y = \{a_{y_1}, \ldots, a_{y_c}\}$ there are thus sequences $p = C_{a_{x_1}} \cdots C_{a_{x_b}}$ and $q = C_{a_{y_1}} \cdots C_{a_{y_c}}$ of inversions in $T_m$ (where $\ell(p) + \ell(q) = k$ by construction) such that $p\sigma q \in \mathrm{POI}_m$ and where $\ell(p) = \ell(q)$ since $\sum_{x \in X} x = \sum_{y \in Y} y$.
Conversely, suppose that for the constructed instance of balancedsort there exist sequences p and q of inversions in $T_m$ with $\ell(p) + \ell(q) \le k$ such that $p\sigma q \in \mathrm{POI}_m$ where $\ell(p) = \ell(q)$. By construction $\ell(p) + \ell(q) = k$ where $k = \sum_{a \in A} a$. The sequence p can be written in the form $C_{a_{x_1}} \cdots C_{a_{x_b}}$ and the sequence q can be written in the form $C_{a_{y_1}} \cdots C_{a_{y_c}}$ where the sets $\{x_1, \ldots, x_b\}$ and $\{y_1, \ldots, y_c\}$ are disjoint. This is because each crossing is removed minimally by exclusively left or right multiplication of inversions without creating additional crossings. In other words, these sequences determine a partition of A into $X = \{a_{x_1}, \ldots, a_{x_b}\}$ and $Y = \{a_{y_1}, \ldots, a_{y_c}\}$ where $\sum_{x \in X} x = \sum_{y \in Y} y$ follows from the fact that p and q are such that $\ell(p) = \ell(q)$.
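The reduction can be sketched in Python under one explicit assumption: that σ swaps the two endpoints of every crossing, where the j-th crossing consists of the domain elements j + (a 1 + · · · + a j−1 ) and j + (a 1 + · · · + a j ) named in the proof, and is undefined elsewhere. This agrees with the stated images of 1 and a 1 + 1, but the treatment of the remaining crossings is inferred rather than quoted. The function name and the dictionary encoding are hypothetical.

```python
def partition_to_balancedsort(A):
    """Build a balancedsort instance (m, sigma, k) from a partition instance A.

    Assumption: sigma swaps the endpoints of each crossing and is
    undefined on every other point of {1, ..., m}.
    """
    n = len(A)
    m = n + sum(A)
    sigma = {}
    offset = 0  # running sum a_1 + ... + a_{j-1}
    for j, a in enumerate(A, start=1):
        lo = j + offset      # j + sum of a_i for i < j
        hi = lo + a          # j + sum of a_i for i <= j
        sigma[lo] = hi
        sigma[hi] = lo
        offset += a
    k = sum(A)
    return m, sigma, k


m, sigma, k = partition_to_balancedsort([3, 1, 2])
```

The construction makes a single pass over A, consistent with the remark that the reduction runs in polynomial time.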

Discussion
This paper has introduced an algebraic framework for modelling two genome rearrangements, inversion and deletion, that are known to occur through the same biological process, namely site-specific recombination. This framework involves the use of the symmetric inverse monoid, and appears to be the first usage of this type of semigroup model in the study of genome rearrangements. As such a first step, there are on the one hand clear limitations of the model presented, and on the other, clear opportunities for further development.
The most significant limitation involves the scope of the allowable rearrangements. While the model treats a genome as a circular sequence of preserved regions of DNA (a standard way to view genomes in the rearrangement literature), it only permits inversions of adjacent regions, and only permits deletions of a single region at a time. These two simplifying restrictions make the algebra more manageable by restricting the generating sets of the monoids involved. But they are also broadly consistent with each other, since the underlying biological argument behind restricting the length of DNA sequence inverted or deleted is the same, because both arise from the same mechanism. As noted in the Introduction, traditional rearrangement models do not restrict the length of the inverted region, and those that incorporate deletion (such as El-Mabrouk (2000)) allow any length to be deleted, and with equal probability. They also generally allow the opposite operation, insertion, which typically occurs via a different biological mechanism, and so the gains in computational simplicity come at an arguable cost to biological faithfulness, as indeed they do in the present paper.
A natural extension to the model presented here would be to allow longer regions to be inverted and/or deleted, perhaps along the lines attempted in Bhatia et al. (2020), which allows longer inversions in a group-theoretic model, but imposes a cost by length. Indeed, some results here, such as Theorem 4.2, apply regardless of the generating set for S n , or the number of regions being deleted.
Other generalisations may become available as a direct result of the algebraic framework. For instance, the algebraic formalisation using the symmetric inverse monoid can be generalised further by using monoids and categories of binary relations or partial functions. The use of certain binary relations λ : R → n (or partial functions n → R using the convention of positions to regions) allows one to account for repeated region labels, where an ordered pair (r, n) is in λ if and only if the region r appears in position n in a sequence of genome regions. For instance, the sequence r 1 r 2 r 1 r 3 of regions where {r 1 , r 2 , r 3 } ⊆ R would correspond to the relation {(r 1 , 1), (r 2 , 2), (r 1 , 3), (r 3 , 4)}.
Given sets m and n where m, n ∈ N, the set of relations {(x 1 , y 1 ), . . . , (x j , y j )} such that {x 1 , . . . , x j } ⊆ m, y i ∈ n for all 1 ≤ i ≤ j and y i ≠ y j when i ≠ j is denoted by PT m,n , while the set of (analogously defined) partial functions from m → n is denoted PT m,n . One can define a (small) category PT whose objects are the natural numbers, and where the set of arrows from m to n is the set PT m,n under the composition of binary relations. Given a relation λ from R to n described above, we can compose on the right by elements of PT to represent not only inversions and deletions (since PT contains I), but also to represent duplications of regions. To model a duplication we multiply on the right by relations of the form V i;n ∈ PT n,n+1 where, without loss of generality (as in Figure 13), we have V i;n = {(1, 1), . . . , (i, i), (i, i + 1), (i + 1, i + 2), . . . , (n, n + 1)}.
Figure 13. The relation diagram of V 2;5 ∈ PT 5,6 . Note that unlike the deletion in Figure 6 where the degree of the vertex labelled by 2 in the upper row was 0 (to represent the fact the region in position 2 was deleted), the upper row vertex labelled by 2 in this instance has degree 2 to represent the fact that the region in position 2 has been duplicated.
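The effect of composing with a duplication relation can be checked on the example sequence r 1 r 2 r 1 r 3 with a short Python sketch. Relations are encoded here as sets of ordered pairs, and the helper names are hypothetical; the definition of V i;n follows the formula given above.

```python
def duplication(i, n):
    """V_{i;n}: positions 1..i are fixed, positions i..n are shifted by one."""
    fixed = {(p, p) for p in range(1, i + 1)}
    shifted = {(p, p + 1) for p in range(i, n + 1)}
    return fixed | shifted


def compose_relations(lam, rho):
    """Composition of binary relations: apply lam first, then rho."""
    return {(x, z) for (x, y) in lam for (y2, z) in rho if y == y2}


def to_sequence(lam):
    """Read off the region labels in order of position."""
    return [region for region, _ in sorted(lam, key=lambda pair: pair[1])]


# The sequence r1 r2 r1 r3 as a relation between regions and positions.
lam = {("r1", 1), ("r2", 2), ("r1", 3), ("r3", 4)}
duplicated = compose_relations(lam, duplication(2, 4))
```

Position 2 is related to both 2 and 3 in `duplication(2, 4)`, so the region in position 2 appears twice in the composite, matching the degree-2 vertex described in the caption of Figure 13.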
With this algebraic framework in mind, it is possible to consider the new problem of constructing the most recent common ancestor of two bacterial genomes (which may have repeated regions) under the three operations of inversions, deletions and duplications. Since the problem of reconstructing the most recent common ancestor of two genomes under exclusively inversions and deletions is a special case of this new problem, the same asymmetry present in the inversion/deletion model is also present in the inversion/deletion/duplication model given that only pre-existing genome regions may be duplicated. It is then natural to investigate whether similar combinatorial optimisation problems regarding elements of PT have analogous interpretations to those presented here, such as Problem 4.4.
Finally, it would be interesting to explore whether the framework developed here could be cast in the representation-theoretic framework designed for maximum likelihood estimates for genome rearrangement models (Serdoz et al., 2017), that is presented in ; Terauds and Sumner (2022). Indeed, on one hand, Terauds and Sumner (2022) remark that it may be generalised to models using semigroups, while on the other hand, the representation theory of finite monoids including that of the symmetric inverse monoid has been well studied (Steinberg et al., 2016;Munn, 1964;Solomon, 2002).

Declarations
Andrew Francis was partially supported by Australian Research Council Discovery Project DP180102215. Data sharing is not applicable to this article as no datasets were generated or analysed. The authors have no competing interests to declare that are relevant to the content of this article.