# New Genome Similarity Measures Based on Conserved Gene Adjacencies

## Abstract

Many important questions in molecular biology, evolution and biomedicine can be addressed by comparative genomics approaches. One of the basic tasks when comparing genomes is the definition of measures of similarity (or dissimilarity) between two genomes, for example to elucidate the phylogenetic relationships between species.

The power of different genome comparison methods varies with the underlying formal model of a genome. The simplest models impose the strong restriction that each genome under study must contain the same genes, each in exactly one copy. More realistic models allow several copies of a gene in a genome. One speaks of gene families, and comparative genomics methods that allow this kind of input are called *gene family-based*. The most powerful – but also most complex – models avoid this preprocessing of the input data and instead integrate the family assignment within the comparative analysis. Such methods are called *gene family-free*.

In this paper, we study an intermediate approach between family-based and family-free genomic similarity measures. The model, called *gene connections*, is on the one hand more flexible than the family-based model, on the other hand the resulting data structure is less complex than in the family-free approach. This intermediate status allows us to achieve results comparable to those for family-free methods, but at running times similar to those for the family-based approach.

Within the gene connection model, we define three variants of genomic similarity measures that differ in their expressive power. We give polynomial-time algorithms for two of them, while we show NP-hardness of the third, most powerful one. We also generalize the measures and algorithms to make them more robust against recent local disruptions in gene order. Our theoretical findings are supported by experimental results, demonstrating the applicability and performance of our newly defined similarity measures.

## 1 Introduction

Many important questions in molecular biology, evolution and biomedicine can be addressed by comparative genomics approaches. One of the basic tasks in this area is the definition of measures of similarity between two genomes. Direct applications of such measures are the computation of phylogenetic trees or the reconstruction of ancestral genomes, but also more indirect tasks like the prediction of orthologous gene pairs (derived from the same ancestor gene through speciation) or the transfer of gene function across species profit immensely from accurate genome comparison methods.

Indeed, over the past forty-or-so years, many methods have been proposed to quantify the similarity of single genes, mostly based on pairwise or multiple sequence alignments. However, in many situations similarity measures based on whole genomes are more meaningful than gene-based measures, because they give a more representative picture and are more robust against side effects such as horizontal gene transfer. Therefore, in this paper we develop and analyze methods for whole genome comparison, based on the physical structure (gene order) of the genomes.

The simplest picture of a genome is one where in a set of genomes under study orthologous genes have been identified beforehand, and only groups of orthologous genes (also known as *gene families*) are considered that have exactly one member in each genome. In this model, a variety of genomic similarity (or distance) measures have been studied and are relatively easy to compute [1, 2, 3, 4]. However, the singleton gene family is a great oversimplification compared to what we find in nature. Therefore, more general models have been devised where several genes from the same family can exist in one genome. The computation of genomic similarities in these cases is generally much more difficult, though. In fact, many problem variants are NP-hard [5, 6, 7, 8, 9].

Another biological inaccuracy arises from the fact that a gene family assignment is not always without dispute, because orthology is usually not known but just predicted, and most prediction methods require some arbitrary threshold, deciding when two genes belong to the same family and when not. Therefore *gene family-free* measures have recently been proposed, based on pairwise similarities between genes [10, 11, 12, 13]. While the resulting similarity measures are very promising, their computation is usually not easier than for the family-based models and therefore NP-hard as well [10, 13].

In this paper, we study an intermediate approach between family-based and family-free genomic similarity measures, *gene connections*. It requires some preprocessing of the genes contained in the genomes under study, but in a less stringent way than in the family-based approach. On the other hand, the resulting data structure is less complex than in the family-free approach, where arbitrary (real-valued) similarities between genes are considered. This intermediate status allows us to achieve results comparable to those for family-free methods, but at time complexities similar to those for the family-based approach.

The paper is structured as follows. We first define three new genome similarity measures based on conserved gene adjacencies (Sect. 2), followed by some pointers to related literature (Sect. 3). Each of the three following sections is then devoted to one of the similarity measures. We show that the first problem can be computed in polynomial time, but is biologically quite simplistic. The second one, while avoiding some of the weaknesses of the first, is NP-hard to compute and can therefore not be applied for genomes of realistic size. The third measure, finally, provides a compromise between biological relevance and computational complexity. In Sect. 7 we compare the results obtained with our similarity measures experimentally, using a large data set of plant (rosid) genomes. The last section concludes the paper.

The implemented algorithms used in this work as well as the studied dataset are available for download from http://bibiserv.cebitec.uni-bielefeld.de/newdist.

## 2 Basic Definitions

An *alphabet* is a finite set of *characters*. A *string* over an alphabet \(\mathcal {A}\) is a sequence of characters from \(\mathcal {A}\). Given a string *S*, *S*[*i*] refers to the *i*th character of *S* and |*S*| is the *length* of *S*, i.e., the number of characters in *S*. In a *signed string* *S*, each character is labeled with a sign, denoted \( sgn _S(i)\) for the character at index position *i*. A sign is either positive (\(+\)) or negative (\(-\)). In comparative genomics, for example, the signs may indicate the orientations of genes on their genomic sequences, which themselves are represented as strings. Therefore in this paper we use the term *gene* as a synonym for “signed character” and the term *genome* as a synonym for “signed string”.

**Definition 1**

**(gene connection graph).** Given two genomes *S* and *T*, a *gene connection graph* *G*(*S*, *T*) of *S* and *T* is a bipartite graph with one vertex for each gene of *S* and one vertex for each gene of *T*. An edge between two vertices, one from *S* and one from *T*, indicates that there is some *connection* between the two genes represented by these vertices.

The term *connection* in the above definition is not very specific. Depending on the data set and context, connections may be defined based on gene homology, sequence similarity, functional relatedness, or any other similarity measure between genes.

For ease of notation, we let *S*[*i*] denote both the *i*th gene of genome *S* and the vertex of *G* representing this gene. The same holds for *T*[*j*]. The set of edges of a graph *G* is denoted by *E*(*G*). The size of a graph *G* is the number of its edges, \(|G| = |E(G)|\). Further, we define a *connection function* *t* that returns for an index position *i* of *S* the list *t*(*i*) of index positions in *T* that are connected to *S*[*i*] by an edge in *G*(*S*, *T*). That is, \(t(i) = [j \mid (i,j) \in E(G(S,T)) \; \text {for} \; 1 \le j \le |T|]\). The function *s*(*j*) for an index position of *T* is defined analogously.
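The connection functions can be illustrated with a minimal Python sketch. It assumes 0-based index positions and a gene connection graph given as a collection of pairs (*i*, *j*); the name `connection_functions` is ours and only serves this illustration.

```python
from collections import defaultdict


def connection_functions(edges):
    """Return (t, s): t[i] lists the positions of T connected to S[i],
    s[j] lists the positions of S connected to T[j]."""
    t = defaultdict(list)
    s = defaultdict(list)
    # Visiting the edges in sorted order leaves every list sorted increasingly.
    for i, j in sorted(edges):
        t[i].append(j)
        s[j].append(i)
    return t, s


# Hypothetical toy graph (0-based positions).
edges = {(0, 1), (0, 0), (1, 2), (2, 2)}
t, s = connection_functions(edges)
print(t[0])  # [0, 1]
print(s[2])  # [1, 2]
```

A gene without any connection simply maps to an empty list.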

A pair of adjacent index positions \((i,i')\) with \(i' = i+1\) in a string is called an *adjacency*. Note that this definition of adjacency only considers direct neighborhood of genes (\(i' = i+1\)), while all our following uses of this term refer to an extended definition given by Zhu *et al.* [14], who introduced *generalized gene adjacencies* as follows:

**Definition 2**

**(adjacency).** Given an integer \(\theta \ge 1\), a pair of index positions \((i,i')\) with \(i < i' \le i+\theta \) in a string is a (\(\theta \)-) *adjacency*.

In other words, two genes of the same genome form a \(\theta \)-adjacency if the number of genes between them is less than \(\theta \). In the following we will frequently differentiate between *simple adjacencies* (\(\theta =1\)) and *generalized adjacencies* (\(\theta \ge 1\)).

As mentioned in the Introduction, in this paper we are interested in defining measures of similarity to compare pairs of genomes. A simple approach is based on their number of *conserved adjacencies*. Although below we will study different variants of similarities, they all use the following basic notion of conserved adjacency:

**Definition 3**

**(conserved adjacency).** Given two genomes *S* and *T* and a gene connection graph *G*(*S*, *T*), a pair of adjacencies \((i,i')\) in *S* and \((j,j')\) in *T* is called a *conserved adjacency*, denoted \((i,i' || j,j')\), if one of the following two holds:

- (a) \((i,j) \in E(G(S,T))\), \((i',j') \in E(G(S,T))\), \( sgn _S(i) = sgn _T(j)\) and \( sgn _S(i') = sgn _T(j')\); or

- (b) \((i,j') \in E(G(S,T))\), \((i',j) \in E(G(S,T))\), \( sgn _S(i) \ne sgn _T(j')\) and \( sgn _S(i') \ne sgn _T(j)\).
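The two conditions of the conserved adjacency definition can be checked directly. The following Python sketch is only an illustration under our own assumptions: positions are 0-based, signs are stored as strings over `'+'` and `'-'`, edges form a set of pairs (*i*, *j*), a \(\theta \)-adjacency requires \(i < i'\), and the function names are hypothetical.

```python
def is_adjacency(i, i2, theta=1):
    """(i, i2) is a theta-adjacency if i < i2 <= i + theta (assumed reading)."""
    return i < i2 <= i + theta


def is_conserved(i, i2, j, j2, sgn_S, sgn_T, edges, theta=1):
    """Check whether (i, i2) in S and (j, j2) in T form a conserved adjacency."""
    if not (is_adjacency(i, i2, theta) and is_adjacency(j, j2, theta)):
        return False
    # Condition (a): parallel edges, signs agree at both ends.
    parallel = ((i, j) in edges and (i2, j2) in edges
                and sgn_S[i] == sgn_T[j] and sgn_S[i2] == sgn_T[j2])
    # Condition (b): crossing edges, signs differ at both ends.
    crossing = ((i, j2) in edges and (i2, j) in edges
                and sgn_S[i] != sgn_T[j2] and sgn_S[i2] != sgn_T[j])
    return parallel or crossing


# Two genes in the same orientation connected by parallel edges.
print(is_conserved(0, 1, 0, 1, "+-", "+-", {(0, 0), (1, 1)}))  # True
# Two genes whose order and orientation are both reversed in T.
print(is_conserved(0, 1, 0, 1, "++", "--", {(0, 1), (1, 0)}))  # True
```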

We further denote two conserved adjacencies as *conflicting* if their intervals in either genome are overlapping:

**Definition 4**

**(conflicting conserved adjacencies).** Two conserved adjacencies \((i,i' || j,j')\) and \((k,k' || l,l')\) are *conflicting* if (1) \((i, i' || j, j') \ne (k, k' || l, l')\) and (2) \([i,i'-1] \cap [k,k'-1] \ne \emptyset \) or \([j,j'-1] \cap [l,l'-1] \ne \emptyset \).

Subsequently a set of conserved adjacencies is denoted as *non-conflicting* if the above-defined property does not hold between any two of its members.
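The conflict test amounts to two interval intersections. A minimal Python sketch follows, with conserved adjacencies represented as 4-tuples \((i, i', j, j')\); the representation and the function names are our own choices for illustration.

```python
def intervals_overlap(a, a2, b, b2):
    """True if the integer intervals [a, a2-1] and [b, b2-1] intersect."""
    return max(a, b) <= min(a2, b2) - 1


def conflicting(c1, c2):
    """Two distinct conserved adjacencies conflict if their intervals
    overlap in S or in T."""
    if c1 == c2:
        return False
    (i, i2, j, j2), (k, k2, l, l2) = c1, c2
    return intervals_overlap(i, i2, k, k2) or intervals_overlap(j, j2, l, l2)


# The intervals [4, 5] and [5, 6] in T share position 5.
print(conflicting((3, 4, 4, 6), (5, 6, 5, 7)))  # True
# Consecutive simple adjacencies share a gene but do not conflict.
print(conflicting((1, 2, 1, 2), (2, 3, 2, 3)))  # False
```

Note that for simple adjacencies the interval \([i,i'-1]\) consists of the single position *i*, so two conserved simple adjacencies conflict only if they use the same adjacency of *S* or of *T*.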

In the example of Fig. 1, (3, 4||4, 6) and (5, 6||5, 7) are the only conflicting conserved adjacencies. All other pairs are non-conflicting.

The different similarity measures that we consider in this work are expressed by the following three problem statements:

*Problem 1*

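*(total adjacency model).* Given two genomes *S* and *T* and a gene connection graph *G*(*S*, *T*), count the number of pairs of index positions \((i,i')\) in *S* and \((j,j')\) in *T* that form a conserved adjacency. In other words, compute \(\left| \{ (i,i' || j,j') \mid (i,i') \text { is an adjacency in } S \text { and } (j,j') \text { is an adjacency in } T \} \right| \), the number of conserved adjacencies between *S* and *T*.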

Because a gene connection graph *G*(*S*, *T*) is not limited to one-to-one connections between genes of genomes *S* and *T*, solutions to Problem 1 may biologically not be very plausible. Therefore we define a second measure, motivated by the one used in [10, 11], which asks for one-to-one correspondences between genes of *S* and *T* in its solutions:

*Problem 2*

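*(gene matching model).* Given two genomes *S* and *T*, a gene connection graph *G*(*S*, *T*) and a real-valued parameter \(\alpha \in [0,1]\), find a bipartite matching *M* in *G*(*S*, *T*) such that the induced sequences \(S^M\) and \(T^M\) maximize the measure \(\mathcal {F}_\alpha (M) = \alpha \cdot adj (M) + (1-\alpha ) \cdot edg (M)\), where \( adj (M)\) is the number of conserved adjacencies between \(S^M\) and \(T^M\) and \( edg (M)\) is the number of edges in the matching *M*. (The induced sequences \(S^M\) and \(T^M\) are the subsequences of *S* and *T*, respectively, that contain those characters incident to edges of *M*.)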
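To make the gene matching model concrete, the following Python sketch evaluates the objective for a given matching and simple adjacencies. It assumes the measure has the form \(\mathcal {F}_\alpha (M) = \alpha \cdot adj (M) + (1-\alpha ) \cdot edg (M)\) with \( edg (M) = |M|\), which is consistent with its use in Sect. 5. Positions are 0-based, signs are strings over `'+'` and `'-'`, and the name `f_alpha` is hypothetical. The sketch only scores a matching; it does not search for an optimal one.

```python
def f_alpha(matching, sgn_S, sgn_T, alpha):
    """Score a matching (set of pairs (i, j), each position used at most once)."""
    s_pos = sorted(i for i, _ in matching)          # induced sequence S^M
    t_pos = sorted(j for _, j in matching)          # induced sequence T^M
    rank_T = {j: r for r, j in enumerate(t_pos)}    # position within T^M
    partner = dict(matching)                        # S position -> T position
    adj = 0
    # Consecutive matched genes of S are the simple adjacencies of S^M.
    for i, i2 in zip(s_pos, s_pos[1:]):
        j, j2 = partner[i], partner[i2]
        same = sgn_S[i] == sgn_T[j] and sgn_S[i2] == sgn_T[j2]
        diff = sgn_S[i] != sgn_T[j] and sgn_S[i2] != sgn_T[j2]
        if rank_T[j2] == rank_T[j] + 1 and same:
            adj += 1   # partners adjacent in the same order, signs agree
        elif rank_T[j2] == rank_T[j] - 1 and diff:
            adj += 1   # partners adjacent in reverse order, signs differ
    return alpha * adj + (1 - alpha) * len(matching)


# The unmatched middle gene of S is skipped in the induced sequence.
print(f_alpha({(0, 0), (2, 1)}, "+++", "++", 0.5))  # 1.5
```

For \(\alpha = 0\) the score equals the number of matched gene pairs, so adjacencies play no role.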

As we will see later in this paper, solving Problem 2 is NP-hard even for simple adjacencies. Therefore we define a third, intermediate measure, which is more efficient to compute in practice, while producing one-to-one correspondences between gene extremities. It is defined as the size of the largest subset of non-conflicting conserved adjacencies found in a pair of genomes:

*Problem 3*

*(adjacency matching model).* Given two genomes *S* and *T* and a gene connection graph *G*(*S*, *T*), let *C* be the set of conserved adjacencies between *S* and *T*. Compute the size \(|C^\star |\) of a maximum cardinality set of non-conflicting conserved adjacencies \(C^\star \subseteq C\).

## 3 Related Work

As mentioned above, the *gene connection graph* input format that we propose here is an intermediate between gene families and the family-free model. Indeed, we do not require the gene connection graph to be transitive, which is the main difference to the *gene family graph*, where vertices are assigned to genes and edges are drawn between genes from different genomes whenever they belong to the same family, thus forming bipartite cliques. (This graph has not been introduced under this name in the literature, but is implicitly mentioned already in [15] and later more explicitly in [10].) On the other end, the *gene similarity graph* [11] is a weighted version of the gene connection graph, increasing the expressive power by its ability to represent different strengths of gene connections.

The only previous use of such an intermediate model in comparative genomics that we are aware of is in the form of *indeterminate strings* in [12].

**Definition 5**

**(indeterminate string, signed indeterminate string).** Given an alphabet \(\mathcal {A}\), a string *S* over the power set \(\mathcal {P}(\mathcal {A}) {\setminus } \{\emptyset \}\) is called an *indeterminate string* over \(\mathcal {A}\). In other words, for \(1\le i\le n\), \(\emptyset \ne S[i] \subseteq \mathcal {A}\). In a *signed indeterminate string* *S*, any index position *i* has a sign \( sgn _S(i)\), which therefore is the same for all characters at that position.

Given two genomes *S* and *T* and a gene connection graph *G*(*S*, *T*), it is easy to create a pair of signed indeterminate strings \(S'\) and \(T'\) over an alphabet \(\mathcal {A}'\) that contain the same set of conserved adjacencies as *S* and *T*: For any edge \(e = (S[i],T[j])\) of *G*(*S*, *T*), create one symbol \(e' \in \mathcal {A}'\) and let \(e' \in S'[i]\) and \(e' \in T'[j]\). The signs are just transferred from *S* and *T* to \(S'\) and \(T'\), respectively: \( sgn _{S'}[i] = sgn _S[i]\) for all *i*, \(1 \le i \le |S|\), and \( sgn _{T'}[j] = sgn _T[j]\) for all *j*, \(1 \le j \le |T|\).

Conversely, given two indeterminate strings \(S'\) and \(T'\), we can easily create sequences *S* and *T* and the corresponding gene connection graph with the same set of conserved adjacencies. In order to do this, let \(\mathcal {A} = \{1,2,\ldots ,|S'|, 1', 2', \ldots ,|T'|'\}\), set \(S = sgn _{S'[1]} 1,\ldots , sgn _{S'[|S'|]} |S'|\), \(T = sgn _{T'[1]} 1',\ldots , sgn _{T'[|T'|]} |T'|'\), and create in *G*(*S*, *T*) an edge \(e = (S[i],T[j])\) whenever \(S'[i] \cap T'[j] \not = \emptyset \).
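The two conversions can be sketched in a few lines of Python. The sketch ignores signs (they are copied unchanged), uses 0-based positions, and represents an indeterminate string as a list of sets. Giving an unconnected gene a private symbol, so that no position holds the empty set, is our own assumption; the function names are hypothetical.

```python
def graph_to_indeterminate(n_S, n_T, edges):
    """One fresh symbol per edge, added to both of its end positions."""
    S_ind = [set() for _ in range(n_S)]
    T_ind = [set() for _ in range(n_T)]
    for symbol, (i, j) in enumerate(sorted(edges)):
        S_ind[i].add(symbol)
        T_ind[j].add(symbol)
    # Assumption: unconnected genes get a private symbol to avoid empty sets.
    next_symbol = len(edges)
    for string in (S_ind, T_ind):
        for symbols in string:
            if not symbols:
                symbols.add(next_symbol)
                next_symbol += 1
    return S_ind, T_ind


def indeterminate_to_graph(S_ind, T_ind):
    """Connect positions i and j whenever their symbol sets intersect."""
    return {(i, j)
            for i, a in enumerate(S_ind)
            for j, b in enumerate(T_ind)
            if a & b}


edges = {(0, 0), (0, 1), (2, 1)}
S_ind, T_ind = graph_to_indeterminate(3, 2, edges)
print(indeterminate_to_graph(S_ind, T_ind) == edges)  # True
```

The round trip returns the original edge set, which is why both representations carry the same conserved adjacencies. The reverse conversion shown here compares all pairs of positions and is meant for clarity, not efficiency.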

Clearly, all the information about conserved adjacencies between these two representations is identical, while sometimes the graph representation and sometimes the representation as signed indeterminate string is more concise.

Indeterminate strings in [12] were used to identify regions of common gene content (*gene clusters*) in two genomes, which is important in functional genomics. Here our focus is on conserved adjacencies (which can be seen as small clusters of just two genes) for defining whole-genome similarities. Similar measures are known for singleton gene families as the *breakpoint distance* [16, 17], have been extended to gene families in [5, 7, 15] and were defined for the family-free model in [10].

## 4 An Optimal Solution for Problem 1

In order to solve Problem 1, we construct a list *L* of edges of *G*(*S*, *T*) using connection function *t*(*i*) for \(1 \le i \le |S|\). In doing so, we assume that the elements of *t*(*i*), \(1 \le i \le |S|\), are sorted in increasing order. If this is not given as input, it can always be achieved by applying counting sort to all lists *t*(*i*) in overall \(O(|S| + |T| + |G(S,T)|)\) time, which is proportional to the input size.
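One way to obtain the sorted lists in linear time is a single bucketing pass, sketched below in Python. This is an illustration in the spirit of counting sort rather than the authors' code; positions are 0-based and the function names are hypothetical.

```python
def sorted_connection_lists(n_S, n_T, edges):
    """Return t with every t[i] sorted increasingly,
    using O(|S| + |T| + |E|) time and no comparison sort."""
    s = [[] for _ in range(n_T)]
    for i, j in edges:            # bucket the edges by their position in T
        s[j].append(i)
    t = [[] for _ in range(n_S)]
    for j in range(n_T):          # visit positions of T in increasing order
        for i in s[j]:
            t[i].append(j)        # so each t[i] receives its entries sorted
    return t


def edge_list(t):
    """Concatenate the lists t[0], t[1], ... into the sorted edge list L."""
    return [(i, j) for i, js in enumerate(t) for j in js]


edges = [(1, 2), (0, 1), (1, 0), (0, 0)]
t = sorted_connection_lists(2, 3, edges)
print(t)             # [[0, 1], [0, 2]]
print(edge_list(t))  # [(0, 0), (0, 1), (1, 0), (1, 2)]
```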

Algorithm 1 maintains three pointers *e*, \(e'\), \(e''\) into list *L*. These pointers simultaneously traverse *L* while reporting any pair of adjacent parallel edges \((e,e')\) or crossing edges \((e,e'')\).

*Correctness.* Given a pair \((i,j) \in L\), there are overall four cases for the signs of index *i* in *S* and index *j* in *T*, each with two sub-cases for the signs of index \(i+1\) in *S* and index \(j+1\) or index \(j-1\) in *T*, listed in the following.

- (1) If \( sgn _S(i) = +\) and \( sgn _T(j) = +\), then we have a conserved adjacency \((i,i+1 || j,j+1)\) if and only if \((i+1,j+1) \in L\) and either \( sgn _S(i+1) = +\) and \( sgn _T(j+1) = +\) or \( sgn _S(i+1) = -\) and \( sgn _T(j+1) = -\).

- (2) If \( sgn _S(i) = +\) and \( sgn _T(j) = -\), then we have a conserved adjacency \((i,i+1 || j-1,j)\) if and only if \((i+1,j-1) \in L\) and either \( sgn _S(i+1) = +\) and \( sgn _T(j-1) = -\) or \( sgn _S(i+1) = -\) and \( sgn _T(j-1) = +\).

- (3) If \( sgn _S(i) = -\) and \( sgn _T(j) = +\), then we have a conserved adjacency \((i,i+1 || j-1,j)\) if and only if \((i+1,j-1) \in L\) and either \( sgn _S(i+1) = -\) and \( sgn _T(j-1) = +\) or \( sgn _S(i+1) = +\) and \( sgn _T(j-1) = -\).

- (4) If \( sgn _S(i) = -\) and \( sgn _T(j) = -\), then we have a conserved adjacency \((i,i+1 || j,j+1)\) if and only if \((i+1,j+1) \in L\) and either \( sgn _S(i+1) = -\) and \( sgn _T(j+1) = -\) or \( sgn _S(i+1) = +\) and \( sgn _T(j+1) = +\).

Clearly, cases 1 and 4 as well as cases 2 and 3 can be merged into the two cases given in Algorithm 1.
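The two merged cases can be written down compactly. The Python sketch below is not Algorithm 1 itself: instead of three pointers moving through the sorted list *L*, it uses hash-set lookups, which gives expected linear time and is easier to read. Positions are 0-based, signs are strings over `'+'` and `'-'`, and the function name is hypothetical.

```python
def simple_conserved_adjacencies(sgn_S, sgn_T, edges):
    """Return the set of conserved simple adjacencies as tuples (i, i+1, j, j+1)."""
    E = set(edges)
    found = set()
    for i, j in E:
        if sgn_S[i] == sgn_T[j]:
            # Cases 1 and 4: parallel edge (i+1, j+1) whose signs agree as well.
            if (i + 1, j + 1) in E and sgn_S[i + 1] == sgn_T[j + 1]:
                found.add((i, i + 1, j, j + 1))
        else:
            # Cases 2 and 3: crossing edge (i+1, j-1) whose signs differ as well.
            if (i + 1, j - 1) in E and sgn_S[i + 1] != sgn_T[j - 1]:
                found.add((i, i + 1, j - 1, j))
    return found


print(simple_conserved_adjacencies("++", "++", {(0, 0), (1, 1)}))  # {(0, 1, 0, 1)}
print(simple_conserved_adjacencies("++", "--", {(0, 1), (1, 0)}))  # {(0, 1, 0, 1)}
```

Collecting the results in a set means that a pair of adjacencies supported by both a parallel and a crossing edge pair is counted once; the size of the returned set is then the value asked for in Problem 1.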

*Runtime Analysis.* The list *L* has length |*G*(*S*, *T*)| and can be constructed and sorted in linear time \(O(|S| + |T| + |G(S,T)|)\), as discussed above. Each of the three edge pointers *e*, \(e'\) and \(e''\) traverses *L* once from the beginning to the end, so that the **for** loop in lines 3–19 takes *O*(|*L*|) time. Therefore the overall running time is \(O(|S| + |T| + |G(S,T)|)\).

*Space Analysis.* The algorithm needs space only for the two input strings *S* and *T*, the list *L* and some constant-space variables. Therefore the space usage is of order \(O(|S|+|T|+|G(S,T)|)\).

*Extension to Generalized Adjacencies.* Algorithm 1’ solves Problem 1 for generalized adjacencies. Following the same strategy as Algorithm 1, the extension requires next to the main pointer *e* additional \(2\theta \) pointers into list *L* that are denoted \(e'_t\) and \(e''_t\), \(1 \le t \le \theta \). While it traverses through each element (*i*, *j*) in the list using pointer *e*, each pointer \(e'_t\), \(1 \le t \le \theta \), is subsequently increased to point to the smallest element larger than or equal to \((i+t, j+1)\) in *L*. A copy \(\hat{e}\) of pointer \(e'_t\) is then used to find candidates \((i+t, j+1), \dots , (i+t, j+\theta )\). Likewise, pointers \(e''_t\), \(1 \le t \le \theta \), are incremented to the smallest element larger than or equal to \((i+t,j-\theta )\), whereupon copy \(\hat{e}\) of \(e''_t\) is used to find candidates \((i+t, j-\theta ), \dots , (i+t, j-1)\).

The pointers *e*, \(e'_t\), and \(e''_t\), \(1 \le t \le \theta \), are continuously increased, thus each traversing *L* once. Any instance of pointer \(\hat{e}\) visits at most \(\theta \) elements in each iteration, thus leading to an overall running time of \(O(\theta ^2 |G(S, T)|)\). The running time is asymptotically optimal in the sense of worst case analysis, since there can be just as many \(\theta \)-adjacencies in graph *G*(*S*, *T*). Algorithm 1’ requires \(O(\theta + |S| + |T|+\theta ^2|G(S, T)|)\) space.
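The \(\theta ^2\) factor is easy to see in a direct enumeration: every edge is combined with up to \(\theta \) offsets in *S* and up to \(\theta \) offsets in *T*. The Python sketch below replaces the pointers of Algorithm 1’ by hash-set lookups, so it is an illustration under that substitution, not the algorithm itself. Positions are 0-based and the function name is hypothetical.

```python
def conserved_adjacencies(sgn_S, sgn_T, edges, theta=1):
    """Return all conserved theta-adjacencies as tuples (i, i2, j, j2)."""
    E = set(edges)
    found = set()
    for i, j in E:
        for di in range(1, theta + 1):
            i2 = i + di
            for dj in range(1, theta + 1):
                # Parallel partner edge (i2, j + dj), signs agree at both ends.
                j2 = j + dj
                if ((i2, j2) in E and sgn_S[i] == sgn_T[j]
                        and sgn_S[i2] == sgn_T[j2]):
                    found.add((i, i2, j, j2))
                # Crossing partner edge (i2, j - dj), signs differ at both ends.
                j0 = j - dj
                if ((i2, j0) in E and sgn_S[i] != sgn_T[j]
                        and sgn_S[i2] != sgn_T[j0]):
                    found.add((i, i2, j0, j))
    return found


edges = {(0, 0), (2, 1)}
print(conserved_adjacencies("+++", "++", edges, theta=1))  # set()
print(conserved_adjacencies("+++", "++", edges, theta=2))  # {(0, 2, 0, 1)}
```

In the toy example the unconnected middle gene of *S* breaks the simple adjacency, while the 2-adjacency spanning it is still found.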

## 5 Complexity of Problem 2

While one may hope that the intermediate status of the gene connection graph between the gene family graph and the gene similarity graph allows more efficient algorithms than for the more complex gene similarity graph, this is not the case for the gene matching model.

Only for \(\alpha =0\), we have \(\mathcal {F}_\alpha (M) = edg (M) = |M|\) and therefore Problem 2 reduces to computing a maximum bipartite matching, which is possible in polynomial time [18]. However, this case is not very interesting because it completely ignores conserved adjacencies and just compares the gene content of the two genomes. All interesting cases are more difficult to solve, as the following theorem shows:

**Theorem 1**

Problem 2 is NP-hard for \(0 < \alpha \le 1\).

*Proof*

We will focus on simple adjacencies (\(\theta = 1\)), as this is sufficient to prove Theorem 1. Inspired by the proof of Bryant [5] for the family-based case, we provide a polynomial-time reduction from Vertex Cover: Given a graph \(\mathcal {G} = (V,E)\) and an integer \(\lambda \), does there exist a subset \(V' \subseteq V\) such that \(|V'| = \lambda \) and each edge in *E* is incident to at least one vertex in \(V'\)?

Our reduction transforms an instance of Vertex Cover into an instance of the decision version of Problem 2: Given strings *S* and *T*, a gene connection graph *G*(*S*, *T*), a real value \(\alpha \), \(0 < \alpha \le 1\), and a real value \(F \ge 0\), does there exist a bipartite matching *M* in *G*(*S*, *T*) such that \(\mathcal {F}_\alpha (M) \ge F\)?

*S* and *T* are constructed as follows:

*G*(*S*, *T*) has an edge for each pair of identical symbols *S*[*i*] and *T*[*j*]. The parameter \(\alpha \) may be chosen arbitrarily within the range \(0 < \alpha \le 1\).

First, we show that among the matchings maximizing the value \(\mathcal {F}_\alpha \) for this problem, there is always at least one which is a maximal matching. Let *M* be a non-maximal matching in *G*(*S*, *T*) maximizing \(\mathcal {F}_\alpha \) and consider an edge \(\ell \not \in M\) that may be added to *M*, forming a new matching \(M' = M \cup \{\ell \}\). Clearly, \(\ell \) can dismiss at most two adjacencies of *M* in \(M'\), so \(adj(M')\ge adj(M)-2\). But in our construction, where the symbols of \(\mathcal {A}\) (except the \(e_i\) and \(e_i'\)) are in reverse order in *S* relative to *T*, and furthermore each \(e_i\) and each \(e_i'\) is between \(x_i\) and \(x_{i+1}\) in *S*, any new edge \(\ell \) added to *M* can dismiss at most one adjacency: If \(\ell \) is incident to a symbol *a* and the symbol \(a'\) is incident to another edge \(\ell '\in M\) (or vice versa) then \(adj(M')= adj(M)+1\). Moreover, if two partner edges \(\ell ,\ell '\not \in M\) are added to *M* and thus \(M'=M\cup \{\ell ,\ell '\}\), then \(adj(M') \ge adj(M)\) and \(edg(M')=edg(M)+2\). Therefore \(\mathcal {F}_\alpha (M') > \mathcal {F}_\alpha (M)\) for \(\alpha <1\) and \(\mathcal {F}_\alpha (M')\ge \mathcal {F}_\alpha (M)\) for \(\alpha =1\).

Next, we show that there is a vertex cover of size \(\lambda \) for a graph \(\mathcal {G}\) if and only if Problem 2 has a solution with \(F = \alpha (2m+1+(n-\lambda )) + (1-\alpha ) (2n+4m+2)\). Note that by construction of *S*, *T* and *G*(*S*, *T*), conserved adjacencies in a maximal matching are only possible between pairs of the same symbol of \(\mathcal {A}\), i.e. \(v_i v_i'\), \(e_i e_i'\) or \(x_i x_i'\). Therefore we can simplify the notation and represent an adjacency \((i,i'||j,j')\) by the pair of elements in *S*, \(S[i] S[i']\). Clearly, any maximal matching of *G*(*S*, *T*) has \(|S| = 2n+4m+2\) edges. Moreover, any maximal matching realizes at least the \(2m+1\) conserved adjacencies \(e_i e_i'\) and \(x_i x_i'\). The other possible adjacencies are the \(v_i v_i'\). If there exists a solution with value \(F = \alpha (2m+1+(n-\lambda )) + (1-\alpha )|S|\), then there are at least \(n-\lambda \) adjacencies involving \(v_i v_i'\). These adjacencies are possible if the respective edges of \(\mathcal {G}\) are covered by \(\lambda \) vertices. If we do not have a solution with value *F*, then \(\mathcal {G}\) does not have a vertex cover of size \(\lambda \). \(\square \)

To solve Problem 2 for simple adjacencies, we make use of a method described in [19] that was originally developed for solving the gene family-free variant of Problem 2. It constructs an integer linear program (ILP) similar to program FFAdj-Int described in [10]. It includes a preprocessing algorithm that identifies small components in gene similarity graphs which are part of an optimal solution. This approach enables the computation of optimal solutions for small and medium-sized gene similarity graphs. However, as the method is specifically tailored for gene family-free analysis, it does not perform very efficiently on gene connection graphs, as we will see in Sect. 7. We refer to this ILP and its preprocessing step as Algorithm 2.

We further believe it will be difficult to develop a practical algorithm solving Problem 2 for generalized adjacencies.

## 6 Computing Exact Solutions for Problem 3

We present a polynomial time algorithm solving Problem 3 for simple adjacencies which makes use of the following graph structure:

**Definition 6**

**(conserved adjacencies graph).** Given two genomes *S* and *T* and a set \(C = \{(i_1,i'_1 || j_1,j'_1), \ldots , (i_n,i'_n || j_n,j'_n)\}\) of conserved adjacencies between *S* and *T*, the *conserved adjacencies graph* \(A_C(S,T)\) is a bipartite graph with one vertex for each gene adjacency \((i,i')\) of *S* that occurs in *C* and one vertex for each gene adjacency \((j,j')\) of *T* that occurs in *C*. The edges correspond to the conserved adjacencies in *C*.

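For simple adjacencies, two conserved adjacencies conflict exactly when they share an adjacency of *S* or of *T*, so a maximum cardinality set of non-conflicting conserved adjacencies corresponds to a maximum cardinality matching in \(A_C(S,T)\). Computing a maximum cardinality matching in a bipartite graph with *n* vertices and *m* edges is possible in \(O(m \sqrt{n})\) time [18]. In our case \(n \le |S|+|T|-2\) and \(m \le n^2\), therefore Algorithm 3 takes overall \(O((|S|+|T|)^{5/2})\) time.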
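The matching step can be sketched without external libraries. The Python code below computes a maximum cardinality matching in the conserved adjacencies graph for simple adjacencies. It deliberately uses the simple augmenting-path method (often attributed to Kuhn), which runs in \(O(nm)\) time, rather than the Hopcroft-Karp algorithm behind the \(O(m \sqrt{n})\) bound; the function name and the tuple representation are our own assumptions.

```python
def max_nonconflicting_simple(conserved):
    """Size of a maximum matching in the conserved adjacencies graph.

    conserved: iterable of tuples (i, i+1, j, j+1); the adjacency of S is
    identified by i and the adjacency of T by j.
    """
    neighbours = {}
    for i, _, j, _ in conserved:
        neighbours.setdefault(i, []).append(j)
    match_T = {}  # adjacency of T -> adjacency of S currently matched to it

    def augment(i, visited):
        # Try to match i, re-matching earlier choices along an augmenting path.
        for j in neighbours[i]:
            if j in visited:
                continue
            visited.add(j)
            if j not in match_T or augment(match_T[j], visited):
                match_T[j] = i
                return True
        return False

    return sum(augment(i, set()) for i in neighbours)


conserved = {(0, 1, 0, 1), (0, 1, 1, 2), (1, 2, 1, 2)}
print(max_nonconflicting_simple(conserved))  # 2
```

The recursion keeps the sketch short; for graphs of the size studied in Sect. 7 an iterative implementation or a library routine would be the more robust choice.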

*Extension to Generalized Adjacencies.* Unlike for the first two problems, the properties of Problem 3 change drastically when generalized adjacencies are considered. Because a \(\theta \)-adjacency corresponds to an interval of up to \(\theta +1\) consecutive genes, the intervals of two \(\theta \)-adjacencies for \(\theta \ge 2\) can overlap on more than two genes, or even be contained in one another. The complexity of Problem 3 for \(\theta \ge 2\) remains an open question.

For generalized adjacencies, Algorithm 3’ solves Problem 3 by means of an integer linear program (ILP) with two kinds of binary variables: one for each edge (*i*, *j*) in the gene connection graph *G*(*S*, *T*), and \(\mathbf b (i, i', j, j')\) for each \(\theta \)-adjacency \((i, i'|| j, j')\) in \(C_\theta \). We say that a binary variable is *saturated* if it is assigned value 1. While maximizing the number of saturated \(\mathbf b (.)\) variables (which represents the output of the program), our ILP imposes matching constraints \((\texttt {C.01})\) for the set of edges in selected \(\theta \)-adjacencies. Further constraints \((\texttt {C.02})\) ensure that for each \(\theta \)-adjacency \((i, i'||j, j')\) (a) both edges between its corresponding genes are saturated and (b) no saturated edge is incident to a gene in interval \([i+1, i'-1]\) of genome *S* (i.e. a possibly empty interval corresponding to all genes between *i* and \(i'\)) and interval \([j+1, j'-1]\) of genome *T*, respectively.

## 7 Experimental Results

*Genomic Dataset.* We study genomes of 18 rosid species (see Table 1). Rosids are a prominent subclass of flowering plants to which many agricultural crops also belong. The genomic sequences of the studied species were obtained from *Phytozome* [20], an online resource of the Joint Genome Institute providing databases and tools for comparative genomics analyses of plant genomes. Most of the studied plant genomes are partially assembled, comprising up to 5,000 scaffolds covering one or more annotated protein coding genes. While the smallest genome in our data set contains roughly 24,500 genes, the largest contains roughly 56,000 genes, more than twice as many. Rosids, just like many other plants, met their evolutionary fate through multiple events of whole genome duplication, followed by periods of fractionation, in which many duplicated genes were lost again.

*Construction of Gene Connection and Gene Family Graphs.* Next to the genomic sequences and gene annotations, Phytozome also provides gene family information in the form of co-orthologous clusters computed by InParanoid [21]. InParanoid follows a seed-based strategy by identifying pairs of orthologous genes (the “seeds”) through reciprocal best BLASTP hits. These are subsequently used to recruit inparalogs, eventually forming groups of co-orthologous genes.

Table 1. The genomic dataset of 18 rosid species used in our experiments.

| Species | Version | # genes | # scaffolds | Reference |
|---|---|---|---|---|
| | TAIR10 | 27,416 | 7 | [22] |
| | FPSc v1.3 | 40,492 | 669 | [20] |
| | v1.2 | 27,416 | 854 | [20] |
| | v1.0 | 24,533 | 94 | [23] |
| | v1.0 | 26,521 | 123 | [24] |
| | v1.1 | 36,376 | 1,315 | [25] |
| | v1.0 | 26,351 | 61 | [26] |
| | v1.1 | 32,831 | 8 | [27] |
| | Wm82.a2 | 56,044 | 147 | [28] |
| | v2.1 | 37,505 | 133 | [29] |
| | v1.0 | 43,471 | 1,028 | [30] |
| | Mt4.0v1 | 50,894 | 1,033 | [31] |
| | v1.0 | 27,864 | 59 | [32] |
| | v3.0 | 41,335 | 379 | [33] |
| | v1.0 | 27,197 | 91 | [34] |
| | v0.1 | 31,221 | 4,962 | [35] |
| | v1.1 | 29,452 | 99 | [36] |
| | Genoscope.12X | 26,346 | 33 | [37] |

We ran BLASTP on all genes of our dataset using an e-value threshold of \(10^{-5}\) and otherwise default parameter settings. We then constructed gene connection graphs for all 153 genome pairs by establishing edges between vertices whose corresponding genes share reciprocal BLASTP hits. We refer to these graphs as *BLASTP GC graphs*. Similarly, we constructed pairwise gene family graphs using InParanoid’s homology assignment, which we refer to as *InParanoid GF graphs*.
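The graph construction can be illustrated as follows. The Python sketch assumes one plausible reading of "reciprocal BLASTP hits", namely that a gene of one genome hits a gene of the other and vice versa; the hit lists are taken as already filtered by the e-value threshold. Gene identifiers and the function name are hypothetical. The last lines confirm the number of genome pairs among 18 species.

```python
from itertools import combinations


def reciprocal_edges(hits_S_to_T, hits_T_to_S):
    """Keep a pair (g, h) only if g hits h and h hits g."""
    reverse = set(hits_T_to_S)
    return {(g, h) for g, h in hits_S_to_T if (h, g) in reverse}


# Hypothetical hit lists as (query, subject) pairs.
forward = {("a", "x"), ("b", "y")}
backward = {("x", "a"), ("z", "b")}
print(reciprocal_edges(forward, backward))  # {('a', 'x')}

# Number of unordered pairs among 18 genomes: 18 * 17 / 2.
n_pairs = sum(1 for _ in combinations(range(18), 2))
print(n_pairs)  # 153
```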

Unsurprisingly, the BLASTP GC graphs are much larger in size than the InParanoid GF graphs. We observed average sizes of 150,000 edges for the former, whereas the latter graphs had on average only one fifth of this size. Moreover, only 4 % of edges in InParanoid GF graphs were not contained in their BLASTP GC counterparts. Lacking ground truth of homologies in our dataset, we take a conservative stance by assuming that InParanoid’s homology assignment can be considered true, or, in other words, that it contains only a negligible number of false positives. However, we conclude from a previous study [38], in which InParanoid (as well as all other gene family prediction tools in that study) exhibited a poor recall, that the homology assignment may be incomplete. That being said, we regard the edges of BLASTP GC graphs with suspicion, assuming that many of them lead to false positive homology assignments. We perform subsequent analysis to outline a possible procedure for identifying additional potential homologies that are supported by conservation in gene order in BLASTP GC graphs.

*Implementation.* All computations were performed on a Linux machine using a single 2.3 GHz CPU. We implemented Algorithms 1, 1’, 3, and 3’ in Python. For Algorithm 2 we used the implementation of [19]. In Algorithm 3, the maximum cardinality matching was computed using an implementation of Hopcroft and Karp’s algorithm [18] provided by the Python-based NetworkX library. The ILPs of Algorithms 2 and 3’ were run using CPLEX, a solver for various types of linear and quadratic programs.

*Runtimes.* The runtimes of Algorithms 1 and 3 are shown in Fig. 3 (left). The runtime analysis was repeated 5 times and is visualized by whisker plots. For each of the 153 BLASTP GC graphs in our dataset, the computation was finished in less than 50 CPU seconds. Moreover, our evaluation reveals that the enumeration of the set of conserved adjacencies in our dataset often requires more time than the subsequent computation of the maximum matching for Algorithm 3. The plot on the right side of Fig. 3 shows that the runtimes of Algorithm 1’ for \(\theta =2,3,4\) increase only moderately for higher values of \(\theta \).

*R. communis* and *V. vinifera* within 36 hours of computation. Surprisingly, running Algorithm 2 with \(\alpha =0.1\) for the same amount of time, we were able to obtain a suboptimal solution for which CPLEX reported an optimality gap of only 1.89 %. Nevertheless, as a reference for comparison with our various models, it would be even more informative to have optimal solutions of these problems. We leave it as an open problem whether it is possible to improve our ILPs in order to achieve this.
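For readers unfamiliar with the term, the optimality gap relates the best solution found so far to the best bound proven by the solver. The sketch below follows the usual definition of the relative MIP gap; the helper name `relative_gap` and the numbers in the example are hypothetical and are not values from our experiments.

```python
def relative_gap(best_objective, best_bound, eps=1e-10):
    """Relative gap between the incumbent objective and the best proven bound."""
    # eps guards against division by zero when the incumbent objective is 0.
    return abs(best_bound - best_objective) / (eps + abs(best_objective))

# Hypothetical example: an incumbent of 1000 and a bound of 1018.9
# correspond to a gap of about 1.89 %.
gap = relative_gap(1000.0, 1018.9)
print(f"{100 * gap:.2f} %")
```

A gap of zero would certify that the incumbent is optimal; a small positive gap only bounds how far it can be from the optimum.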

Further, we were able to compute exact results for Problem 3 with \(\theta =2\) using Algorithm 3’ for all but 19 of the 153 BLASTP GC graphs and all but 16 of the InParanoid GC graphs, limiting computation time to two hours per graph instance.

*Gene Connection vs. Gene Family Graphs.* The overlap between the sets of conserved simple adjacencies identified in BLASTP GC graphs and in InParanoid GF graphs is visualized in the left plot of Fig. 4. Overall, 70 % of the conserved adjacencies of the InParanoid GF graphs were also found in the BLASTP GC graphs, whereas the latter contain 90 % more conserved adjacencies than the former. Investigating the high number of InParanoid adjacencies that are missing in BLASTP GC graphs, we discovered that many generalized adjacencies of the former span genes that are connected in their BLASTP GC counterparts and therefore break the surrounding adjacency. However, the mean number of connected intervening genes was only 1.4. In fact, the overlap of 2-adjacencies in BLASTP GC graphs with 1-adjacencies of InParanoid GF graphs amounted to 83 % of all adjacencies in the latter (Fig. 4, right plot).
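The two percentages in this comparison are plain set statistics. The following sketch shows how they could be computed, assuming that adjacencies from both graph types are encoded by comparable identifiers; the function name `overlap_stats` and the toy sets are hypothetical and merely chosen to reproduce the proportions of 70 % and 90 %.

```python
def overlap_stats(gf_adjacencies, gc_adjacencies):
    """Return (share of GF adjacencies also found in GC, relative surplus of GC)."""
    shared = gf_adjacencies & gc_adjacencies
    recovered = len(shared) / len(gf_adjacencies)
    surplus = (len(gc_adjacencies) - len(gf_adjacencies)) / len(gf_adjacencies)
    return recovered, surplus

# Hypothetical toy data: 10 GF adjacencies, 19 GC adjacencies, 7 in common.
gf = set(range(10))
gc = set(range(3, 22))
recovered, surplus = overlap_stats(gf, gc)
print(f"recovered: {recovered:.0%}, surplus: {surplus:.0%}")
```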

## 8 Conclusion

We have presented new similarity measures for complete genomes, thereby defining gene connections as an intermediate model of genome similarity representations, between gene families and the gene family-free approach. Our theoretical results, with some problem variants being polynomial and others NP-hard, show that we are very close to the hardness border of similarity computations between genomes with unrestricted gene content. On the practical side, we could show that the computation of genomic similarities in the gene connection model gives meaningful results and is possible in reasonable time if the measures and algorithms are designed carefully.

A few questions remain open, though. While Problem 3 is polynomial for \(\theta =1\), the complexity for larger values of \(\theta \) is unknown. Moreover, it is always difficult to choose optimal values for parameters like the gap threshold \(\theta \). It will certainly be worthwhile to examine whether parameter estimation methods for generalized adjacencies, such as those developed in [39], can be adapted to the gene connection model.

Various model extensions can also be envisaged. The adjacency matching model (Problem 3) removes inconsistencies from the output of the total adjacencies model (Problem 1) by computing a maximum matching on it. It could be tested whether other criteria to remove genes from the genomes and thus derive consistent sets of conserved adjacencies yield even better results. Moreover, the resulting reduced genomes with conserved adjacencies could be used to predict orthologies between the involved genes, not only to compute genomic similarities.

An alternative objective function for our problem formulations, instead of counting (generalized) gene adjacencies, could be a variant of the *summed adjacency disruption number* [40], which also makes it possible to distinguish between small and larger distortions in gene order.

Finally, Algorithm 3 can easily be generalized for weighted gene similarities (instead of gene connections). It remains to be evaluated if such a more fine-grained measure in the spirit of a family-free analysis has advantages compared to the unit-cost measures studied in this paper.
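One conceivable form of such a weighted variant is sketched below with NetworkX's general maximum weight matching routine. How gene similarities would be combined into a weight per conserved adjacency is not specified in this paper; the edge weights, node labels and the list `weighted_conserved` are therefore purely hypothetical.

```python
import networkx as nx

# Hypothetical toy input: (adjacency in G, adjacency in H, assumed similarity weight).
weighted_conserved = [
    ("G:1-2", "H:1-2", 0.9),
    ("G:1-2", "H:4-5", 0.3),
    ("G:2-3", "H:1-2", 0.4),
    ("G:3-4", "H:2-3", 0.8),
]

graph = nx.Graph()
graph.add_weighted_edges_from(weighted_conserved)

# Maximize the total weight; the result need not be a maximum cardinality matching.
matching = nx.max_weight_matching(graph, maxcardinality=False)
total_weight = sum(graph[u][v]["weight"] for u, v in matching)
print(len(matching), round(total_weight, 2))
```

In this toy instance the weighted optimum keeps two heavily weighted pairs, whereas a maximum cardinality matching would select three pairs of lower total weight, which illustrates how the two objectives can diverge.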

## Notes

### Acknowledgements

The research of LABK and SD is partially supported by FAPERJ and CNPq. This work was performed while JS was on sabbatical as Special Visiting Researcher at UFF in Niterói, Brazil, funded by Ciência sem Fronteiras/CAPES.

### References

- 1. Sankoff, D.: Edit distance for genome comparison based on non-local operations. In: Apostolico, A., Galil, Z., Manber, U., Crochemore, M. (eds.) CPM 1992. LNCS, vol. 644, pp. 121–135. Springer, Heidelberg (1992)
- 2. Hannenhalli, S., Pevzner, P.A.: Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J. ACM **46**(1), 1–27 (1999)
- 3. Yancopoulos, S., Attie, O., Friedberg, R.: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics **21**(16), 3340–3346 (2005)
- 4. Bergeron, A., Mixtacki, J., Stoye, J.: A new linear time algorithm to compute the genomic distance via the double cut and join distance. Theor. Comput. Sci. **410**(51), 5300–5316 (2009)
- 5. Bryant, D.: The complexity of calculating exemplar distances. In: Sankoff, D., Nadeau, J.H. (eds.) Comparative Genomics. Computational Biology Series, vol. 1, pp. 207–211. Kluwer Academic Publishers, London (2000)
- 6. Chen, X., Zheng, J., Fu, Z., Nan, P., Zhong, Y., Lonardi, S., Jiang, T.: Assignment of orthologous genes via genome rearrangement. IEEE/ACM Trans. Comput. Biol. Bioinform. **2**(4), 302–315 (2005)
- 7. Angibaud, S., Fertin, G., Rusu, I., Thévenin, A., Vialette, S.: Efficient tools for computing the number of breakpoints and the number of adjacencies between two genomes with duplicate genes. J. Comput. Biol. **15**(8), 1093–1115 (2008)
- 8. Bulteau, L., Jiang, M.: Inapproximability of (1,2)-exemplar distance. IEEE/ACM Trans. Comput. Biol. Bioinform. **10**(6), 1384–1390 (2012)
- 9. Shao, M., Lin, Y., Moret, B.M.E.: An exact algorithm to compute the double-cut-and-join distance for genomes with duplicate genes. J. Comput. Biol. **22**(5), 425–435 (2015)
- 10. Doerr, D., Thévenin, A., Stoye, J.: Gene family assignment-free comparative genomics. BMC Bioinform. **13**(Suppl. 19), S3 (2012)
- 11. Braga, M.D.V., Chauve, C., Doerr, D., Jahn, K., Stoye, J., Thévenin, A., Wittler, R.: The potential of family-free genome comparison. In: Chauve, C., El-Mabrouk, N., Tannier, E. (eds.) Models and Algorithms for Genome Evolution. Computational Biology Series, vol. 19, pp. 287–307. Springer, London (2013)
- 12. Doerr, D., Stoye, J., Böcker, S., Jahn, K.: Identifying gene clusters by discovering common intervals in indeterminate strings. BMC Bioinform. **15**(Suppl. 6), S2 (2014)
- 13. Martinez, F.V., Feijão, P., Braga, M.D.V., Stoye, J.: On the family-free DCJ distance and similarity. Algorithms Mol. Biol. **10**, 13 (2015)
- 14. Zhu, Q., Adam, Z., Choi, V., Sankoff, D.: Generalized gene adjacencies, graph bandwidth, and clusters in yeast evolution. IEEE/ACM Trans. Comput. Biol. Bioinform. **6**(2), 213–220 (2009)
- 15. Sankoff, D.: Genome rearrangement with gene families. Bioinformatics **15**(11), 909–917 (1999)
- 16. Blanchette, M., Kunisawa, T., Sankoff, D.: Gene order breakpoint evidence in animal mitochondrial phylogeny. J. Mol. Evol. **49**(2), 193–203 (1999)
- 17. Tannier, E., Zheng, C., Sankoff, D.: Multichromosomal median and halving problems under different genomic distances. BMC Bioinform. **10**, 120 (2009)
- 18. Hopcroft, J.E., Karp, R.M.: An \(n^{5/2}\) algorithm for maximum matchings in bipartite graphs. SIAM J. Comput. **2**(4), 225–231 (1973)
- 19. Doerr, D.: Gene family-free genome comparison. Ph.D. thesis, Faculty of Technology, Bielefeld University, Germany (2015)
- 20. Goodstein, D.M., Shu, S., Howson, R., Neupane, R., Hayes, R.D., Fazo, J., Mitros, T., Dirks, W., Hellsten, U., Putnam, N., Rokhsar, D.S.: Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. **40**(Database issue), D1178–D1186 (2012)
- 21. Sonnhammer, E.L.L., Östlund, G.: Inparanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res. **43**(Database issue), D234–D239 (2015)
- 22. Lamesch, P., Berardini, T.Z., Li, D., Swarbreck, D., Wilks, C., Sasidharan, R., Muller, R., Dreher, K., Alexander, D.L., Garcia-Hernandez, M., Karthikeyan, A.S., Lee, C.H., Nelson, W.D., Ploetz, L., Singh, S., Wensel, A., Huala, E.: The arabidopsis information resource (tair): improved gene annotation and new tools. Nucleic Acids Res. **40**(Database issue), D1202–D1210 (2011)
- 23. Wu, G.A., Prochnik, S., Jenkins, J., Salse, J., Hellsten, U., Murat, F., Perrier, X., Ruiz, M., Scalabrin, S., Terol, J., Takita, M.A., Labadie, K., Poulain, J., Couloux, A., Jabbari, K., Cattonaro, F., Del Fabbro, C., Pinosio, S., Zuccolo, A., Chapman, J., Grimwood, J., Tadeo, F.R., Estornell, L.H., Muñoz-Sanz, J.V., Ibanez, V., Herrero-Ortega, A., Aleza, P., Pérez-Pérez, J., Ramón, D., Brunel, D., Luro, F., Chen, C., Farmerie, W.G., Desany, B., Kodira, C., Mohiuddin, M., Harkins, T., Fredrikson, K., Burns, P., Lomsadze, A., Mark, B., Reforgiato, G., Freitas-Astúa, J., Quetier, F., Navarro, L., Roose, M., Wincker, P., Schmutz, J., Morgante, M., Machado, M.A., Talón, M., Jaillon, O., Ollitrault, P., Gmitter, F., Rokhsar, D.: Sequencing of diverse mandarin, pummelo and orange genomes reveals complex history of admixture during citrus domestication. Nat. Biotechnol. **32**(7), 656–662 (2014)
- 24. Slotte, T., Hazzouri, K.M., Ågren, J.A., Koenig, D., Maumus, F., Guo, Y.-L., Steige, K., Platts, A.E., Escobar, J.S., Newman, L.K., Wang, W., Mandáková, T., Vello, E., Smith, L.M., Henz, S.R., Steffen, J., Takuno, S., Brandvain, Y., Coop, G., Andolfatto, P., Hu, T.T., Blanchette, M., Clark, R.M., Quesneville, H., Nordborg, M., Gaut, B.S., Lysak, M.A., Jenkins, J., Grimwood, J., Chapman, J., Prochnik, S., Shu, S., Rokhsar, D., Schmutz, J., Weigel, D., Wright, S.I.: The Capsella rubella genome and the genomic consequences of rapid mating system evolution. Nat. Genet. **45**(7), 831–835 (2013)
- 25. Bartholomé, J., Mandrou, E., Mabiala, A., Jenkins, J., Nabihoudine, I., Klopp, C., Schmutz, J., Plomion, C., Gion, J.-M.: High-resolution genetic maps of eucalyptus improve Eucalyptus grandis genome assembly. New Phytol. **206**(4), 1283–1296 (2015)
- 26. Yang, R., Jarvis, D.E., Chen, H., Beilstein, M.A., Grimwood, J., Jenkins, J., Shu, S., Prochnik, S., Xin, M., Ma, C., Schmutz, J., Wing, R.A., Mitchell-Olds, T., Schumaker, K.S., Wang, X.: The reference genome of the halophytic plant *Eutrema salsugineum*. Front. Plant Sci. **4**, 46 (2013)
- 27. Shulaev, V., Sargent, D.J., Crowhurst, R.N., Mockler, T.C., Folkerts, O., Delcher, A.L., Jaiswal, P., Mockaitis, K., Liston, A., Mane, S.P., Burns, P., Davis, T.M., Slovin, J.P., Bassil, N., Hellens, R.P., Evans, C., Harkins, T., Kodira, C., Desany, B., Crasta, O.R., Jensen, R.V., Allan, A.C., Michael, T.P., Setubal, J.C., Celton, J.-M., Rees, D.J.G., Williams, K.P., Holt, S.H., Rojas, J.J.R., Chatterjee, M., Liu, B., Silva, H., Meisel, L., Adato, A., Filichkin, S.A., Troggio, M., Viola, R., Ashman, T.-L., Wang, H., Dharmawardhana, P., Elser, J., Raja, R., Priest, H.D., Bryant, D.W., Fox, S.E., Givan, S.A., Wilhelm, L.J., Naithani, S., Christoffels, A., Salama, D.Y., Carter, J., Girona, E.L., Zdepski, A., Wang, W., Kerstetter, R.A., Schwab, W., Korban, S.S., Davik, J., Monfort, A., Denoyes-Rothan, B., Arus, P., Mittler, R., Flinn, B., Aharoni, A., Bennetzen, J.L., Salzberg, S.L., Dickerman, A.W., Velasco, R., Borodovsky, M., Veilleux, R.E., Folta, K.M.: The genome of woodland strawberry (Fragaria vesca). Nat. Genet. **43**(2), 109–116 (2011)
- 28. Schmutz, J., Cannon, S.B., Schlueter, J., Ma, J., Mitros, T., Nelson, W., Hyten, D.L., Song, Q., Thelen, J.J., Cheng, J., Xu, D., Hellsten, U., May, G.D., Yu, Y., Sakurai, T., Umezawa, T., Bhattacharyya, M.K., Sandhu, D., Valliyodan, B., Lindquist, E., Peto, M., Grant, D., Shu, S., Goodstein, D., Barry, K., Futrell-Griggs, M., Abernathy, B., Du, J., Tian, Z., Zhu, L., Gill, N., Joshi, T., Libault, M., Sethuraman, A., Zhang, X.-C., Shinozaki, K., Nguyen, H.T., Wing, R.A., Cregan, P., Specht, J., Grimwood, J., Rokhsar, D., Stacey, G., Shoemaker, R.C., Jackson, S.A.: Genome sequence of the palaeopolyploid soybean. Nature **463**(7278), 178–183 (2010)
- 29. Paterson, A.H., Wendel, J.F., Gundlach, H., Guo, H., Jenkins, J., Jin, D., Llewellyn, D., Showmaker, K.C., Shu, S., Udall, J., Yoo, M.-J., Byers, R., Chen, W., Doron-Faigenboim, A., Duke, M.V., Gong, L., Grimwood, J., Grover, C., Grupp, K., Hu, G., Lee, T.-H., Li, J., Lin, L., Liu, T., Marler, B.S., Page, J.T., Roberts, A.W., Romanel, E., Sanders, W.S., Szadkowski, E., Tan, X., Tang, H., Xu, C., Wang, J., Wang, Z., Zhang, D., Zhang, L., Ashrafi, H., Bedon, F., Bowers, J.E., Brubaker, C.L., Chee, P.W., Das, S., Gingle, A.R., Haigler, C.H., Harker, D., Hoffmann, L.V., Hovav, R., Jones, D.C., Lemke, C., Mansoor, S., Rahman, M.U., Rainville, L.N., Rambani, A., Reddy, U.K., Rong, J.-K., Saranga, Y., Scheffler, B.E., Scheffler, J.A., Stelly, D.M., Triplett, B.A., Van Deynze, A., Vaslin, M.F.S., Waghmare, V.N., Walford, S.A., Wright, R.J., Zaki, E.A., Zhang, T., Dennis, E.S., Mayer, K.F.X., Peterson, D.G., Rokhsar, D.S., Wang, X., Schmutz, J.: Repeated polyploidization of gossypium genomes and the evolution of spinnable cotton fibres. Nature **492**(7429), 423–427 (2012)
- 30. Wang, Z., Hobson, N., Galindo, L., Zhu, S., Shi, D., McDill, J., Yang, L., Hawkins, S., Neutelings, G., Datla, R., Lambert, G., Galbraith, D.W., Grassa, C.J., Geraldes, A., Cronk, Q.C., Cullis, C., Dash, P.K., Kumar, P.A., Cloutier, S., Sharpe, A.G., Wong, G.K.S., Wang, J., Deyholos, M.K.: The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads. Plant J. **72**(3), 461–473 (2012)
- 31. Young, N.D., Debellé, F., Oldroyd, G.E.D., Geurts, R., Cannon, S.B., Udvardi, M.K., Benedito, V.A., Mayer, K.F.X., Gouzy, J., Schoof, H., Van de Peer, Y., Proost, S., Cook, D.R., Meyers, B.C., Spannagl, M., Cheung, F., De Mita, S., Krishnakumar, V., Gundlach, H., Zhou, S., Mudge, J., Bharti, A.K., Murray, J.D., Naoumkina, M.A., Rosen, B., Silverstein, K.A.T., Tang, H., Rombauts, S., Zhao, P.X., Zhou, P., Barbe, V., Bardou, P., Bechner, M., Bellec, A., Berger, A., Bergès, H., Bidwell, S., Bisseling, T., Choisne, N., Couloux, A., Denny, R., Deshpande, S., Dai, X., Doyle, J.J., Dudez, A.-M., Farmer, A.D., Fouteau, S., Franken, C., Gibelin, C., Gish, J., Goldstein, S., González, A.J., Green, P.J., Hallab, A., Hartog, M., Hua, A., Humphray, S.J., Jeong, D.-H., Jing, Y., Jöcker, A., Kenton, S.M., Kim, D.-J., Klee, K., Lai, H., Lang, C., Lin, S., Macmil, S.L., Magdelenat, G., Matthews, L., McCorrison, J., Monaghan, E.L., Mun, J.-H., Najar, F.Z., Nicholson, C., Noirot, C., O’Bleness, M., Paule, C.R., Poulain, J., Prion, F., Qin, B., Qu, C., Retzel, E.F., Riddle, C., Sallet, E., Samain, S., Samson, N., Sanders, I., Saurat, O., Scarpelli, C., Schiex, T., Segurens, B., Severin, A.J., Sherrier, D.J., Shi, R., Sims, S., Singer, S.R., Sinharoy, S., Sterck, L., Viollet, A., Wang, B.-B., Wang, K., Wang, M., Wang, X., Warfsmann, J., Weissenbach, J., White, D.D., White, J.D., Wiley, G.B., Wincker, P., Xing, Y., Yang, L., Yao, Z., Ying, F., Zhai, J., Zhou, L., Zuber, A., Dénarié, J., Dixon, R.A., May, G.D., Schwartz, D.C., Rogers, J., Quetier, F., Town, C.D., Roe, B.A.: The medicago genome provides insight into the evolution of rhizobial symbioses. Nature **480**(7378), 520–524 (2011)
- 32. Verde, I., Abbott, A.G., Scalabrin, S., Jung, S., Shu, S., Marroni, F., Zhebentyayeva, T., Dettori, M.T., Grimwood, J., Cattonaro, F., Zuccolo, A., Rossini, L., Jenkins, J., Vendramin, E., Meisel, L.A., Decroocq, V., Sosinski, B., Prochnik, S., Mitros, T., Policriti, A., Cipriani, G., Dondini, L., Ficklin, S., Goodstein, D.M., Xuan, P., Del Fabbro, C., Aramini, V., Copetti, D., Gonzalez, S., Horner, D.S., Falchi, R., Lucas, S., Mica, E., Maldonado, J., Lazzari, B., Bielenberg, D., Pirona, R., Miculan, M., Barakat, A., Testolin, R., Stella, A., Tartarini, S., Tonutti, P., Arus, P., Orellana, A., Wells, C., Main, D., Vizzotto, G., Silva, H., Salamini, F., Schmutz, J., Morgante, M., Rokhsar, D.S.: The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat. Genet. **45**(5), 487–494 (2013)
- 33. Du, Q., Wang, L., Yang, X., Gong, C., Zhang, D.: Populus endo-\(\beta \)-1,4-glucanases gene family: genomic organization, phylogenetic analysis, expression profiles and association mapping. Planta **241**(6), 1417–1434 (2015)
- 34. Schmutz, J., McClean, P.E., Mamidi, S., Wu, G.A., Cannon, S.B., Grimwood, J., Jenkins, J., Shu, S., Song, Q., Chavarro, C., Torres-Torres, M., Geffroy, V., Moghaddam, S.M., Gao, D., Abernathy, B., Barry, K., Blair, M., Brick, M.A., Chovatia, M., Gepts, P., Goodstein, D.M., Gonzales, M., Hellsten, U., Hyten, D.L., Jia, G., Kelly, J.D., Kudrna, D., Lee, R., Richard, M.M.S., Miklas, P.N., Osorno, J.M., Rodrigues, J., Thareau, V., Urrea, C.A., Wang, M., Yu, Y., Zhang, M., Wing, R.A., Cregan, P.B., Rokhsar, D.S., Jackson, S.A.: A reference genome for common bean and genome-wide analysis of dual domestications. Nat. Genet. **46**(7), 707–713 (2014)
- 35. Chan, A.P., Crabtree, J., Zhao, Q., Lorenzi, H., Orvis, J., Puiu, D., Melake-Berhan, A., Jones, K.M., Redman, J., Chen, G., Cahoon, E.B., Gedil, M., Stanke, M., Haas, B.J., Wortman, J.R., Fraser-Liggett, C.M., Ravel, J., Rabinowicz, P.D.: Draft genome sequence of the oilseed species Ricinus communis. Nat. Biotechnol. **28**(9), 951–956 (2010)
- 36. Motamayor, J.C., Mockaitis, K., Schmutz, J., Haiminen, N., Livingstone, D., Cornejo, O., Findley, S.D., Zheng, P., Utro, F., Royaert, S., Saski, C., Jenkins, J., Podicheti, R., Zhao, M., Scheffler, B.E., Stack, J.C., Feltus, F.A., Mustiga, G.M., Amores, F., Phillips, W., Marelli, J.P., May, G.D., Shapiro, H., Ma, J., Bustamante, C.D., Schnell, R.J., Main, D., Gilbert, D., Parida, L., Kuhn, D.N.: The genome sequence of the most widely cultivated cacao type and its use to identify candidate genes regulating pod color. Genome Biol. **14**(6), r53 (2012)
- 37. Jaillon, O., Aury, J.-M., Noel, B., Policriti, A., Clepet, C., Casagrande, A., Choisne, N., Aubourg, S., Vitulo, N., Jubin, C., Vezzi, A., Legeai, F., Hugueney, P., Dasilva, C., Horner, D., Mica, E., Jublot, D., Poulain, J., Bruyère, C., Billault, A., Segurens, B., Gouyvenoux, M., Ugarte, E., Cattonaro, F., Anthouard, V., Vico, V., Del Fabbro, C., Alaux, M., Di Gaspero, G., Dumas, V., Felice, N., Paillard, S., Juman, I., Moroldo, M., Scalabrin, S., Canaguier, A., Le Clainche, I., Malacrida, G., Durand, E., Pesole, G., Laucou, V., Chatelet, P., Merdinoglu, D., Delledonne, M., Pezzotti, M., Lecharny, A., Scarpelli, C., Artiguenave, F., Pè, M.E., Valle, G., Morgante, M., Caboche, M., Adam-Blondon, A.-F., Weissenbach, J., Quetier, F., Wincker, P.: The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature **449**(7161), 463–467 (2007)
- 38. Lechner, M., Hernandez-Rosales, M., Doerr, D., Wieseke, N., Thévenin, A., Stoye, J., Hartmann, R.K., Prohaska, S.J., Stadler, P.F.: Orthology detection combining clustering and synteny for very large datasets. PLoS ONE **9**(8), e105015 (2014)
- 39. Yang, Z., Sankoff, D.: Natural parameter values for generalized gene adjacencies. J. Comput. Biol. **17**(9), 1113–1128 (2010)
- 40. Delgado, J., Lynce, I., Manquinho, V.: Computing the summed adjacency disruption number between two genomes with duplicate genes. J. Comput. Biol. **17**(9), 1243–1265 (2010)