New Genome Similarity Measures Based on Conserved Gene Adjacencies

  • Luis Antonio B. Kowada
  • Daniel Doerr
  • Simone Dantas
  • Jens Stoye
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9649)

Abstract

Many important questions in molecular biology, evolution and biomedicine can be addressed by comparative genomics approaches. One of the basic tasks when comparing genomes is the definition of measures of similarity (or dissimilarity) between two genomes, for example to elucidate the phylogenetic relationships between species.

The power of different genome comparison methods varies with the underlying formal model of a genome. The simplest models impose the strong restriction that each genome under study must contain the same genes, each in exactly one copy. More realistic models allow several copies of a gene in a genome. One speaks of gene families, and comparative genomics methods that allow this kind of input are called gene family-based. The most powerful – but also most complex – models avoid this preprocessing of the input data and instead integrate the family assignment within the comparative analysis. Such methods are called gene family-free.

In this paper, we study an intermediate approach between family-based and family-free genomic similarity measures. The model, called gene connections, is on the one hand more flexible than the family-based model, on the other hand the resulting data structure is less complex than in the family-free approach. This intermediate status allows us to achieve results comparable to those for family-free methods, but at running times similar to those for the family-based approach.

Within the gene connection model, we define three variants of genomic similarity measures that have different expression power. We give polynomial-time algorithms for two of them, while we show NP-hardness of the third, most powerful one. We also generalize the measures and algorithms to make them more robust against recent local disruptions in gene order. Our theoretical findings are supported by experimental results that demonstrate the applicability and performance of our newly defined similarity measures.

1 Introduction

Many important questions in molecular biology, evolution and biomedicine can be addressed by comparative genomics approaches. One of the basic tasks in this area is the definition of measures of similarity between two genomes. Direct applications of such measures are the computation of phylogenetic trees or the reconstruction of ancestral genomes, but also more indirect tasks like the prediction of orthologous gene pairs (derived from the same ancestor gene through speciation) or the transfer of gene function across species profit immensely from accurate genome comparison methods.

Indeed, over the past forty-or-so years, many methods have been proposed to quantify the similarity of single genes, mostly based on pairwise or multiple sequence alignments. However, in many situations similarity measures based on whole genomes are more meaningful than gene-based measures, because they give a more representative picture and are more robust against side effects such as horizontal gene transfer. Therefore, in this paper we develop and analyze methods for whole genome comparison, based on the physical structure (gene order) of the genomes.

The simplest picture of a genome is one where, in a set of genomes under study, orthologous genes have been identified beforehand, and only those groups of orthologous genes (also known as gene families) are considered that have exactly one member in each genome. In this model, a variety of genomic similarity (or distance) measures have been studied and are relatively easy to compute [1, 2, 3, 4]. However, the singleton gene family model is a great oversimplification compared to what we find in nature. Therefore, more general models have been devised where several genes from the same family can exist in one genome. The computation of genomic similarities in these cases is generally much more difficult, though. In fact, many problem variants are NP-hard [5, 6, 7, 8, 9].

Another biological inaccuracy arises from the fact that a gene family assignment is not always without dispute, because orthology is usually not known but just predicted, and most prediction methods require some arbitrary threshold, deciding when two genes belong to the same family and when not. Therefore gene family-free measures have recently been proposed, based on pairwise similarities between genes [10, 11, 12, 13]. While the resulting similarity measures are very promising, their computation is usually not easier than for the family-based models and therefore NP-hard as well [10, 13].

In this paper, we study an intermediate approach between family-based and family-free genomic similarity measures, gene connections. It requires some preprocessing of the genes contained in the genomes under study, but in a less stringent way than in the family-based approach. On the other hand, the resulting data structure is less complex than in the family-free approach, where arbitrary (real-valued) similarities between genes are considered. This intermediate status allows us to achieve results comparable to those for family-free methods, but at time complexities similar to those for the family-based approach.

The paper is structured as follows. We first define three new genome similarity measures based on conserved gene adjacencies (Sect. 2), followed by some pointers to related literature (Sect. 3). Each of the three following sections is then devoted to one of the similarity measures. We show that the first problem can be computed in polynomial time, but is biologically quite simplistic. The second one, while avoiding some of the weaknesses of the first, is NP-hard to compute and can therefore not be applied for genomes of realistic size. The third measure, finally, provides a compromise between biological relevance and computational complexity. In Sect. 7 we compare the results obtained with our similarity measures experimentally, using a large data set of plant (rosid) genomes. The last section concludes the paper.

The implemented algorithms used in this work as well as the studied dataset are available for download from http://bibiserv.cebitec.uni-bielefeld.de/newdist.

2 Basic Definitions

An alphabet is a finite set of characters. A string over an alphabet \(\mathcal {A}\) is a sequence of characters from \(\mathcal {A}\). Given a string S, S[i] refers to the ith character of S and |S| is the length of S, i.e., the number of characters in S. In a signed string S, each character is labeled with a sign, denoted \( sgn _S(i)\) for the character at index position i. A sign is either positive (\(+\)) or negative (\(-\)). In comparative genomics, for example, the signs may indicate the orientations of genes on their genomic sequences, which themselves are represented as strings. Therefore in this paper we use the term gene as a synonym for “signed character” and the term genome as a synonym for “signed string”.

Definition 1

(gene connection graph). Given two genomes S and T, a gene connection graph G(S, T) of S and T is a bipartite graph with one vertex for each gene of S and one vertex for each gene of T. An edge between two vertices, one from S and one from T, indicates that there is some connection between the two genes represented by these vertices.

The term connection in the above definition is not very specific. Depending on the data set and context, connections may be defined based on gene homology, sequence similarity, functional relatedness, or any other similarity measure between genes.

For ease of notation, we let S[i] denote both the ith gene of genome S and the vertex of G representing this gene, and similarly for T[j]. The set of edges of a graph G is denoted by E(G). The size of a graph G is the number of its edges, \(|G| = |E(G)|\). Further, we define a connection function t that returns for an index position i of S the list t(i) of index positions in T that are connected to S[i] by an edge in G(S, T). That is, \(t(i) = [j \mid (i,j) \in E(G(S,T)) \; \text {for} \; 1 \le j \le |T|]\). The function s(j) for an index position of T is defined analogously.
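
As an illustration, the connection functions t and s can be stored as sorted lists of index positions. The following Python sketch assumes that the gene connection graph is given as a list of pairs (i, j) of 1-based index positions; the function name and the toy edge list are hypothetical and not part of the original implementation.

```python
def connection_functions(edges):
    """Return (t, s) for a gene connection graph given as (i, j) pairs.

    Positions are 1-based; i indexes genome S and j indexes genome T.
    """
    t, s = {}, {}
    for i, j in edges:
        t.setdefault(i, []).append(j)
        s.setdefault(j, []).append(i)
    # Keep every list sorted in increasing order of index position.
    for positions in list(t.values()) + list(s.values()):
        positions.sort()
    return t, s


# Hypothetical toy graph with four edges.
toy_edges = [(1, 2), (1, 1), (2, 3), (3, 1)]
t, s = connection_functions(toy_edges)
print(t[1], s[1])  # positions of T connected to S[1], positions of S connected to T[1]
```

Genes without any connection simply do not appear as keys, so `t.get(i, [])` can be used to query arbitrary positions.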

A pair of adjacent index positions \((i,i')\) with \(i' = i+1\) in a string is called an adjacency. Note that this definition of adjacency only considers direct neighborhood of genes (\(i' = i+1\)), while all our following uses of this term refer to an extended definition given by Zhu et al. [14], who introduced generalized gene adjacencies as follows:

Definition 2

(adjacency). Given an integer \(\theta \ge 1\), a pair of index positions \((i,i')\) with \(i < i' \le i+\theta \) in a string is a (\(\theta \)-) adjacency.

In other words, two genes of the same genome form a \(\theta \)-adjacency if the number of genes between them is less than \(\theta \). In the following we will frequently differentiate between simple adjacencies (\(\theta =1\)) and generalized adjacencies (\(\theta \ge 1\)).

As mentioned in the Introduction, in this paper we are interested in defining measures of similarity to compare pairs of genomes. A simple approach is based on their number of conserved adjacencies. Although below we will study different variants of similarities, they all use the following basic notion of conserved adjacency:

Definition 3

(conserved adjacency). Given two genomes S and T and a gene connection graph G(S, T), a pair of adjacencies \((i,i')\) in S and \((j,j')\) in T is called a conserved adjacency, denoted \((i,i' || j,j')\), if one of the following two holds:

  (a) \((i,j) \in E(G(S,T))\), \((i',j') \in E(G(S,T))\), \( sgn _S(i) = sgn _T(j)\) and \( sgn _S(i') = sgn _T(j')\); or

  (b) \((i,j') \in E(G(S,T))\), \((i',j) \in E(G(S,T))\), \( sgn _S(i) \ne sgn _T(j')\) and \( sgn _S(i') \ne sgn _T(j)\).

For an illustration of these definitions, see Fig. 1.
Fig. 1.

Gene connection graph of two genomes \(S = (+a, +b, +c, -d, -e, +f)\) (top row) and \(T = (+t, +u, -v, +w, -x, -y, +z)\) (bottom row). Conserved 2-adjacencies are (1, 2||1, 2), (2, 3||2, 4), (3, 4||4, 6) and (5, 6||5, 7). Note that (2, 3||1, 3), (2, 3||2, 3), (4, 5||6, 7) and (4, 6||5, 6) are not conserved 2-adjacencies because the signs do not match the definition.
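
Definitions 2 and 3 translate directly into a membership test. The sketch below is a minimal Python illustration under stated assumptions: signs are stored in dictionaries keyed by 1-based position, the edge set E holds pairs (i, j), and an adjacency is read as requiring i < i'. The toy genomes and edges are hypothetical; they are not those of Fig. 1, whose edges are not listed in the text.

```python
def is_conserved_adjacency(sgn_S, sgn_T, E, i, i2, j, j2, theta=1):
    """Test whether (i, i2 || j, j2) is a conserved theta-adjacency."""
    # Both pairs must be theta-adjacencies in their own genome.
    if not (i < i2 <= i + theta and j < j2 <= j + theta):
        return False
    # Case (a): parallel edges whose endpoints carry equal signs.
    parallel = ((i, j) in E and (i2, j2) in E
                and sgn_S[i] == sgn_T[j] and sgn_S[i2] == sgn_T[j2])
    # Case (b): crossing edges whose endpoints carry opposite signs.
    crossing = ((i, j2) in E and (i2, j) in E
                and sgn_S[i] != sgn_T[j2] and sgn_S[i2] != sgn_T[j])
    return parallel or crossing


# Hypothetical toy input: S has three genes, T has four genes.
sgn_S = {1: "+", 2: "+", 3: "-"}
sgn_T = {1: "+", 2: "+", 3: "+", 4: "-"}
E = {(1, 1), (2, 2), (2, 4), (3, 3)}

print(is_conserved_adjacency(sgn_S, sgn_T, E, 1, 2, 1, 2))  # case (a) applies
print(is_conserved_adjacency(sgn_S, sgn_T, E, 2, 3, 3, 4))  # case (b) applies
print(is_conserved_adjacency(sgn_S, sgn_T, E, 2, 3, 2, 3))  # signs do not match
```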

We further denote two conserved adjacencies as conflicting if their intervals in either genome are overlapping:

Definition 4

(conflicting conserved adjacencies). Two conserved adjacencies \((i,i' || j,j')\) and \((k,k' || l,l')\) are conflicting if (1) \((i, i' || j, j') \ne (k, k' || l, l')\) and (2) \([i,i'-1] \cap [k,k'-1] \ne \emptyset \) or \([j,j'-1] \cap [l,l'-1] \ne \emptyset \).

Subsequently a set of conserved adjacencies is denoted as non-conflicting if the above-defined property does not hold between any two of its members.

In the example of Fig. 1, (3, 4||4, 6) and (5, 6||5, 7) are the only conflicting conserved adjacencies. All other pairs are non-conflicting.
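
The conflict test of Definition 4 amounts to two interval-overlap checks. The following Python sketch, with hypothetical function and variable names, represents a conserved adjacency \((i,i' || j,j')\) as a 4-tuple and applies the test to the four conserved 2-adjacencies listed in the caption of Fig. 1.

```python
from itertools import combinations


def conflicting(a, b):
    """Definition 4 for two conserved adjacencies given as (i, i2, j, j2)."""
    if a == b:
        return False
    i, i2, j, j2 = a
    k, k2, l, l2 = b
    # Closed integer intervals [x, x2 - 1] and [y, y2 - 1] intersect
    # exactly when max(x, y) <= min(x2 - 1, y2 - 1).
    overlap_S = max(i, k) <= min(i2 - 1, k2 - 1)
    overlap_T = max(j, l) <= min(j2 - 1, l2 - 1)
    return overlap_S or overlap_T


# Conserved 2-adjacencies from the caption of Fig. 1.
fig1 = [(1, 2, 1, 2), (2, 3, 2, 4), (3, 4, 4, 6), (5, 6, 5, 7)]
conflicts = [(a, b) for a, b in combinations(fig1, 2) if conflicting(a, b)]
print(conflicts)
```

The only pair reported is (3, 4||4, 6) and (5, 6||5, 7), whose intervals [4, 5] and [5, 6] share position 5 in T.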

The different similarity measures that we consider in this work are expressed by the following three problem statements:

Problem 1

(total adjacency model). Given two genomes S and T and a gene connection graph G(S, T), count the number of pairs of index positions \((i,i')\) in S and \((j,j')\) in T that form a conserved adjacency. In other words, compute
$$ adj (S,T) = |\{(i,i' || j,j') \mid 1 \le i < i' \le |S| \text { and } 1 \le j < j' \le |T|\}|.$$
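
For very small inputs, \( adj (S,T)\) can be evaluated directly from this formula. The following Python sketch is a brute-force reference under stated assumptions (signs given as strings of '+' and '-', edges as 1-based pairs, hypothetical toy data); it enumerates all pairs of \(\theta \)-adjacencies and is meant for checking faster implementations, not for genomes of realistic size.

```python
def adj_bruteforce(sgn_S, sgn_T, E, theta=1):
    """Count conserved theta-adjacencies by enumerating all candidates."""
    n, m = len(sgn_S), len(sgn_T)
    sS = lambda i: sgn_S[i - 1]  # sign at 1-based position i of S
    sT = lambda j: sgn_T[j - 1]  # sign at 1-based position j of T
    count = 0
    for i in range(1, n + 1):
        for i2 in range(i + 1, min(i + theta, n) + 1):
            for j in range(1, m + 1):
                for j2 in range(j + 1, min(j + theta, m) + 1):
                    parallel = ((i, j) in E and (i2, j2) in E
                                and sS(i) == sT(j) and sS(i2) == sT(j2))
                    crossing = ((i, j2) in E and (i2, j) in E
                                and sS(i) != sT(j2) and sS(i2) != sT(j))
                    if parallel or crossing:
                        count += 1  # each quadruple is counted once
    return count


# Hypothetical toy input.
E = {(1, 1), (2, 2), (2, 4), (3, 3)}
print(adj_bruteforce("++-", "+++-", E, theta=1))
```

On this toy input the two conserved adjacencies are (1, 2||1, 2) and (2, 3||3, 4).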

Because a gene connection graph G(S, T) is not limited to one-to-one connections between genes of genomes S and T, solutions to Problem 1 may not be biologically plausible. Therefore we define a second measure, motivated by the one used in [10, 11], which asks for one-to-one correspondences between genes of S and T in its solutions:

Problem 2

(gene matching model). Given two genomes S and T, a gene connection graph G(S, T) and a real-valued parameter \(\alpha \in [0,1]\), find a bipartite matching M in G(S, T) such that the induced sequences \(S^M\) and \(T^M\) maximize the measure
$$\mathcal {F}_\alpha (M) = \alpha \cdot adj (S^M,T^M) + (1-\alpha ) \cdot edg (M),$$
where \( edg (M) = |M|\) is the size of matching M. (The induced sequences \(S^M\) and \(T^M\) are the subsequences of S and T, respectively, that contain those characters incident to edges of M.)
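
To make the objective concrete, the following Python sketch evaluates \(\mathcal {F}_\alpha (M)\) for a given matching and simple adjacencies. It is an illustration under stated assumptions: M is a set of 1-based pairs (i, j) that already forms a matching, signs are given as strings of '+' and '-', and the connection graph of the induced sequences is taken to be M itself. It only scores a matching; it does not search for an optimal one.

```python
def objective(sgn_S, sgn_T, M, alpha):
    """Evaluate F_alpha(M) for simple adjacencies in the induced sequences."""
    pos_S = sorted(i for i, _ in M)          # positions kept in S^M
    pos_T = sorted(j for _, j in M)          # positions kept in T^M
    rank_T = {j: r for r, j in enumerate(pos_T)}
    partner = dict(M)                        # matched position in T for each i
    adj = 0
    for a, b in zip(pos_S, pos_S[1:]):       # consecutive genes of S^M
        ja, jb = partner[a], partner[b]
        same_a = sgn_S[a - 1] == sgn_T[ja - 1]
        same_b = sgn_S[b - 1] == sgn_T[jb - 1]
        if rank_T[jb] == rank_T[ja] + 1 and same_a and same_b:
            adj += 1                         # case (a): same order, equal signs
        elif rank_T[ja] == rank_T[jb] + 1 and not same_a and not same_b:
            adj += 1                         # case (b): reversed, opposite signs
    return alpha * adj + (1 - alpha) * len(M)


# Hypothetical toy matching on S = (+, +, -) and T = (+, +, +, -).
M = {(1, 1), (2, 4), (3, 3)}
print(objective("++-", "+++-", M, alpha=0.5))
```

Here the induced sequence of T consists of positions 1, 3 and 4, and the only conserved adjacency is formed by the crossing edges (2, 4) and (3, 3).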

As we will see later in this paper, solving Problem 2 is NP-hard even for simple adjacencies. Therefore we define a third, intermediate measure, which is more efficient to compute in practice, while producing one-to-one correspondences between gene extremities. It is defined as the size of the largest subset of non-conflicting conserved adjacencies found in a pair of genomes:

Problem 3

(adjacency matching model). Given two genomes S and T and a gene connection graph G(S, T), let C be the set of conserved adjacencies between S and T. Compute the size \(|C^\star |\) of a maximum cardinality set of non-conflicting conserved adjacencies \(C^\star \subseteq C\).

3 Related Work

As mentioned above, the gene connection graph input format that we propose here is an intermediate between gene families and the family-free model. Indeed, we do not require the gene connection graph to be transitive, which is the main difference to the gene family graph, where vertices are assigned to genes and edges are drawn between genes from different genomes whenever they belong to the same family, thus forming bipartite cliques. (This graph has not been introduced under this name in the literature, but is implicitly mentioned already in [15] and later more explicitly in [10].) On the other end, the gene similarity graph [11] is a weighted version of the gene connection graph, increasing the expression power by its ability to represent different strengths of gene connections.

The only previous use of such an intermediate model in comparative genomics that we are aware of is in the form of indeterminate strings in [12].

Definition 5

(indeterminate string, signed indeterminate string). Given an alphabet \(\mathcal {A}\), a string S over the power set \(\mathcal {P}(\mathcal {A}) {\setminus } \{\emptyset \}\) is called an indeterminate string over \(\mathcal {A}\). In other words, for \(1\le i\le |S|\), \(\emptyset \ne S[i] \subseteq \mathcal {A}\). In a signed indeterminate string S, any index position i has a sign \( sgn _S(i)\), which therefore is the same for all characters at that position.

Given two genomes S and T and a gene connection graph G(S, T), it is easy to create a pair of signed indeterminate strings \(S'\) and \(T'\) over an alphabet \(\mathcal {A}'\) that contain the same set of conserved adjacencies as S and T: For any edge \(e = (S[i],T[j])\) of G(S, T), create one symbol \(e' \in \mathcal {A}'\) and let \(e' \in S'[i]\) and \(e' \in T'[j]\). The signs are just transferred from S and T to \(S'\) and \(T'\), respectively: \( sgn _{S'}(i) = sgn _S(i)\) for all i, \(1 \le i \le |S|\), and \( sgn _{T'}(j) = sgn _T(j)\) for all j, \(1 \le j \le |T|\).

Conversely, given two indeterminate strings \(S'\) and \(T'\), we can easily create sequences S and T and the corresponding gene connection graph with the same set of conserved adjacencies. In order to do this, let \(\mathcal {A} = \{1,2,\ldots ,|S'|, 1', 2', \ldots ,|T'|'\}\), set \(S = sgn _{S'[1]} 1,\ldots , sgn _{S'[|S'|]} |S'|\), \(T = sgn _{T'[1]} 1',\ldots , sgn _{T'[|T'|]} |T'|'\), and create in G(S, T) an edge \(e = (S[i],T[j])\) whenever \(S'[i] \cap T'[j] \not = \emptyset \).

Clearly, all the information about conserved adjacencies between these two representations is identical, while sometimes the graph representation and sometimes the representation as signed indeterminate string is more concise.
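
The two conversions can be sketched in a few lines of Python. The code below is an illustration with hypothetical names: edges are 1-based pairs (i, j), each edge becomes one integer symbol, and signs are omitted because they are carried over unchanged. As an assumption of this sketch that is not discussed in the text, a gene without any edge receives a private symbol so that no position of an indeterminate string is empty.

```python
def to_indeterminate(len_S, len_T, edges):
    """Turn a gene connection graph into two indeterminate strings."""
    S_ind = [set() for _ in range(len_S)]
    T_ind = [set() for _ in range(len_T)]
    for symbol, (i, j) in enumerate(sorted(edges)):
        S_ind[i - 1].add(symbol)   # one fresh symbol per edge,
        T_ind[j - 1].add(symbol)   # shared by both of its endpoints
    # Assumption: isolated genes get a private symbol (positions must be non-empty).
    for idx, chars in enumerate(S_ind):
        if not chars:
            chars.add(("S", idx + 1))
    for idx, chars in enumerate(T_ind):
        if not chars:
            chars.add(("T", idx + 1))
    return S_ind, T_ind


def to_connection_graph(S_ind, T_ind):
    """Create an edge (i, j) whenever positions i and j share a symbol."""
    return {(i + 1, j + 1)
            for i, a in enumerate(S_ind)
            for j, b in enumerate(T_ind) if a & b}


# Hypothetical toy graph; position 5 of T has no connection.
toy_edges = {(1, 1), (2, 2), (2, 4), (3, 3)}
S_ind, T_ind = to_indeterminate(3, 5, toy_edges)
round_trip = to_connection_graph(S_ind, T_ind)
print(round_trip == toy_edges)
```

The quadratic loop in `to_connection_graph` is chosen for brevity; an index from symbols to positions would avoid comparing all pairs.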

Indeterminate strings in [12] were used to identify regions of common gene content (gene clusters) in two genomes, which is important in functional genomics. Here our focus is on conserved adjacencies (which can be seen as small clusters of just two genes) for defining whole-genome similarities. Similar measures are known for singleton gene families as the breakpoint distance [16, 17], have been extended to gene families in [5, 7, 15] and were defined for the family-free model in [10].

4 An Optimal Solution for Problem 1

In order to solve Problem 1, we construct a list L of edges of G(S, T) using connection function t(i) for \(1 \le i \le |S|\). In doing so, we assume that the elements of t(i), \(1 \le i \le |S|\), are sorted in increasing order. If this is not given as input, it can always be achieved by applying counting sort to all lists t(i) in overall \(O(|S| + |T| + |G(S,T)|)\) time, which is proportional to the input size.

We present with Algorithm 1 a solution to Problem 1 for simple adjacencies and subsequently extend this approach for the generalized case. Our algorithm is a simple, linear time procedure which uses three pointers e, \(e'\), \(e''\) into list L. These pointers simultaneously traverse L while reporting any pair of adjacent parallel edges \((e,e')\) or crossing edges \((e,e'')\).

Correctness. Given a pair \((i,j) \in L\), there are overall four cases for the signs of index i in S and index j in T, each with two sub-cases for the signs of index \(i+1\) in S and index \(j+1\) or index \(j-1\) in T, listed in the following.

  (1) If \( sgn _S(i) = +\) and \( sgn _T(j) = +\), then we have a conserved adjacency \((i,i+1 || j,j+1)\) if and only if \((i+1,j+1) \in L\) and either \( sgn _S(i+1) = +\) and \( sgn _T(j+1) = +\) or \( sgn _S(i+1) = -\) and \( sgn _T(j+1) = -\).

  (2) If \( sgn _S(i) = +\) and \( sgn _T(j) = -\), then we have a conserved adjacency \((i,i+1 || j-1,j)\) if and only if \((i+1,j-1) \in L\) and either \( sgn _S(i+1) = +\) and \( sgn _T(j-1) = -\) or \( sgn _S(i+1) = -\) and \( sgn _T(j-1) = +\).

  (3) If \( sgn _S(i) = -\) and \( sgn _T(j) = +\), then we have a conserved adjacency \((i,i+1 || j-1,j)\) if and only if \((i+1,j-1) \in L\) and either \( sgn _S(i+1) = -\) and \( sgn _T(j-1) = +\) or \( sgn _S(i+1) = +\) and \( sgn _T(j-1) = -\).

  (4) If \( sgn _S(i) = -\) and \( sgn _T(j) = -\), then we have a conserved adjacency \((i,i+1 || j,j+1)\) if and only if \((i+1,j+1) \in L\) and either \( sgn _S(i+1) = -\) and \( sgn _T(j+1) = -\) or \( sgn _S(i+1) = +\) and \( sgn _T(j+1) = +\).

Clearly, cases 1 and 4 and cases 2 and 3 can be summarized to the two cases given in Algorithm 1.

Runtime Analysis. The list L has length |G(S, T)| and can be constructed and sorted in linear time \(O(|S| + |T| + |G(S,T)|)\), as discussed above. Each of the three edge pointers e, \(e'\) and \(e''\) traverses L once from the beginning to the end, so that the for loop in lines 3–19 takes O(|L|) time. Therefore the overall running time is \(O(|S| + |T| + |G(S,T)|)\).

Space Analysis. The algorithm needs space only for the two input strings S and T, the list L and some constant-space variables. Therefore the space usage is of order \(O(|S|+|T|+|G(S,T)|)\).

Extension to Generalized Adjacencies. Algorithm 1’ solves Problem 1 for generalized adjacencies. Following the same strategy as Algorithm 1, the extension requires next to the main pointer e additional \(2\theta \) pointers into list L that are denoted \(e'_t\) and \(e''_t\), \(1 \le t \le \theta \). While it traverses through each element (i, j) in the list using pointer e, each pointer \(e'_t\), \(1 \le t \le \theta \), is subsequently increased to point to the smallest element larger than or equal to \((i+t, j+1)\) in L. A copy \(\hat{e}\) of pointer \(e'_t\) is then used to find candidates \((i+t, j+1), \dots , (i+t, j+\theta )\). Likewise, pointers \(e''_t\), \(1 \le t \le \theta \), are incremented to the smallest element larger than or equal to \((i+t,j-\theta )\), whereupon copy \(\hat{e}\) of \(e''_t\) is used to find candidates \((i+t, j-\theta ), \dots , (i+t, j-1)\).

All pointers e, \(e'_t\), and \(e''_t\), \(1 \le t \le \theta \) are continuously increased, thus each traversing L once. Any instance of pointer \(\hat{e}\) visits at most \(\theta \) elements in each iteration, thus leading to an overall running time of \(O(\theta ^2 |G(S, T)|)\). The running time is asymptotically optimal in the sense of worst case analysis, since there can be just as many \(\theta \)-adjacencies in graph G(S, T). Algorithm 1’ requires \(O(\theta + |S| + |T|+\theta ^2|G(S, T)|)\) space.
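
The same counts can be obtained with a simpler, edge-driven scan. The Python sketch below is not the pointer-based procedure described above: it replaces the sorted list and its pointers by a hash set of edges, which gives \(O(\theta ^2 |G(S,T)|)\) expected rather than worst-case time. Signs are assumed to be strings of '+' and '-', edges are 1-based pairs, and all names and the toy data are hypothetical.

```python
def conserved_adjacencies(sgn_S, sgn_T, edges, theta=1):
    """Return the set of conserved theta-adjacencies as tuples (i, i2, j, j2)."""
    E = set(edges)
    found = set()
    for i, j in E:
        same = sgn_S[i - 1] == sgn_T[j - 1]
        for di in range(1, theta + 1):
            for dj in range(1, theta + 1):
                if same:
                    # Parallel partner edge (i + di, j + dj) with equal signs.
                    p = (i + di, j + dj)
                    if p in E and sgn_S[p[0] - 1] == sgn_T[p[1] - 1]:
                        found.add((i, i + di, j, j + dj))
                else:
                    # Crossing partner edge (i + di, j - dj) with opposite signs.
                    p = (i + di, j - dj)
                    if p in E and sgn_S[p[0] - 1] != sgn_T[p[1] - 1]:
                        found.add((i, i + di, j - dj, j))
    return found


# Hypothetical toy input.
toy_edges = [(1, 1), (2, 2), (2, 4), (3, 3)]
result = conserved_adjacencies("++-", "+++-", toy_edges, theta=2)
print(sorted(result))
```

Collecting the results in a set ensures that a pair of adjacencies satisfying both cases of Definition 3 is counted only once, in line with the formula for \( adj (S,T)\).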

5 Complexity of Problem 2

While one may hope that the intermediate status of the gene connection graph between the gene family graph and the gene similarity graph allows more efficient algorithms than for the more complex gene similarity graph, this is not the case for the gene matching model.

Only for \(\alpha =0\), we have \(\mathcal {F}_\alpha (M) = edg (M) = |M|\) and therefore Problem 2 reduces to computing a maximum bipartite matching, which is possible in polynomial time [18]. However, this case is not very interesting because it completely ignores conserved adjacencies and just compares the gene content of the two genomes. All interesting cases are more difficult to solve, as the following theorem shows:

Theorem 1

Problem 2 is NP-hard for \(0 < \alpha \le 1\).

Proof

We will focus on simple adjacencies (\(\theta = 1\)), as this is sufficient to prove Theorem 1. Inspired by the proof of Bryant [5] for the family-based case, we provide a P-reduction from Vertex Cover: Given a graph \(\mathcal {G} = (V,E)\) and an integer \(\lambda \), does there exist a subset \(V' \subseteq V\) such that \(|V'| = \lambda \) and each edge in E is incident to at least one vertex in \(V'\)?

Our reduction transforms an instance of Vertex Cover into an instance of the decision version of Problem 2: Given strings S and T, a gene connection graph G(S, T), a real value \(\alpha \), \(0 < \alpha \le 1\), and a real value \(F \ge 0\), does there exist a bipartite matching M in G(S, T) such that \(\mathcal {F}_\alpha (M) \ge F\)?

Let \(\mathcal {G} = (V,E)\) and \(\lambda \) be an instance of Vertex Cover with \(V = \{v_1,v_2,\ldots ,v_n\}\) and \(E = \{e_1,e_2,\ldots ,e_m\}\). Then we construct an alphabet \(\mathcal {A}\) of size \(2n + 4m + 2\) given by
$$ \mathcal {A} = V \cup \{v_i' \mid v_i \in V\} \cup E \cup \{e_i' \mid e_i \in E\} \cup \{ x_i, x_i' \mid 1 \le i \le m+1\}. $$
The two genomes S and T are constructed as follows:
$$ S = v_1 v_1' v_2 v_2' \ldots v_n v_n' x_1 x_1' e_1 e_1' x_2 x_2' e_2 e_2' x_3 x_3' \ldots x_m x_m' e_m e_m' x_{m+1} x_{m+1}' $$
and
$$ T = x_{m+1} x_{m+1}' x_m x_m' \ldots x_2 x_2' x_1 x_1' v_n \mathcal {E}_n v_n' v_{n-1} \mathcal {E}_{n-1} v_{n-1}' \ldots v_1 \mathcal {E}_1 v_1' $$
where \(\mathcal {E}_i\) is a string of the symbol pairs \(e_j e_j'\) for the edges \(e_j\) that are incident to \(v_i\). The gene connection graph G(S, T) has an edge for each pair of identical symbols S[i] and T[j]. The parameter \(\alpha \) may be chosen arbitrarily within the range \(0 < \alpha \le 1\).
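
The construction can be reproduced mechanically. The following Python sketch builds S, T and the edge set of the gene connection graph from a Vertex Cover instance; symbols are plain strings such as "v1" and "v1'". The function name is hypothetical, and the order of the pairs \(e_j e_j'\) inside each \(\mathcal {E}_i\), which the text leaves open, is assumed here to follow increasing j.

```python
def build_instance(n, graph_edges):
    """Build (S, T, connections) for vertices 1..n and a list of vertex pairs."""
    m = len(graph_edges)
    S = []
    for i in range(1, n + 1):
        S += [f"v{i}", f"v{i}'"]
    for k in range(1, m + 1):
        S += [f"x{k}", f"x{k}'", f"e{k}", f"e{k}'"]
    S += [f"x{m + 1}", f"x{m + 1}'"]

    T = []
    for k in range(m + 1, 0, -1):            # x-pairs in reverse order
        T += [f"x{k}", f"x{k}'"]
    for i in range(n, 0, -1):                # vertex blocks in reverse order
        T.append(f"v{i}")
        for k, (a, b) in enumerate(graph_edges, start=1):
            if i in (a, b):                  # edge e_k is incident to v_i
                T += [f"e{k}", f"e{k}'"]
        T.append(f"v{i}'")

    # One connection for each pair of identical symbols (1-based positions).
    connections = {(p + 1, q + 1)
                   for p, a in enumerate(S)
                   for q, b in enumerate(T) if a == b}
    return S, T, connections


# Hypothetical instance: path graph v1 - v2 - v3.
S, T, connections = build_instance(3, [(1, 2), (2, 3)])
print(len(S), len(T), len(connections))
```

For this path graph, S has 2n + 4m + 2 = 16 symbols, while T is longer because every pair \(e_j e_j'\) occurs once for each of its two end vertices.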

First, we show that among the matchings maximizing the value \(\mathcal {F}_\alpha \) for this problem, there is always at least one which is a maximal matching. Let M be a non-maximal matching in G(S, T) maximizing \(\mathcal {F}_\alpha \) and consider an edge \(\ell \not \in M\) that may be added to M, forming a new matching \(M' = M \cup \{\ell \}\). Clearly, \(\ell \) can dismiss at most two adjacencies of M in \(M'\), so \(adj(M')\ge adj(M)-2\). But in our construction, where the symbols of \(\mathcal {A}\) (except the \(e_i\) and \(e_i'\)) are in reverse order in S relative to T, and furthermore each \(e_i\) and each \(e_i'\) is between \(x_i\) and \(x_{i+1}\) in S, any new edge \(\ell \) added to M can dismiss at most one adjacency: If \(\ell \) is adjacent to a symbol a and the symbol \(a'\) is adjacent to another edge \(\ell '\in M\) (or vice-versa) then \(adj(M')= adj(M)+1\). Moreover, if two partner edges \(\ell ,\ell '\not \in M\) are added to M and thus \(M'=M\cup \{\ell ,\ell '\}\), then \(adj(M') \ge adj(M)\) and \(edg(M')=edg(M)+2\). Therefore \(\mathcal {F}_\alpha (M') > \mathcal {F}_\alpha (M)\) for \(\alpha <1\) and \(\mathcal {F}_\alpha (M')\ge \mathcal {F}_\alpha (M)\) for \(\alpha =1\).

Next, we show that there is a vertex cover of size \(\lambda \) for a graph \(\mathcal {G}\) if and only if Problem 2 has a solution with \(F = \alpha (2m+1+(n-\lambda )) + (1-\alpha ) (2n+4m+2)\). Note that by construction of S, T and G(S, T), conserved adjacencies in a maximal matching are only possible between pairs of the same symbol of \(\mathcal {A}\), i.e. \(v_i v_i'\), \(e_i e_i'\) or \(x_i x_i'\). Therefore we can simplify the notation and represent an adjacency \((i,i'||j,j')\) by the pair of elements in S, \(S[i] S[i']\). Clearly, any maximal matching of G(S, T) has \(|S| = 2n+4m+2\) edges. Moreover, any maximal matching realizes at least the \(2m+1\) conserved adjacencies \(e_i e_i'\) and \(x_i x_i'\). The other possible adjacencies are the \(v_i v_i'\). If there exists a solution with value \(F = \alpha (2m+1+(n-\lambda )) + (1-\alpha )|S|\), then there are at least \(n-\lambda \) adjacencies involving \(v_i v_i'\). These adjacencies are possible if the respective edges of \(\mathcal {G}\) are covered by \(\lambda \) vertices. If we do not have a solution with value F, then \(\mathcal {G}\) does not have a vertex cover of size \(\lambda \).    \(\square \)
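
As a small worked example (a hypothetical instance, not taken from the experiments), consider the path graph with \(n = 3\) vertices and \(m = 2\) edges \(e_1 = \{v_1, v_2\}\) and \(e_2 = \{v_2, v_3\}\), which has the vertex cover \(\{v_2\}\) of size \(\lambda = 1\). Following the argument above, the threshold evaluates to
$$ F = \alpha (2 \cdot 2 + 1 + (3 - 1)) + (1-\alpha ) (2 \cdot 3 + 4 \cdot 2 + 2) = 7\alpha + 16(1-\alpha ) = 16 - 9\alpha . $$
The first term accounts for the \(2m+1 = 5\) adjacencies \(e_i e_i'\) and \(x_i x_i'\) together with the \(n-\lambda = 2\) adjacencies \(v_1 v_1'\) and \(v_3 v_3'\), which become available when both \(e_1 e_1'\) and \(e_2 e_2'\) are matched to their copies in \(\mathcal {E}_2\). The second term accounts for the \(|S| = 16\) edges of a maximal matching.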

To solve Problem 2 for simple adjacencies, we use a method described in [19] that was originally developed for solving the gene family-free variant of Problem 2. It constructs an integer linear program (ILP) similar to program FFAdj-Int described in [10]. It includes a preprocessing algorithm that identifies small components in gene similarity graphs which are part of an optimal solution. This approach enables the computation of optimal solutions for small and medium-sized gene similarity graphs. However, as the method is specifically tailored for gene family-free analysis, it does not perform very efficiently on gene connection graphs, as we will see in Sect. 7. We refer to this ILP and its preprocessing step as Algorithm 2.

We further believe it will be difficult to develop a practical algorithm solving Problem 2 for generalized adjacencies.

6 Computing Exact Solutions for Problem 3

We present a polynomial time algorithm solving Problem 3 for simple adjacencies which makes use of the following graph structure:

Definition 6

(conserved adjacencies graph). Given two genomes S and T and a set \(C = \{(i_1,i'_1 || j_1,j'_1), \ldots , (i_n,i'_n || j_n,j'_n)\}\) of conserved adjacencies between S and T, the conserved adjacencies graph \(A_C(S,T)\) is a bipartite graph with one vertex for each gene adjacency \((i,i')\) of S that occurs in C and one vertex for each gene adjacency \((j,j')\) of T that occurs in C. The edges correspond to the conserved adjacencies in C.

Pseudocode of our algorithm is shown in Algorithm 3. Clearly its running time is dominated by the time to compute a maximum matching in line 3, which in unweighted bipartite graphs with n vertices and m edges is possible in \(O(m \sqrt{n})\) time [18]. In our case \(n \le |S|+|T|-2\) and \(m \le n^2\), therefore Algorithm 3 takes overall \(O((|S|+|T|)^{5/2})\) time.
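
For simple adjacencies, two distinct conserved adjacencies conflict exactly when they share their adjacency in S or their adjacency in T, so a non-conflicting subset corresponds to a matching in the conserved adjacencies graph. The Python sketch below illustrates this idea on a hypothetical input. It deliberately substitutes a plain augmenting-path matching (Kuhn's algorithm, O(nm) time in the worst case) for the Hopcroft–Karp algorithm [18], and its recursion makes it suitable only for small examples.

```python
def max_nonconflicting(conserved):
    """Size of a maximum set of non-conflicting simple conserved adjacencies.

    conserved: iterable of tuples (i, i2, j, j2) with i2 = i + 1, j2 = j + 1.
    """
    # Conserved adjacencies graph: S-adjacency -> list of T-adjacencies.
    nbrs = {}
    for i, i2, j, j2 in conserved:
        nbrs.setdefault((i, i2), []).append((j, j2))
    match_T = {}  # T-adjacency -> S-adjacency currently matched to it

    def augment(u, seen):
        # Try to match S-adjacency u, re-routing earlier choices if needed.
        for v in nbrs[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_T or augment(match_T[v], seen):
                match_T[v] = u
                return True
        return False

    return sum(augment(u, set()) for u in nbrs)


# Hypothetical set C: adjacency (1, 2) of S is conserved with two adjacencies
# of T, and adjacency (2, 3) of S competes for one of them.
C = [(1, 2, 1, 2), (1, 2, 3, 4), (2, 3, 1, 2)]
print(max_nonconflicting(C))
```

In this example the second augmentation re-routes (1, 2) of S from (1, 2) to (3, 4) of T, so that both adjacencies of S can be kept.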

Extension to Generalized Adjacencies. Other than for the first two problems, the properties of Problem 3 change drastically when generalized adjacencies are considered. Because a \(\theta \)-adjacency corresponds to an interval of up to \(\theta +1\) consecutive genes, the intervals of two \(\theta \)-adjacencies for \(\theta \ge 2\) can overlap on more than two genes, or even be contained in one another. The complexity of Problem 3 for \(\theta \ge 2\) remains an open question.

To solve Problem 3 for generalized adjacencies, we propose Algorithm 3’ that follows the same strategy as its counterpart for simple adjacencies. However, while for the latter it was possible to find a maximum subset of non-conflicting \(\theta \)-adjacencies using a maximum matching approach, here we propose an ILP, described in Fig. 2. The ILP makes use of two types of binary variables, \(\mathbf a (i, j)\) for each edge (i, j) in the gene connection graph G(S, T), and \(\mathbf b (i, i', j, j')\) for each \(\theta \)-adjacency \((i, i'|| j, j')\) in \(C_\theta \). We say that a binary variable is saturated if it is assigned value 1. While maximizing the number of saturated \(\mathbf b (.)\) variables (which represents the output of the program), our ILP imposes matching constraints \((\texttt {C.01})\) for the set of edges in selected \(\theta \)-adjacencies. Further constraints \((\texttt {C.02})\) ensure that for each \(\theta \)-adjacency \((i, i'||j, j')\) (a) both edges between its corresponding genes are saturated and (b) no saturated edge is incident to a gene in interval \([i+1, i'-1]\) of genome S (i.e. a possibly empty interval corresponding to all genes between i and \(i'\)) and interval \([j+1, j'-1]\) of genome T, respectively.
Fig. 2.

Integer linear program for finding a maximum subset of non-conflicting conserved adjacencies of a given set \(C_\theta \).

7 Experimental Results

Genomic Dataset. We study genomes of 18 rosid species (see Table 1). Rosids are a prominent subclass of flowering plants to which many agricultural crops also belong. The genomic sequences of the studied species were obtained from Phytozome [20], an online resource of the Joint Genome Institute providing databases and tools for comparative genomics analyses of plant genomes. Most of the studied plant genomes are partially assembled, comprising up to 5,000 scaffolds covering one or more annotated protein coding genes. While the smallest genome in our data set contains roughly 24,500 genes, the largest, with 56,000 genes, contains more than twice as many. Rosids, just like many other plants, met their evolutionary fate through multiple events of whole genome duplication, followed by periods of fractionation, in which many duplicated genes were lost again.

Construction of Gene Connection and Gene Family Graphs. Next to the genomic sequences and gene annotations, Phytozome also provides gene family information in the form of co-orthologous clusters computed by InParanoid [21]. InParanoid follows a seed-based strategy by identifying pairs of orthologous genes (the “seeds”) through reciprocal best BLASTP hits. These are subsequently used to recruit inparalogs, eventually forming groups of co-orthologous genes.
Table 1. The genomic dataset of 18 rosid species used in our experiments.

| Species | Version | # genes | # scaffolds | Reference |
|---|---|---|---|---|
| A. thaliana | TAIR10 | 27,416 | 7 | [22] |
| B. rapa | FPSc v1.3 | 40,492 | 669 | [20] |
| B. stricta | v1.2 | 27,416 | 854 | [20] |
| C. clementina | v1.0 | 24,533 | 94 | [23] |
| C. rubella | v1.0 | 26,521 | 123 | [24] |
| E. grandis | v1.1 | 36,376 | 1,315 | [25] |
| E. salsugineum | v1.0 | 26,351 | 61 | [26] |
| F. vesca | v1.1 | 32,831 | 8 | [27] |
| G. max | Wm82.a2 | 56,044 | 147 | [28] |
| G. raimondii | v2.1 | 37,505 | 133 | [29] |
| L. usitatissimum | v1.0 | 43,471 | 1,028 | [30] |
| M. truncatula | Mt4.0v1 | 50,894 | 1,033 | [31] |
| P. persica | v1.0 | 27,864 | 59 | [32] |
| P. trichocarpa | v3.0 | 41,335 | 379 | [33] |
| P. vulgaris | v1.0 | 27,197 | 91 | [34] |
| R. communis | v0.1 | 31,221 | 4,962 | [35] |
| T. cacao | v1.1 | 29,452 | 99 | [36] |
| V. vinifera | Genoscope.12X | 26,346 | 33 | [37] |

We ran BLASTP on all genes of our dataset using an e-value threshold of \(10^{-5}\) and otherwise default parameter settings. We then constructed gene connection graphs for all 153 genome pairs by establishing edges between vertices whose corresponding genes share reciprocal BLASTP hits. We refer to these graphs as BLASTP GC graphs. Similarly, we constructed pairwise gene family graphs using InParanoid’s homology assignment, which we refer to as InParanoid GF graphs.
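
The edge construction can be sketched as follows. This Python fragment is an illustration only: it assumes that the BLASTP output has already been filtered by e-value and reduced to (query, subject) pairs, and it reads "reciprocal" as requiring a hit in both search directions. The function name and the gene identifiers are hypothetical.

```python
def reciprocal_edges(hits_S_to_T, hits_T_to_S):
    """Edges of a gene connection graph from two lists of (query, subject) hits."""
    forward = set(hits_S_to_T)
    # Flip the second direction so that both sets hold (gene of S, gene of T).
    backward = {(subject, query) for query, subject in hits_T_to_S}
    return forward & backward


# Hypothetical filtered hits between genes a1, a2 of S and b1, b2, b3 of T.
hits_ST = [("a1", "b1"), ("a1", "b2"), ("a2", "b2")]
hits_TS = [("b1", "a1"), ("b2", "a2"), ("b3", "a1")]
gc_edges = reciprocal_edges(hits_ST, hits_TS)
print(sorted(gc_edges))
```

Only the pairs found in both directions are kept, so the one-directional hits a1 to b2 and b3 to a1 do not produce edges.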

Unsurprisingly, the BLASTP GC graphs are much larger in size than the InParanoid GF graphs. We observed average sizes of 150,000 edges for the former, whereas the latter graphs had on average only one fifth of this size. Moreover, only 4 % of edges in InParanoid GF graphs were not contained in their BLASTP GC counterparts. Lacking ground truth of homologies in our dataset, we take a conservative stance by assuming that InParanoid’s homology assignment can be considered true, or, in other words, that it contains only a negligible number of false positives. However, we conclude from a previous study [38], in which InParanoid (as well as all other gene family prediction tools in that study) exhibited a poor recall, that the homology assignment may be incomplete. That being said, we regard the edges of BLASTP GC graphs with suspicion, assuming that many of them lead to false positive homology assignments. We perform subsequent analysis to outline a possible procedure for identifying additional potential homologies that are supported by conservation in gene order in BLASTP GC graphs.

Implementation. All computations were performed on a Linux machine using a single 2.3 GHz CPU. We implemented Algorithms 1, 1’, 3, and 3’ in Python. For Algorithm 2 we used the implementation of [19]. In Algorithm 3, the maximum cardinality matching was computed using an implementation of Hopcroft and Karp’s algorithm [18] provided by the Python-based NetworkX library. The ILPs of Algorithms 2 and 3’ were run using CPLEX, a solver for various types of linear and quadratic programs.
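To illustrate the matching step, the following self-contained sketch computes a maximum cardinality matching in a bipartite graph. It deliberately uses the simpler augmenting-path method (often attributed to Kuhn) rather than Hopcroft and Karp’s algorithm, so it gives the same matching size but with a weaker worst-case running time. The function name and the dictionary-based graph layout are assumptions made for this example and do not reflect the actual implementation. NetworkX exposes Hopcroft and Karp’s algorithm as `hopcroft_karp_matching` in its bipartite module.

```python
def maximum_bipartite_matching(adjacency):
    """Maximum cardinality matching via repeated augmenting-path search.

    adjacency maps each left vertex to an iterable of right vertices
    (hypothetical input layout). Returns a dict that maps every matched
    right vertex to its left partner.
    """
    match_of_right = {}

    def try_augment(left, visited):
        for right in adjacency.get(left, ()):
            if right in visited:
                continue
            visited.add(right)
            # Use a free right vertex, or re-route its current partner.
            if right not in match_of_right or try_augment(
                match_of_right[right], visited
            ):
                match_of_right[right] = left
                return True
        return False

    for left in adjacency:
        try_augment(left, set())
    return match_of_right


if __name__ == "__main__":
    graph = {"a": ["x", "y"], "b": ["x"], "c": ["y"]}
    matching = maximum_bipartite_matching(graph)
    print(len(matching), matching)
```

The search is recursive, so for graphs of the size reported in this section an iterative formulation or a library routine would be preferable.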

Runtimes. The runtimes of Algorithms 1 and 3 are shown in Fig. 3 (left). The runtime analysis was repeated 5 times and is visualized by whisker plots. For each of the 153 BLASTP GC graphs in our dataset, the computation finished in less than 50 CPU seconds. Moreover, our evaluation reveals that enumerating the set of conserved adjacencies in our dataset often requires more time than the subsequent computation of the maximum matching in Algorithm 3. The plot on the right side of Fig. 3 shows that the runtimes of Algorithm 1’ for \(\theta =2,3,4\) increase only moderately for higher values of \(\theta \).

Comparing our methods to the gene family-free approach, an implementation of a heuristic method described in [10] failed to return a result for the gene family-free variant of Problem 2 on the BLASTP GC graph of R. communis and V. vinifera within 36 hours of computation. Surprisingly, running Algorithm 2 with \(\alpha =0.1\) for the same amount of time, we were able to obtain a suboptimal solution for which CPLEX reported an optimality gap of only 1.89 %. Nevertheless, as a reference for comparison with our various models, it would be even more informative to have optimal solutions of these problems. We leave it as an open problem whether it is possible to improve our ILPs in order to achieve this.
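For readers unfamiliar with the term, the optimality gap reported by an ILP solver relates the best known feasible objective value (the incumbent) to the best proven bound. The sketch below follows the relative gap as documented for CPLEX, including its small constant in the denominator. The function name and the numbers in the example are purely illustrative and are not taken from our experiments.

```python
def relative_mip_gap(best_bound, incumbent):
    """Relative gap between the best proven bound and the incumbent value."""
    # The constant 1e-10 guards against division by zero.
    return abs(best_bound - incumbent) / (1e-10 + abs(incumbent))


if __name__ == "__main__":
    # Hypothetical values: incumbent objective 100.0, proven bound 101.89.
    print(f"{100 * relative_mip_gap(101.89, 100.0):.2f} %")
```

A gap of 1.89 % thus means that the optimal objective value can differ from that of the reported solution by at most this fraction.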
Fig. 3. Left: runtimes of Algorithms 1 and 3 for all 153 BLASTP GC graphs of the studied dataset. Right: runtimes of Algorithm 1’ for \(\theta =2,3,4\).

Further, limiting computation time to two hours per graph instance, we were able to compute exact results for Problem 3 and \(\theta =2\) with Algorithm 3’ for all but 19 of the 153 BLASTP GC graphs and all but 16 of the InParanoid GF graphs.

Gene Connection vs. Gene Family Graphs. The overlap between the set of conserved simple adjacencies identified in BLASTP GC graphs and in InParanoid GF graphs is visualized in the left plot of Fig. 4. Overall, 70 % of the conserved adjacencies of the InParanoid GF graphs were also found in the BLASTP GC graphs, whereas the latter contain 90 % more conserved adjacencies than the former. Investigating the high number of InParanoid adjacencies that are missing in BLASTP GC graphs, we discovered that many generalized adjacencies of the former span genes that are connected in their BLASTP GC counterparts and therefore break the surrounding adjacency. However, the mean number of connected intervening genes was only 1.4. In fact, 83 % of the 1-adjacencies of InParanoid GF graphs overlap with 2-adjacencies of BLASTP GC graphs (Fig. 4, right plot).
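The kind of comparison reported here can be illustrated with a simplified sketch. It assumes a single unsigned linear chromosome per genome, unique gene identifiers, and a reading of a conserved \(\theta \)-adjacency as two genes at most \(\theta \) positions apart in one genome whose connected genes are at most \(\theta \) positions apart in the other. This is a simplification made for this example and does not replace the formal definitions given earlier. In particular, positions are taken literally and unconnected genes are not filtered out. All function names are hypothetical.

```python
from collections import defaultdict


def conserved_adjacencies(genome_g, genome_h, edges, theta=1):
    """Enumerate conserved theta-adjacencies as pairs of position pairs."""
    pos_h = {gene: idx for idx, gene in enumerate(genome_h)}
    # For each gene of G, the positions of its connected genes in H.
    partners = defaultdict(list)
    for g, h in edges:
        if h in pos_h:
            partners[g].append(pos_h[h])
    found = set()
    last = len(genome_g) - 1
    for i, g_left in enumerate(genome_g):
        for j in range(i + 1, min(i + theta, last) + 1):
            for k in partners[g_left]:
                for l in partners[genome_g[j]]:
                    # The connected genes must also lie close together in H.
                    if 0 < abs(k - l) <= theta:
                        found.add(((i, j), (min(k, l), max(k, l))))
    return found


def shared_fraction(reference, other):
    """Fraction of the elements of `reference` that also occur in `other`."""
    return len(reference & other) / len(reference) if reference else 0.0


if __name__ == "__main__":
    g, h = ["a", "b", "c"], ["x", "y", "z"]
    connections = {("a", "x"), ("b", "z"), ("c", "y")}
    simple = conserved_adjacencies(g, h, connections, theta=1)
    relaxed = conserved_adjacencies(g, h, connections, theta=2)
    print(len(simple), len(relaxed), shared_fraction(simple, relaxed))
```

In the toy example, relaxing the threshold from \(\theta =1\) to \(\theta =2\) recovers adjacencies that are interrupted by a single intervening gene, which mirrors the effect described above.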
Fig. 4. Overlap of conserved adjacencies between BLASTP GC and InParanoid GF graphs.

Lastly, Fig. 5 visualizes the number of non-conflicting conserved adjacencies in BLASTP GC and InParanoid GF graphs computed for \(\theta =1\) using Algorithm 3 (left plot) and for \(\theta =2\) using Algorithm 3’ (right plot). For the former we observed on average \(42\,\%\) more non-conflicting conserved adjacencies in BLASTP GC graphs than in their InParanoid GF counterparts, whereas for the latter this number dropped to \(32\,\%\). Nevertheless, from \(\theta =1\) to \(\theta =2\) the absolute number of non-conflicting conserved adjacencies increases on average by \(27\,\%\) for BLASTP GC graphs and by \(37\,\%\) for InParanoid GF graphs.
Fig. 5. Numbers of non-conflicting conserved adjacencies in BLASTP GC and InParanoid GF graphs for \(\theta =1\) (left) and \(\theta =2\) (right).

8 Conclusion

We have presented new similarity measures for complete genomes, thereby defining gene connections as an intermediate model of genome similarity representations, between gene families and the gene family-free approach. Our theoretical results, with some problem variants being polynomial and others being NP-hard, show that we are very close to the hardness border of similarity computations between genomes with unrestricted gene content. On the practical side, we could show that the computation of genomic similarities in the gene connection model gives meaningful results and is possible in reasonable time if the measures and algorithms are designed carefully.

A few questions remain open, though. While Problem 3 is polynomial for \(\theta =1\), the complexity for larger values of \(\theta \) is unknown. Moreover, it is always difficult to choose optimal values for parameters like the gap threshold \(\theta \). It will certainly be worthwhile to examine whether parameter estimation methods for generalized adjacencies, such as those developed in [39], can be adapted to the gene connection model.

Various model extensions can also be envisaged. The adjacency matching model (Problem 3) removes inconsistencies from the output of the total adjacencies model (Problem 1) by computing a maximum matching on it. It could be tested whether other criteria to remove genes from the genomes and thus derive consistent sets of conserved adjacencies yield even better results. Moreover, the resulting reduced genomes with conserved adjacencies could be used to predict orthologies between the involved genes, not only to compute genomic similarities.

An alternative objective function for our problem formulations, instead of counting (generalized) gene adjacencies, could be a variant of the summed adjacency disruption number [40], which also allows distinguishing between small and larger distortions in gene order.

Finally, Algorithm 3 can easily be generalized for weighted gene similarities (instead of gene connections). It remains to be evaluated if such a more fine-grained measure in the spirit of a family-free analysis has advantages compared to the unit-cost measures studied in this paper.

Footnotes

  1. A weaker result, namely the NP-hardness of Problem 2 for values of \(\alpha \) between 0 and 1/3, can be found in [19].

  2. The described experiments were performed on data sets of Phytozome v10.3.

Notes

Acknowledgements

The research of LABK and SD is partially supported by FAPERJ and CNPq. This work was performed while JS was on sabbatical as Special Visiting Researcher at UFF in Niterói, Brazil, funded by Ciência sem Fronteiras/CAPES.

References

  1. Sankoff, D.: Edit distance for genome comparison based on non-local operations. In: Apostolico, A., Galil, Z., Manber, U., Crochemore, M. (eds.) CPM 1992. LNCS, vol. 644, pp. 121–135. Springer, Heidelberg (1992)
  2. Hannenhalli, S., Pevzner, P.A.: Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J. ACM 46(1), 1–27 (1999)
  3. Yancopoulos, S., Attie, O., Friedberg, R.: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics 21(16), 3340–3346 (2005)
  4. Bergeron, A., Mixtacki, J., Stoye, J.: A new linear time algorithm to compute the genomic distance via the double cut and join distance. Theor. Comput. Sci. 410(51), 5300–5316 (2009)
  5. Bryant, D.: The complexity of calculating exemplar distances. In: Sankoff, D., Nadeau, J.H. (eds.) Comparative Genomics. Computational Biology Series, vol. 1, pp. 207–211. Kluwer Academic Publishers, London (2000)
  6. Chen, X., Zheng, J., Fu, Z., Nan, P., Zhong, Y., Lonardi, S., Jiang, T.: Assignment of orthologous genes via genome rearrangement. IEEE/ACM Trans. Comput. Biol. Bioinform. 2(4), 302–315 (2005)
  7. Angibaud, S., Fertin, G., Rusu, I., Thevenin, A., Vialette, S.: Efficient tools for computing the number of breakpoints and the number of adjacencies between two genomes with duplicate genes. J. Comput. Biol. 15(8), 1093–1115 (2008)
  8. Bulteau, L., Jiang, M.: Inapproximability of (1,2)-exemplar distance. IEEE/ACM Trans. Comput. Biol. Bioinform. 10(6), 1384–1390 (2012)
  9. Shao, M., Lin, Y., Moret, B.M.E.: An exact algorithm to compute the double-cut-and-join distance for genomes with duplicate genes. J. Comput. Biol. 22(5), 425–435 (2015)
  10. Doerr, D., Thévenin, A., Stoye, J.: Gene family assignment-free comparative genomics. BMC Bioinform. 13(Suppl. 19), S3 (2012)
  11. Braga, M.D.V., Chauve, C., Doerr, D., Jahn, K., Stoye, J., Thévenin, A., Wittler, R.: The potential of family-free genome comparison. In: Chauve, C., El-Mabrouk, N., Tannier, E. (eds.) Models and Algorithms for Genome Evolution. Computational Biology Series, vol. 19, pp. 287–307. Springer, London (2013)
  12. Doerr, D., Stoye, J., Böcker, S., Jahn, K.: Identifying gene clusters by discovering common intervals in indeterminate strings. BMC Bioinform. 15(Suppl. 6), S2 (2014)
  13. Martinez, F.V., Feijão, P., Braga, M.D.V., Stoye, J.: On the family-free DCJ distance and similarity. Algorithms Mol. Biol. 10, 13 (2015)
  14. Zhu, Q., Adam, Z., Choi, V., Sankoff, D.: Generalized gene adjacencies, graph bandwidth, and clusters in yeast evolution. IEEE/ACM Trans. Comput. Biol. Bioinform. 6(2), 213–220 (2009)
  15. Sankoff, D.: Genome rearrangement with gene families. Bioinformatics 15(11), 909–917 (1999)
  16. Blanchette, M., Kunisawa, T., Sankoff, D.: Gene order breakpoint evidence in animal mitochondrial phylogeny. J. Mol. Evol. 49(2), 193–203 (1999)
  17. Tannier, E., Zheng, C., Sankoff, D.: Multichromosomal median and halving problems under different genomic distances. BMC Bioinform. 10, 120 (2009)
  18. Hopcroft, J.E., Karp, R.M.: An \(n^{5/2}\) algorithm for maximum matchings in bipartite graphs. SIAM J. Comput. 2(4), 225–231 (1973)
  19. Doerr, D.: Gene family-free genome comparison. Ph.D. thesis, Faculty of Technology, Bielefeld University, Germany (2015)
  20. Goodstein, D.M., Shu, S., Howson, R., Neupane, R., Hayes, R.D., Fazo, J., Mitros, T., Dirks, W., Hellsten, U., Putnam, N., Rokhsar, D.S.: Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 40(Database issue), D1178–D1186 (2012)
  21. Sonnhammer, E.L.L., Östlund, G.: Inparanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res. 43(Database issue), D234–D239 (2015)
  22. Lamesch, P., Berardini, T.Z., Li, D., Swarbreck, D., Wilks, C., Sasidharan, R., Muller, R., Dreher, K., Alexander, D.L., Garcia-Hernandez, M., Karthikeyan, A.S., Lee, C.H., Nelson, W.D., Ploetz, L., Singh, S., Wensel, A., Huala, E.: The arabidopsis information resource (tair): improved gene annotation and new tools. Nucleic Acids Res. 40(Database issue), D1202–D1210 (2011)
  23. Wu, G.A., Prochnik, S., Jenkins, J., Salse, J., Hellsten, U., Murat, F., Perrier, X., Ruiz, M., Scalabrin, S., Terol, J., Takita, M.A., Labadie, K., Poulain, J., Couloux, A., Jabbari, K., Cattonaro, F., Del Fabbro, C., Pinosio, S., Zuccolo, A., Chapman, J., Grimwood, J., Tadeo, F.R., Estornell, L.H., Muñoz-Sanz, J.V., Ibanez, V., Herrero-Ortega, A., Aleza, P., Pérez-Pérez, J., Ramón, D., Brunel, D., Luro, F., Chen, C., Farmerie, W.G., Desany, B., Kodira, C., Mohiuddin, M., Harkins, T., Fredrikson, K., Burns, P., Lomsadze, A., Mark, B., Reforgiato, G., Freitas-Astúa, J., Quetier, F., Navarro, L., Roose, M., Wincker, P., Schmutz, J., Morgante, M., Machado, M.A., Talón, M., Jaillon, O., Ollitrault, P., Gmitter, F., Rokhsar, D.: Sequencing of diverse mandarin, pummelo and orange genomes reveals complex history of admixture during citrus domestication. Nat. Biotechnol. 32(7), 656–662 (2014)
  24. Slotte, T., Hazzouri, K.M., Ågren, J.A., Koenig, D., Maumus, F., Guo, Y.-L., Steige, K., Platts, A.E., Escobar, J.S., Newman, L.K., Wang, W., Mandáková, T., Vello, E., Smith, L.M., Henz, S.R., Steffen, J., Takuno, S., Brandvain, Y., Coop, G., Andolfatto, P., Hu, T.T., Blanchette, M., Clark, R.M., Quesneville, H., Nordborg, M., Gaut, B.S., Lysak, M.A., Jenkins, J., Grimwood, J., Chapman, J., Prochnik, S., Shu, S., Rokhsar, D., Schmutz, J., Weigel, D., Wright, S.I.: The Capsella rubella genome and the genomic consequences of rapid mating system evolution. Nat. Genet. 45(7), 831–835 (2013)
  25. Bartholomé, J., Mandrou, E., Mabiala, A., Jenkins, J., Nabihoudine, I., Klopp, C., Schmutz, J., Plomion, C., Gion, J.-M.: High-resolution genetic maps of eucalyptus improve Eucalyptus grandis genome assembly. New Phytol 206(4), 1283–1296 (2015)
  26. Yang, R., Jarvis, D.E., Chen, H., Beilstein, M.A., Grimwood, J., Jenkins, J., Shu, S., Prochnik, S., Xin, M., Ma, C., Schmutz, J., Wing, R.A., Mitchell-Olds, T., Schumaker, K.S., Wang, X.: The reference genome of the halophytic plant Eutrema salsugineum. Front. Plant Sci. 4, 46 (2013)
  27. Shulaev, V., Sargent, D.J., Crowhurst, R.N., Mockler, T.C., Folkerts, O., Delcher, A.L., Jaiswal, P., Mockaitis, K., Liston, A., Mane, S.P., Burns, P., Davis, T.M., Slovin, J.P., Bassil, N., Hellens, R.P., Evans, C., Harkins, T., Kodira, C., Desany, B., Crasta, O.R., Jensen, R.V., Allan, A.C., Michael, T.P., Setubal, J.C., Celton, J.-M., Rees, D.J.G., Williams, K.P., Holt, S.H., Rojas, J.J.R., Chatterjee, M., Liu, B., Silva, H., Meisel, L., Adato, A., Filichkin, S.A., Troggio, M., Viola, R., Ashman, T.-L., Wang, H., Dharmawardhana, P., Elser, J., Raja, R., Priest, H.D., Bryant, D.W., Fox, S.E., Givan, S.A., Wilhelm, L.J., Naithani, S., Christoffels, A., Salama, D.Y., Carter, J., Girona, E.L., Zdepski, A., Wang, W., Kerstetter, R.A., Schwab, W., Korban, S.S., Davik, J., Monfort, A., Denoyes-Rothan, B., Arus, P., Mittler, R., Flinn, B., Aharoni, A., Bennetzen, J.L., Salzberg, S.L., Dickerman, A.W., Velasco, R., Borodovsky, M., Veilleux, R.E., Folta, K.M.: The genome of woodland strawberry (Fragaria vesca). Nat. Genet. 43(2), 109–116 (2011)
  28. Schmutz, J., Cannon, S.B., Schlueter, J., Ma, J., Mitros, T., Nelson, W., Hyten, D.L., Song, Q., Thelen, J.J., Cheng, J., Xu, D., Hellsten, U., May, G.D., Yu, Y., Sakurai, T., Umezawa, T., Bhattacharyya, M.K., Sandhu, D., Valliyodan, B., Lindquist, E., Peto, M., Grant, D., Shu, S., Goodstein, D., Barry, K., Futrell-Griggs, M., Abernathy, B., Du, J., Tian, Z., Zhu, L., Gill, N., Joshi, T., Libault, M., Sethuraman, A., Zhang, X.-C., Shinozaki, K., Nguyen, H.T., Wing, R.A., Cregan, P., Specht, J., Grimwood, J., Rokhsar, D., Stacey, G., Shoemaker, R.C., Jackson, S.A.: Genome sequence of the palaeopolyploid soybean. Nature 463(7278), 178–183 (2010)
  29. Paterson, A.H., Wendel, J.F., Gundlach, H., Guo, H., Jenkins, J., Jin, D., Llewellyn, D., Showmaker, K.C., Shu, S., Udall, J., Yoo, M.-J., Byers, R., Chen, W., Doron-Faigenboim, A., Duke, M.V., Gong, L., Grimwood, J., Grover, C., Grupp, K., Hu, G., Lee, T.-H., Li, J., Lin, L., Liu, T., Marler, B.S., Page, J.T., Roberts, A.W., Romanel, E., Sanders, W.S., Szadkowski, E., Tan, X., Tang, H., Xu, C., Wang, J., Wang, Z., Zhang, D., Zhang, L., Ashrafi, H., Bedon, F., Bowers, J.E., Brubaker, C.L., Chee, P.W., Das, S., Gingle, A.R., Haigler, C.H., Harker, D., Hoffmann, L.V., Hovav, R., Jones, D.C., Lemke, C., Mansoor, S., Rahman, M.U., Rainville, L.N., Rambani, A., Reddy, U.K., Rong, J.-K., Saranga, Y., Scheffler, B.E., Scheffler, J.A., Stelly, D.M., Triplett, B.A., Van Deynze, A., Vaslin, M.F.S., Waghmare, V.N., Walford, S.A., Wright, R.J., Zaki, E.A., Zhang, T., Dennis, E.S., Mayer, K.F.X., Peterson, D.G., Rokhsar, D.S., Wang, X., Schmutz, J.: Repeated polyploidization of gossypium genomes and the evolution of spinnable cotton fibres. Nature 492(7429), 423–427 (2012)
  30. Wang, Z., Hobson, N., Galindo, L., Zhu, S., Shi, D., McDill, J., Yang, L., Hawkins, S., Neutelings, G., Datla, R., Lambert, G., Galbraith, D.W., Grassa, C.J., Geraldes, A., Cronk, Q.C., Cullis, C., Dash, P.K., Kumar, P.A., Cloutier, S., Sharpe, A.G., Wong, G.K.S., Wang, J., Deyholos, M.K.: The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads. Plant J. 72(3), 461–473 (2012)
  31. Young, N.D., Debellé, F., Oldroyd, G.E.D., Geurts, R., Cannon, S.B., Udvardi, M.K., Benedito, V.A., Mayer, K.F.X., Gouzy, J., Schoof, H., Van de Peer, Y., Proost, S., Cook, D.R., Meyers, B.C., Spannagl, M., Cheung, F., De Mita, S., Krishnakumar, V., Gundlach, H., Zhou, S., Mudge, J., Bharti, A.K., Murray, J.D., Naoumkina, M.A., Rosen, B., Silverstein, K.A.T., Tang, H., Rombauts, S., Zhao, P.X., Zhou, P., Barbe, V., Bardou, P., Bechner, M., Bellec, A., Berger, A., Bergès, H., Bidwell, S., Bisseling, T., Choisne, N., Couloux, A., Denny, R., Deshpande, S., Dai, X., Doyle, J.J., Dudez, A.-M., Farmer, A.D., Fouteau, S., Franken, C., Gibelin, C., Gish, J., Goldstein, S., González, A.J., Green, P.J., Hallab, A., Hartog, M., Hua, A., Humphray, S.J., Jeong, D.-H., Jing, Y., Jöcker, A., Kenton, S.M., Kim, D.-J., Klee, K., Lai, H., Lang, C., Lin, S., Macmil, S.L., Magdelenat, G., Matthews, L., McCorrison, J., Monaghan, E.L., Mun, J.-H., Najar, F.Z., Nicholson, C., Noirot, C., O’Bleness, M., Paule, C.R., Poulain, J., Prion, F., Qin, B., Qu, C., Retzel, E.F., Riddle, C., Sallet, E., Samain, S., Samson, N., Sanders, I., Saurat, O., Scarpelli, C., Schiex, T., Segurens, B., Severin, A.J., Sherrier, D.J., Shi, R., Sims, S., Singer, S.R., Sinharoy, S., Sterck, L., Viollet, A., Wang, B.-B., Wang, K., Wang, M., Wang, X., Warfsmann, J., Weissenbach, J., White, D.D., White, J.D., Wiley, G.B., Wincker, P., Xing, Y., Yang, L., Yao, Z., Ying, F., Zhai, J., Zhou, L., Zuber, A., Dénarié, J., Dixon, R.A., May, G.D., Schwartz, D.C., Rogers, J., Quetier, F., Town, C.D., Roe, B.A.: The medicago genome provides insight into the evolution of rhizobial symbioses. Nature 480(7378), 520–524 (2011)
  32. Verde, I., Abbott, A.G., Scalabrin, S., Jung, S., Shu, S., Marroni, F., Zhebentyayeva, T., Dettori, M.T., Grimwood, J., Cattonaro, F., Zuccolo, A., Rossini, L., Jenkins, J., Vendramin, E., Meisel, L.A., Decroocq, V., Sosinski, B., Prochnik, S., Mitros, T., Policriti, A., Cipriani, G., Dondini, L., Ficklin, S., Goodstein, D.M., Xuan, P., Del Fabbro, C., Aramini, V., Copetti, D., Gonzalez, S., Horner, D.S., Falchi, R., Lucas, S., Mica, E., Maldonado, J., Lazzari, B., Bielenberg, D., Pirona, R., Miculan, M., Barakat, A., Testolin, R., Stella, A., Tartarini, S., Tonutti, P., Arus, P., Orellana, A., Wells, C., Main, D., Vizzotto, G., Silva, H., Salamini, F., Schmutz, J., Morgante, M., Rokhsar, D.S.: The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat. Genet. 45(5), 487–494 (2013)
  33. Du, Q., Wang, L., Yang, X., Gong, C., Zhang, D.: Populus endo-\(\beta \)-1,4-glucanases gene family: genomic organization, phylogenetic analysis, expression profiles and association mapping. Planta 241(6), 1417–1434 (2015)
  34. Schmutz, J., McClean, P.E., Mamidi, S., Wu, G.A., Cannon, S.B., Grimwood, J., Jenkins, J., Shu, S., Song, Q., Chavarro, C., Torres-Torres, M., Geffroy, V., Moghaddam, S.M., Gao, D., Abernathy, B., Barry, K., Blair, M., Brick, M.A., Chovatia, M., Gepts, P., Goodstein, D.M., Gonzales, M., Hellsten, U., Hyten, D.L., Jia, G., Kelly, J.D., Kudrna, D., Lee, R., Richard, M.M.S., Miklas, P.N., Osorno, J.M., Rodrigues, J., Thareau, V., Urrea, C.A., Wang, M., Yu, Y., Zhang, M., Wing, R.A., Cregan, P.B., Rokhsar, D.S., Jackson, S.A.: A reference genome for common bean and genome-wide analysis of dual domestications. Nat. Genet. 46(7), 707–713 (2014)
  35. Chan, A.P., Crabtree, J., Zhao, Q., Lorenzi, H., Orvis, J., Puiu, D., Melake-Berhan, A., Jones, K.M., Redman, J., Chen, G., Cahoon, E.B., Gedil, M., Stanke, M., Haas, B.J., Wortman, J.R., Fraser-Liggett, C.M., Ravel, J., Rabinowicz, P.D.: Draft genome sequence of the oilseed species Ricinus communis. Nat. Biotechnol. 28(9), 951–956 (2010)
  36. Motamayor, J.C., Mockaitis, K., Schmutz, J., Haiminen, N., Livingstone, D., Cornejo, O., Findley, S.D., Zheng, P., Utro, F., Royaert, S., Saski, C., Jenkins, J., Podicheti, R., Zhao, M., Scheffler, B.E., Stack, J.C., Feltus, F.A., Mustiga, G.M., Amores, F., Phillips, W., Marelli, J.P., May, G.D., Shapiro, H., Ma, J., Bustamante, C.D., Schnell, R.J., Main, D., Gilbert, D., Parida, L., Kuhn, D.N.: The genome sequence of the most widely cultivated cacao type and its use to identify candidate genes regulating pod color. Genome Biol. 14(6), r53 (2012)
  37. Jaillon, O., Aury, J.-M., Noel, B., Policriti, A., Clepet, C., Casagrande, A., Choisne, N., Aubourg, S., Vitulo, N., Jubin, C., Vezzi, A., Legeai, F., Hugueney, P., Dasilva, C., Horner, D., Mica, E., Jublot, D., Poulain, J., Bruyère, C., Billault, A., Segurens, B., Gouyvenoux, M., Ugarte, E., Cattonaro, F., Anthouard, V., Vico, V., Del Fabbro, C., Alaux, M., Di Gaspero, G., Dumas, V., Felice, N., Paillard, S., Juman, I., Moroldo, M., Scalabrin, S., Canaguier, A., Le Clainche, I., Malacrida, G., Durand, E., Pesole, G., Laucou, V., Chatelet, P., Merdinoglu, D., Delledonne, M., Pezzotti, M., Lecharny, A., Scarpelli, C., Artiguenave, F., Pè, M.E., Valle, G., Morgante, M., Caboche, M., Adam-Blondon, A.-F., Weissenbach, J., Quetier, F., Wincker, P.: The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449(7161), 463–467 (2007)
  38. Lechner, M., Hernandez-Rosales, M., Doerr, D., Wieseke, N., Thévenin, A., Stoye, J., Hartmann, R.K., Prohaska, S.J., Stadler, P.F.: Orthology detection combining clustering and synteny for very large datasets. PLoS ONE 9(8), e10515 (2014)
  39. Yang, Z., Sankoff, D.: Natural parameter values for generalized gene adjacencies. J. Comput. Biol. 17(9), 1113–1128 (2010)
  40. Delgado, J., Lynce, I., Manquinho, V.: Computing the summed adjacency disruption number between two genomes with duplicate genes. J. Comput. Biol. 17(9), 1243–1265 (2010)

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Luis Antonio B. Kowada (1)
  • Daniel Doerr (2)
  • Simone Dantas (1)
  • Jens Stoye (1, 3)

  1. Universidade Federal Fluminense, Niterói, Brazil
  2. École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
  3. Faculty of Technology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
