Introduction

The classical approach of phylogenetic categorization relies on the analysis of rRNA sequences as introduced by C. Woese (Woese and Fox 1977). However, if the sequences are too similar, it is not possible to determine an evolutionary relationship precisely. This is frequently the case when studying closely related species. For these applications, genome-based phylogenies are superior: the number of mutations separating species will increase with the number of genes analyzed. In addition, methods exploiting a larger number of genes are less affected by horizontal gene transfer (HGT), variable mutation rates, or misalignments (Snel et al. 1999; Fitz-Gibbon and House 1999). For these reasons, phylogenomic methods that use a large set of sequences have become the de facto standard for reconstructing phylogenies (Ciccarelli et al. 2006; Daubin et al. 2002), especially for closely related species (Oshima and Nishida 2007). The algorithms for genome-based phylogeny can be grouped according to their concepts. There are methods that compare genomic DNA sequences on the whole and methods that evaluate gene content or gene order. So far, sequence methods have been used most frequently. In this case, genomes are compared pairwise at the DNA level (Kurtz et al. 2004; Darling et al. 2004). These methods can be extended to construct phylogenetic trees (Henz et al. 2005).

The classical gene content methods are, compared to the above, more complex and require several steps. First, orthology of genes has to be determined. Then the occurrence or absence of genes has to be evaluated and used to infer a phylogenetic tree (see Snel et al. 1999; Bapteste et al. 2004; Lin et al. 2009). Alternatively, the sequences of a small set of genes (Ciccarelli et al. 2006; Konstantinidis et al. 2006) or of a core genome can be concatenated prior to phylogenetic analysis. It has been shown that 20 genes were sufficient for a phylogenetic analysis of eight yeast taxa (Rokas et al. 2003). However, these approaches have been criticized: For a set of 205 γ-proteobacterial core genes it has been demonstrated that their history is unknown in many cases and that these genes rarely favor one phylogenetic tree (Susko et al. 2006; Bapteste et al. 2008).

In addition to content, gene order can be utilized to compare genomes. Quite sophisticated theoretical concepts have been developed for the assessment of genomic rearrangements (Sankoff 1992; Hannenhalli et al. 1995) and implemented for their analysis (Tesler 2002; Dalevi and Eriksen 2008). However, to the best of our knowledge, the only approach used so far for comparison of complete microbial genomes is the SHOT server, which exploits the occurrence of gene pairs (Korbel et al. 2002).

One goal of whole-genome comparison is the determination of the true evolutionary distance, i.e., the actual number of mutational events separating two genomes. Unfortunately, this distance cannot be inferred. As a substitute, an edit distance may be computed. The edit distance is the minimal number of evolutionary events selected from a predefined set of operations that transform one genome into the other one. However, even the computation of an edit distance is known to be NP-hard for genomes with unequal content (Xin et al. 2005). Definitely, this time complexity is a severe hindrance for using exact methods to analyze native genomes. Exact methods suffer from a second restriction: these algorithms consider the problem of determining the identity of genes as being solved. As a prerequisite, each gene has to be labeled with a number indicating the orthology class it belongs to. If sequence comparison is the basis to identify orthologues, it may fail: due to gene duplication, several paralogues may exist. In these cases, sequence similarity is no clear indicator of evolutionary relationship.

In addition, it has been shown to be plausible that HGT is a major force shaping the content of microbial genomes (see, e.g., Ochman et al. 2000 and references therein). Lawrence and Ochman (1997) proposed that at least 15% of the E. coli genome is atypical and may have arisen by recent transfer events. It has been concluded that 25% of the Thermotoga maritima genes are more closely related to archaeal genes and may signal gene transfer between these lineages (Nelson et al. 1999). For M. mazei it has been postulated that up to 30% of the genome may have been acquired via HGT (Deppenmeier et al. 2002). However, the extent and the long-term impact of HGT on individual genomes are still a matter of debate (see, e.g., Kurland et al. 2003).

It has been shown that genes may be replaced in situ with nonorthologous ones (Omelchenko et al. 2003). In these cases, the function of the gene product remains the same, which may not be detectable when comparing sequences. How such an event must be assessed with respect to phylogenetic analysis is debatable.

Due to these arguments, we introduce and apply a new paradigm for genome comparison: we consider genomes as a set of gene series implementing certain functions. The algorithm, which we named GO4genome, is based on the pairwise comparison of gene function and gene order. For assessment of function, it does not consider homology, i.e., evolutionary relationship. Instead, the algorithm utilizes gene ontology. For comparison of gene order, we introduce a heuristic approach. The algorithm identifies the longest series of genes possessing the most similar function. The number and length of these series are then used to compute pairwise genomic distances, which are the basis for phylogenetic inference. Thus, GO4genome comprises additional events like genomic rearrangements for comparison of genomes, which are beyond those exploited by methods of molecular phylogeny. We demonstrate for several groups of microbes that the inferred phylogenetic relationship is, in most cases, in agreement with the outcome of classical methods. A novel grouping of species was observed, e.g., in rapidly evolving genomes like those of Yersinia pestis strains or in Shigella.

Materials and Methods

Datasets

For all analyses, entries downloaded from the Genome Reviews database (http://www.ebi.ac.uk/GenomeReviews/) were utilized; it provides comprehensively annotated data, including gene ontology (GO) terms. The following datasets were used (accession numbers in parentheses).

Escherichia coli Dataset

The E. coli dataset comprised E. coli EDL933 (AE005174_GR.gbk), E. coli K-12 (U00096_GR.gbk), E. coli Sakai (BA000007_GR.gbk), E. coli UTI89 (CP000243_GR.gbk), E. coli O1K1 / APEC (CP000468_GR.gbk), E. coli CFT073 (AE014075_GR.gbk), E. coli 536 (CP000247_GR.gbk), S. boydii strain Sb227 (CP000036_GR.gbk), S. dysenteriae strain Sd197 (CP000034_GR.gbk), S. flexneri ATCC 700930 (AE014073_GR.gbk), S. flexneri strain 301 (AE005674_GR.gbk), S. flexneri strain 8401 (CP000266_GR.gbk), S. sonnei strain Ss046 (CP000038_GR.gbk), S. typhimurium LT2 (AE006468_GR.gbk), B. aphidicola (BA000003_GR.gbk), R. conorii (AE006914_GR.gbk), R. prowazekii (AJ235269_GR.gbk), Y. pestis Antiqua (CP000308_GR.gb), and A. pernix (BA000002_GR.gbk).

Streptococcus Dataset

The Streptococcus dataset included S. agalactiae III (AL732656_GR.gbk), S. agalactiae Ia (CP000114_GR.gbk), S. agalactiae V (AE009948_GR.gbk), S. mutans (AE014133_GR.gbk), S. pneumoniae NCTC7466 (CP000410_GR.gbk), S. pneumoniae R6 (AE007317_GR.gbk), S. pneumoniae TIGR4 (AE005672_GR.gbk), S. pyogenes M12 MGAS2096 (CP000261_GR.gbk), S. pyogenes M12 MGAS9429 (CP000259_GR.gbk), S. pyogenes M2 MGAS10270 (CP000260_GR.gbk), S. pyogenes M4 MGAS10750 (CP000262_GR.gbk), S. pyogenes M5 (AM295007_GR.gbk), S. pyogenes M1 MGAS5005 (CP000017_GR.gbk), S. pyogenes M1 ATCC700294 (AE004092_GR.gbk), S. pyogenes M18 MGAS8232 (AE009949_GR.gbk), S. pyogenes M28 MGAS6180 (CP000056_GR.gbk), S. pyogenes M3 MGAS315 (AE014074_GR.gbk), S. pyogenes M3 SSI_1 (BA000034_GR.gbk), S. pyogenes M6 MGAS10394 (CP000003_GR.gbk), S. sanguinis (CP000387_GR.gbk), S. suis 05ZYH33 (CP000407_GR.gbk), S. suis 98HAH33 (CP000408_GR.gbk), S. thermophilus LMG18311 (CP000023_GR.gbk), S. thermophilus LMD9 (CP000419_GR.gbk), and S. thermophilus CNRZ1066 (CP000024_GR.gbk).

Methanosarcina Dataset

The Methanosarcina dataset comprised M. mazei (AE008384_GR.gbk), M. barkeri (CP000099_GR.gbk), M. acetivorans (AE010299_GR.gbk), M. thermophila (CP000477_GR.gbk), M. hungatei (CP000254_GR.gbk), M. marisnigri (CP000562_GR.gbk), M. labreanum (CP000559_GR.gbk), T. acidophilum (AL139299_GR.gbk), T. volcanium (BA000011_GR.gbk), P. horikoshii (BA000001_GR.gbk), P. abyssi (AL096836_GR.gbk), and P. furiosus (AE009950_GR.gbk).

Yersinia Dataset

The Yersinia dataset included Y. enterocolitica (AM286415_GR.gbk), Y. pestis Antiqua (CP000308_GR.gb), Y. pestis Nepal516 (CP000305_GR.gbk), Y. pestis Mediaevalis 91001 (AE017042_GR.gbk), Y. pestis Orientalis CO-92 (AL590842_GR.gbk), Y. pestis Pestoides F (CP000668_GR.gbk), Y. pestis Mediaevalis KIM5 (AE009952_GR.gbk), and Y. pseudotuberculosis (BX936398_GR.gbk).

Computing funSim Values for Genomes

For each genome, a file was created containing, in multiple FASTA format, GO terms for each gene separated according to the three GO categories “cellular component,” “biological process” (BP), and “molecular function” (MF). The program funSim (Schlicker et al. 2006) version 1.0 was used to compare genomes pairwise. The score was deduced from the categories BP and MF. The output of funSim is a distance matrix storing for each pair of genes a_i, b_j the value funSim(a_i, b_j). For each pair of genomes Gk, Gl belonging to a dataset under study, such a matrix (GkGl_matrix) was computed.

Computing Phylogenetic Distance Matrices by Means of GO4genome

According to the dataset G1…Gn to be analyzed, GO4genome reads the respective GkGl_matrices and computes for each pair of genomes a Dist_GO value as described under Results and according to Formula (6). The set of Dist_GO values is written to a file in Nexus format (Maddison et al. 1997). The source code for the generation of Nexus-formatted distance matrices and the yersiniae dataset can be downloaded from http://www-bioinf.uni-regensburg.de.
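As an illustration of this output step, the following minimal sketch writes a symmetric distance matrix as a Nexus document with a TAXA and a DISTANCES block. The function name `write_nexus_distances`, the six-digit number formatting, and the choice of a full (`TRIANGLE=BOTH`) matrix are assumptions made for this example; they are not taken from the GO4genome source code, and the keywords accepted by a particular SplitsTree4 version should be checked against its manual.

```python
def write_nexus_distances(labels, dist):
    """Return a Nexus document holding a TAXA and a DISTANCES block.

    labels: taxon names without whitespace (assumption of this sketch).
    dist:   full n x n matrix of pairwise distances, e.g. Dist_GO values.
    """
    n = len(labels)
    if len(dist) != n or any(len(row) != n for row in dist):
        raise ValueError("dist must be an n x n matrix matching labels")
    lines = [
        "#NEXUS",
        "",
        "BEGIN TAXA;",
        f"  DIMENSIONS NTAX={n};",
        "  TAXLABELS " + " ".join(labels) + ";",
        "END;",
        "",
        "BEGIN DISTANCES;",
        f"  DIMENSIONS NTAX={n};",
        "  FORMAT TRIANGLE=BOTH DIAGONAL LABELS;",
        "  MATRIX",
    ]
    # One labelled row per genome, diagonal included.
    for label, row in zip(labels, dist):
        lines.append("  " + label + " " + " ".join(f"{d:.6f}" for d in row))
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"
```

The returned string can be written to a file and opened in a program that reads Nexus distance blocks.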

Creating Neighbor Nets

For the visualization of results, we utilized the program SplitsTree4 (version 4.8) (Huson and Bryant 2006). The output of GO4genome was fed into SplitsTree4. Neighbor nets were created by using default parameters.

Results

Toward a Novel Algorithm of Genome Comparison Based on Gene Ontology Annotations

As it was our aim to develop a method for the comparison of genomes which exploits encoded function, we first focused on an adequate scoring scheme. So far, gene content methods have been based exclusively on the concept of homology. For assessment of this approach, the following characteristics have to be considered. (1) This categorization of genes (gene products) is a binary one. A scoring scheme with finer granularity supports a more precise and less error-prone comparison of genomic content. (2) The classification may fail on paralogues. It was shown that gene duplication is an important factor in genome evolution (Snel et al. 2002). (3) This classification is based on a common evolution of respective genes. In cases where a nonorthologous in situ replacement of a gene via HGT preserves function, the analysis of homology will report disparate genome content.

With the advent of gene ontology (GO), this binary classification scheme can easily be replaced by a continuous one. GO is a standardized vocabulary permitting a coherent annotation of gene products. It is now common to supply genes and gene products with a set of GO terms annotating, e.g., function or their involvement in biological processes. Recently, methods for comparing sets of GO terms have been introduced (Del Pozo et al. 2008; Schlicker et al. 2006). The latter method relies on two similarity measures; one, named funSim, can be used to characterize the functional similarity of gene products. It has been shown that this identification of functionally related proteins is independent of their evolutionary relationship (Schlicker et al. 2006). The outcome of funSim is, for each pair of genes a_i, b_j, a score 0.0 ≤ funSim(a_i, b_j) < 1.0. For the following, we assume that genome G1 consists of n genes a_1, a_2, …, a_n, and genome G2 of m genes b_1, b_2, …, b_m, being annotated with GO terms. In addition, it is assumed that m ≤ n, which can always be ascertained by changing indices, if necessary. A matrix GO_S[a_1…a_n][b_1…b_m] can be computed, which harbors all funSim(a_i, b_j) values. In analogy to classical scoring matrices, GO_S constitutes a basis for the comparison of G1 and G2 in gene function.

For analyses described below, we utilized the annotations deposited in the Genome Reviews database of the EBI (see “Materials and Methods”), which provides comprehensively annotated genomes. A typical example is Escherichia coli K-12 (accession number U00096_GR.gbk). This dataset contained 4277 genes; 3496 have been annotated with GO terms. Of the remaining 781 genes, 462 have been described as “hypothetical” or “uncharacterized”; most of the other annotations are nonspecific. Therefore, one can assume that the largest fraction of shared genes has been provided with GO terms, putting an analysis on a sound basis.

As explained above, current algorithms for genome comparison are based on a binary classification of genes. Additionally, those classical algorithms for sequence comparison (Smith and Waterman 1981) which can utilize a scoring system cannot deal with inversions. However, this kind of genetic rearrangement occurs quite frequently, even in closely related genomes (Hughes 2000; Belda et al. 2005). Therefore, we propose a novel method which rests on the identification of high-scoring segments as BLAST does (Altschul et al. 1990).

Identifying Gene Series of Maximal Length with the Most Similar Function

An approximation for computing an edit distance is the construction of a cover (Swenson et al. 2008). A cover consists of a series of genes that exist in both genome G1 and genome G2. A cover is said to be optimal if it corresponds to the minimal number of edit operations needed to transform G1 into G2. However, the computation of an optimal cover is NP-hard (see Swenson et al. 2008). Therefore, a minimal cover that consists of the smallest number of series is used as a surrogate (Swenson et al. 2008).

Here we propose an algorithm that identifies a functionally minimal cover for the genomes G1 and G2. The algorithm utilizes the matrix GO_S[a_1…a_n][b_1…b_m]. GO_S values were used to identify high scoring 3-tuples of genes (called HS3Ts or A_HS3Ts). We selected tuples of length 3, as these are the shortest n-mers allowing the identification of local optima. HS3Ts were determined according to the following rules and stored in a matrix TG of size n × m:

$$ HS3T[i,j] = \left\{ {\begin{array}{*{20}c} 1 & {{\text{if}}\;diag(i,j) = {\text{true}}} \\ 0 & {{\text{if}}\;diag(i,j) = {\text{false}}} \\ \end{array} } \right. $$
(1)

The value of diag(i,j) originated from the following expression (compare Supplementary Fig. S1, Panel A):

$$ \begin{gathered} \left( {GO\_S\left[ {a_{i} ,b_{j} } \right] \ge GO\_S\left[ {a_{i} ,b_{j + 1} } \right]} \right) \wedge \left( {GO\_S\left[ {a_{i} ,b_{j} } \right] \ge GO\_S\left[ {a_{i} ,b_{j - 1} } \right]} \right) \hfill \\ \wedge \left( {GO\_S\left[ {a_{i} ,b_{j} } \right] \ge GO\_S\left[ {a_{i - 1} ,b_{j} } \right]} \right) \wedge \left( {GO\_S\left[ {a_{i} ,b_{j} } \right] \ge GO\_S\left[ {a_{i + 1} ,b_{j} } \right]} \right) \hfill \\ \wedge \left( {GO\_S\left[ {a_{i - 1} ,b_{j - 1} } \right] \ge GO\_S\left[ {a_{i - 1} ,b_{j} } \right]} \right) \wedge \left( {GO\_S\left[ {a_{i - 1} ,b_{j - 1} } \right] \ge GO\_S\left[ {a_{i} ,b_{j - 1} } \right]} \right) \hfill \\ \wedge \left( {GO\_S\left[ {a_{i + 1} ,b_{j + 1} } \right] \ge GO\_S\left[ {a_{i + 1} ,b_{j} } \right]} \right) \wedge \left( {GO\_S\left[ {a_{i + 1} ,b_{j + 1} } \right] \ge GO\_S\left[ {a_{i} ,b_{j + 1} } \right]} \right) \hfill \\ \end{gathered} $$
(2)

For HS3T[i,j] = 1, three neighboring elements of TG were set to 1 according to TG[i,j] = TG[i + 1,j + 1] = TG[i − 1,j − 1] = 1.

Analogously, stretches indicating genomic inversions were identified:

$$ A\_HS3T\left[ {i,j} \right] = \left\{ {\begin{array}{*{20}c} 1 & {{\text{if}}\;A\_diag(i,j) = {\text{true}}} \\ 0 & {{\text{if}}\;A\_diag(i,j) = {\text{false}}} \\ \end{array} } \right. $$
(3)

A_diag(i,j) is the result of the following term (compare Supplementary Fig. S1, Panel B):

$$ \begin{gathered} \left( {GO\_S\left[ {a_{i} ,b_{j} } \right] \ge GO\_S\left[ {a_{i} ,b_{j + 1} } \right]} \right) \wedge \left( {GO\_S\left[ {a_{i} ,b_{j} } \right] \ge GO\_S\left[ {a_{i} ,b_{j - 1} } \right]} \right) \hfill \\ \wedge \left( {GO\_S\left[ {a_{i} ,b_{j} } \right] \ge GO\_S\left[ {a_{i - 1} ,b_{j} } \right]} \right) \wedge \left( {GO\_S\left[ {a_{i} ,b_{j} } \right] \ge GO\_S\left[ {a_{i + 1} ,b_{j} } \right]} \right) \hfill \\ \wedge \left( {GO\_S\left[ {a_{i + 1} ,b_{j - 1} } \right] \ge GO\_S\left[ {a_{i + 1} ,b_{j} } \right]} \right) \wedge \left( {GO\_S\left[ {a_{i + 1} ,b_{j - 1} } \right] \ge GO\_S\left[ {a_{i} ,b_{j - 1} } \right]} \right) \hfill \\ \wedge \left( {GO\_S\left[ {a_{i - 1} ,b_{j + 1} } \right] \ge GO\_S\left[ {a_{i} ,b_{j + 1} } \right]} \right) \wedge \left( {GO\_S\left[ {a_{i - 1} ,b_{j + 1} } \right] \ge GO\_S\left[ {a_{i - 1} ,b_{j} } \right]} \right) \hfill \\ \end{gathered} $$
(4)

If A_HS3T[i,j] was 1, the content of TG was altered according to TG[i,j] = TG[i − 1,j + 1] = TG[i + 1,j − 1] = 1.

It is reasonable to prevent the further assessment of a pair of genes a_i, b_j that do not have similar function. Therefore, we introduced a lower limit GO_cut_off when filling GO_S. Besides unrelated function, low funSim values might originate from inadequate annotation quality, from inconsistencies in the ontology, or from errors in the funSim implementation. To assess funSim values, we utilized GO4genome to compare all genes a_j of those 19 genomes Gi constituting the E. coli dataset (see below) with themselves and determined the distribution of funSim_GiGi(a_j, a_j) values. Altogether 48,746 gene pairs were analyzed; less than 5% had funSim values <0.59, and more than 90% a funSim value ≥0.87. Therefore, we selected GO_cut_off = 0.59. These results also confirmed that the annotations as deposited in the Genome Reviews database as well as the implementation of funSim are of high quality. We confirmed that the outcome of GO4genome does not depend critically on this parameter. Supplementary Fig. S2 allows comparison of analyses of the E. coli dataset based on GO_cut_off values of 0.59, 0.68, and 0.75.
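The role of GO_cut_off can be illustrated with a short sketch. The helper names `fraction_below` and `apply_cut_off` are hypothetical, and replacing sub-threshold scores by 0.0 is an assumption of this example; the text states only that a lower limit was applied when filling GO_S, not how rejected pairs were encoded.

```python
def fraction_below(self_scores, threshold):
    """Fraction of self-comparison funSim values lying below the threshold."""
    return sum(1 for s in self_scores if s < threshold) / len(self_scores)


def apply_cut_off(go_s, cut_off=0.59):
    """Return a copy of GO_S in which scores below cut_off are set to 0.0
    (assumed encoding for 'no similar function')."""
    return [[s if s >= cut_off else 0.0 for s in row] for row in go_s]
```

With the self-comparison scores of a dataset as input, `fraction_below(scores, 0.59)` reproduces the kind of check described above, namely that only a small share of genes falls below the chosen limit when compared with themselves.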

If HS3Ts overlapped, longer diagonal elements diag(a_i, b_j, a_k, b_l) resulted, extending from position i, j to position k, l. The same could be the case for A_HS3Ts. All diagonal elements occurring in TG were sorted according to their length and stored in a list, DIAG_LIST. In the next step, an optimal set of diagonal elements was selected in order to label genes b_1 to b_m. Starting with the element diag(a_i, b_j, a_k, b_l) of maximal length, genes b_j to b_l were labeled. In addition, all elements of any diag_m belonging to the corresponding intervals a_i…a_k or b_j…b_l were removed. Entries in DIAG_LIST were processed until all genes b_1 to b_m were labeled or until DIAG_LIST was empty. The result of this process is a set of diagonal elements (a functionally minimal cover) S_DIAG that contains all genes b_j of G2 possessing a significant functional similarity to genes of G1. Please note that, due to this filter, gene pairs a_i, b_j possessing the highest funSim values are not necessarily elements of S_DIAG. This set may contain crosswise-arranged elements, which could be separated by gaps of arbitrary lengths (compare Supplementary Fig. S3). Figure 1 shows that the set of genes constituting S_DIAG and those sequences aligned by MUMmer or generated by a pairwise BLAST analysis overlap to a great extent. Using the above results, a distance Dist_GO for G1 and G2 was calculated according to the following formulae:

$$ sim_{GO} \left( {diag_{k} } \right) = \sum\limits_{{a_{i} ,b_{j} \in diag_{k} }} {funSim(a_{i} ,b_{j} )} $$
(5)
$$ Dist_{GO} \left( {G1,G2} \right) = - \log \left( {\sum\limits_{{diag_{k} \in S\_DIAG}} {\left( {\frac{{sim_{GO} (diag_{k} )}}{weighted\_gsize(G1,G2)}} \right)}^{\lambda } } \right) $$
(6)
$$ weighted\_gsize\left( {G1,G2} \right) = \frac{\sqrt 2 \cdot size(G1) \cdot size(G2)}{{\sqrt {size(G1)^{2} + size(G2)^{2} } }} $$
(7)

sim_GO(diag_k) is the sum of all funSim values for those gene pairs constituting one element diag_k ∈ S_DIAG. If two neighboring elements diag_k, diag_l occupied the same diagonal line, sim_GO values were merged (see Supplementary Fig. S3). For the computation of a distance, sim_GO(diag_k) values were divided by the weighted average genome size, weighted_gsize(G1,G2), in analogy to Korbel et al. (2002). For Formula (6) we propose to use a λ which is >1.0. In this case, any combination of two or more normalized sim_GO(diag_k) values (indicating rearrangements) will sum up to a value which is <1.0. The comparison of trees deduced for the E. coli dataset (data not shown) proved that λ = 1.05 is appropriate.
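Formulae (5) to (7) translate directly into a few lines of code. The sketch below is an illustration under stated assumptions: the function names are chosen for this example, `sim_values` is assumed to hold one precomputed sim_GO(diag_k) per element of S_DIAG, genome size is taken as a gene count, and the natural logarithm is used because the base of the logarithm is not specified in Formula (6).

```python
import math


def weighted_gsize(size1, size2):
    """Formula (7): weighted average genome size of G1 and G2."""
    return math.sqrt(2) * size1 * size2 / math.sqrt(size1 ** 2 + size2 ** 2)


def dist_go(sim_values, size1, size2, lam=1.05):
    """Formula (6): distance from the sim_GO values of all elements of S_DIAG.

    sim_values: one sim_GO(diag_k) per diagonal element (Formula (5)).
    lam:        exponent lambda; the text proposes a value > 1.0.
    """
    if not sim_values:
        raise ValueError("S_DIAG is empty; the distance is undefined")
    wg = weighted_gsize(size1, size2)
    total = sum((s / wg) ** lam for s in sim_values)
    return -math.log(total)
```

The effect of λ > 1.0 can be seen for two genomes of 100 genes each. A single element with sim_GO = 100 gives a sum of 1 and hence a distance of 0, whereas the same total similarity split into two elements of 50 gives 2 · 0.5^1.05 < 1 and a distance of 0.05 · ln 2 ≈ 0.035, so fragmentation by rearrangements increases the distance.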

Fig. 1

Whole-genome comparison of Y. pestis CO-92 and Y. pestis Antiqua using three different methods. To identify genomic regions showing maximal synteny, three plots were generated. These originated from pairwise BLAST hits (left column), MUMmer (middle column), and GO4genome (right column). Y. pestis genomes contain a large number of transposases, contributing to the regular pattern in the BLAST plot and the “noise” in the MUMmer plots. Because the specific selection of diagonal elements acts as a rigorous filter, these duplicates do not occur in the GO4genome plot. For determination of dot-plots based on MUMmer or BLAST hits, we utilized the tools offered at the Comparative Tools page of the JCVI (http://cmr.jcvi.org/). The genome of Y. pestis CO-92 is plotted on the abscissa

The evolutionary distance Dist_GO(G1,G2) was deduced from the estimated similarity by applying the negative logarithm, as proposed by Korbel et al. (2002). Please note that short fragments contribute only marginally to the distance value; see Formula (6). Therefore, we did not consider elements consisting of fewer than three gene pairs; compare Formulae (2) and (4). For the E. coli dataset (see below), the number of elements making up individual sets S_DIAG varied between 1 and 208.
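The greedy selection of S_DIAG from DIAG_LIST, together with the minimum length of three gene pairs, can be sketched as follows. This is one possible reading of the procedure and not the published implementation. Each diagonal element is assumed to be a list of (i, j) index pairs ordered by increasing i, the names `split_runs` and `select_cover` are hypothetical, and the handling of a partially trimmed element (splitting it into contiguous runs and discarding runs shorter than three pairs) is an assumption of this example.

```python
def split_runs(cells):
    """Split cells (ordered by row index) into maximal runs of consecutive rows."""
    runs, current = [], []
    for cell in cells:
        if current and cell[0] != current[-1][0] + 1:
            runs.append(current)
            current = []
        current.append(cell)
    if current:
        runs.append(current)
    return runs


def select_cover(diag_list, min_len=3):
    """Greedily pick the longest remaining diagonal element, then remove from
    all other elements the cells that fall into its row or column interval."""
    pending = [list(d) for d in diag_list if len(d) >= min_len]
    cover = []
    while pending:
        pending.sort(key=len, reverse=True)  # longest element first
        best = pending.pop(0)
        cover.append(best)
        rows = {i for i, _ in best}  # genes a_i ... a_k already used
        cols = {j for _, j in best}  # genes b_j ... b_l now labeled
        trimmed = []
        for d in pending:
            kept = [(i, j) for i, j in d if i not in rows and j not in cols]
            trimmed.extend(r for r in split_runs(kept) if len(r) >= min_len)
        pending = trimmed
    return cover
```

The loop ends when the list is empty, which also covers the case in which all genes of G2 are labeled, because no cell of any remaining element can then survive the trimming step.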

For a set of genomes G1…Gn, the outcome of all pairwise comparisons Gi, Gj is a distance matrix of size n × n. A frequently used method for the construction of a tree is some variant of a neighbor joining algorithm (Saitou and Nei 1987). The resulting tree will be free of ambiguities, if the distance matrix is additive. However, for the general case, we did not expect additive matrices when comparing several genomes via GO4genome. If conflicting signals (i.e., distances) exist, a neighbor net can be used for indication. We utilized the version implemented with SplitsTree4 (Huson and Bryant 2006).
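Whether a distance matrix is additive can be tested with the four-point condition: for every quadruple of taxa, the two largest of the three pairwise sums d(i,j) + d(k,l), d(i,k) + d(j,l), and d(i,l) + d(j,k) must be equal. The sketch below implements this standard criterion. It is given for illustration only; the text does not state that such a test was run, and the function name `is_additive` and the tolerance `tol` are assumptions of this example.

```python
from itertools import combinations


def is_additive(dist, tol=1e-9):
    """Four-point condition on a symmetric n x n distance matrix."""
    n = len(dist)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            dist[i][j] + dist[k][l],
            dist[i][k] + dist[j][l],
            dist[i][l] + dist[j][k],
        ])
        # The two largest sums must coincide for a tree metric.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True
```

A matrix that fails this test contains conflicting signals, which is the situation a neighbor net is meant to display.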

GO4genome Deduced a Sound Phylogeny for E. coli and Close Relatives

As the first case, we analyzed a dataset containing GO terms of all completely sequenced E. coli genomes, those of Salmonella typhimurium, Shigella boydii, Shigella dysenteriae, three strains of Shigella flexneri, Shigella sonnei, Yersinia pestis, Buchnera aphidicola, Rickettsia prowazekii, Rickettsia conorii, and Aeropyrum pernix. Figure 2 shows the resulting neighbor net. The net indicates that some conflicting signals exist. However, for E. coli species and close relatives, their phylogenetic relation could be resolved unambiguously. The uropathogenic strains E. coli 536, E. coli UTI89, and E. coli CFT073 and the avian pathogenic strain E. coli O1:K1/APEC form a subtree, as do the two enterohemorrhagic strains E. coli O157:H7/EDL933 and E. coli O157:H7/str. Sakai and E. coli K-12. The relationship of E. coli K-12, E. coli O157:H7, and E. coli CFT073 is in agreement with findings deduced from the comparison of DNA sequences (Elena et al. 2005) and tRNA genes (Withers et al. 2006). The observation that the genome composition of the avian E. coli O1:K1 strain is most similar to that of UTI89 followed by E. coli 536, E. coli CFT073, and E. coli K-12 is in agreement with results deduced from genome content (Johnson et al. 2007).

Fig. 2

A neighbor net of E. coli, shigellae, and several other microbial species deduced from encoded gene function. GO4genome was used to compute a distance matrix. SplitsTree4 was utilized to generate and display a neighbor net. A local net-like structure indicates ambiguities. Thus, regions of unclear topology can be visualized. See “Materials and Methods” for species names

The position of S. flexneri and S. typhimurium corresponds to previous findings: S. flexneri is assumed to originate from an ancestral E. coli strain (Rolland et al. 1998). According to a phylogenetic analysis of gyrB gene sequences, S. flexneri is a closer relative of E. coli than of S. typhimurium (Fukushima et al. 2002). The relation of S. flexneri and the last-mentioned E. coli strains is in agreement with a whole-genome tree and an average nucleotide identity tree (see Konstantinidis et al. 2006). All Shigella genomes were grouped together; the three S. flexneri strains cluster in one distinct group. S. boydii, S. dysenteriae, and S. sonnei constitute a second cluster. A phylogenetic analysis of shigellae, based on smaller sets of gene sequences, resulted in inconsistent phylogenies (see Yang et al. 2007); see the “Discussion”.

Buchnera, Y. pestis, Rickettsia, and A. pernix were more distant from the other species. The positioning of Buchnera is a specific challenge, as the genome of this endosymbiont has undergone massive genome reduction since the divergence from a free-living γ-proteobacterial ancestor. High substitution rates and biased nucleotide patterns have been the reason for the deviant tree topologies computed for individual sequences. A tree deduced from a concatenation of 205 protein sequences gave the same relationship as shown in Fig. 2 for E. coli, S. typhimurium, Y. pestis, and Buchnera (Lerat et al. 2003). In summary, these consistencies demonstrate that the above method of analyzing gene function and order generates a sound phylogeny, which is in most cases consistent with classical methods. As expected, the topology of the GO4genome net is less resolved for distantly related species (compare Fig. 2). Gene order conservation is lost rapidly when comparing species which are less related (Tamames 2001).

Streptococci Form Distinct Groups

The genus Streptococcus comprises some of the most diverse and important human and agricultural pathogens. The genomes of streptococci exhibit extreme levels of evolutionary plasticity accompanied by a high level of gene gain and loss. It has been shown that recombination is an important factor in the evolution of Streptococcus genomes (Lefébure and Stanhope 2007). Based on gene gain, loss, and duplication, core-based phylogenies have been determined for Streptococcus and, more specifically, for S. agalactiae and S. pyogenes strains (Lefébure and Stanhope 2007). According to this approach, S. pyogenes and S. agalactiae are closely related, as well as S. pneumoniae and S. suis. Additionally, a tree for Streptococcus has been deduced from a joint analysis of 504 single-copy genes (Anisimova et al. 2007). In this case, S. pyogenes and S. agalactiae have been most similar, as well as S. thermophilus and S. pneumoniae. Genome organization as deduced by GO4genome is in agreement with these findings and additionally identifies the genome structure of S. sanguinis as most similar to that of S. pneumoniae and S. suis; see Fig. 3. In addition, the net topology is concordant with findings deduced from an analysis of dnaJ and gyrB sequences (Itoh et al. 2006).

Fig. 3

A phylogenetic classification of streptococci based on encoded gene function and gene location. GO4genome was used to compute a distance matrix according to Formula (6). By means of SplitsTree4, a neighbor net was generated and plotted. Among S. pyogenes strains, several clusters are discernible

According to Lefébure and Stanhope (2007), among S. pyogenes strains the pairs (MGAS9429, MGAS2096), (MGAS315, SSI-1), and (M1 GAS, MGAS5005) are most related. GO4genome predicted the same relationship; compare Fig. 3. However, for some strain pairs, like (MGAS8232, MGAS10394), the predictions differ. Additionally, the net indicates that the serovars M3 and M18 form one group, and M1, M2, M4, M12, and M28 a second group, which is less homogeneous. M5 and M6 lie isolated. In summary, the phylogenetic net showed a relatively low level of ambiguities. The analysis of genome organization clearly separated individual Streptococcus species and allowed the grouping of serovars. As can be seen, gene gain and loss had no major impact on the overall genome organization of the species.

Horizontal Gene Transfer Has Little Effect on the Genome Organization of Methanosarcina

So far, three genomes of Methanosarcina have been analyzed. The genomes differ significantly in size: the genome of M. mazei contains 3370 genes; that of M. barkeri, 3606 genes; and that of M. acetivorans, 4540 genes. It has been postulated that up to 30% of the M. mazei genes have been acquired via HGT (Deppenmeier et al. 2002). For M. mazei, 8.1% of its genes constitute larger genomic islands with atypical codon usage; for M. acetivorans this fraction is 10.8% (Merkl 2004). Thus, these genomes represent an appropriate set for testing the robustness of GO4genome against HGT and variations in genome size. We compiled a dataset consisting of the above Methanosarcina and Methanosaeta thermophila (a distantly related member of the methanosarcinales), Methanospirillum hungatei, Methanoculleus marisnigri, Methanocorpusculum labreanum (three methanomicrobiales), three pyrococci, and two thermoplasmata. Figure 4 shows the resulting neighbor net. All species belonging to the same order were grouped in distinct subnets; the only exception was M. thermophila. It is known that the evolutionary relationship to Methanosarcina is a distant one: analysis of the 16S rRNA gave the same local topology as shown in Fig. 4 for M. mazei, M. thermophila, and M. hungatei (Sekiguchi et al. 1998). Notably, the Methanosarcina species form a distinct subgroup, indicating that variations in genome size and larger amounts of HGT have only a minor effect on the resolving power of GO4genome.

Fig. 4

A whole-genome phylogeny for methanosarcinales and other archaea. GO4genome was used to determine a distance matrix. SplitsTree4 was utilized for computation of a neighbor net and visualization. The suffixes indicate the lineage: MS methanosarcinales, MM methanomicrobiales, TP thermoplasmatales, TC thermococcales. See “Materials and Methods” for species names

GO4genome Groups Yersinia in a Novel Way

Yersinia pestis is a Gram-negative bacterium and the causative agent of plague. Y. pestis is considered a recently emerged clone of Y. pseudotuberculosis, which evolved during the last 9000–40,000 years (Achtman et al. 2004). Originally, yersiniae were grouped into a “nonclassical” subspecies (containing Microtus) and three “classical” biovars, based on their ability to reduce nitrate and utilize glycerol: Antiqua (positive for both markers), Mediaevalis (do not reduce nitrate but utilize glycerol), and Orientalis (positive for nitrate reduction but do not utilize glycerol). Due to the latest analytical methods and molecular relatedness, Y. pestis strains were split into three major branches (Achtman et al. 2004; Auerbach et al. 2007). Branch 0 contains Y. pestoides isolates and the Microtus isolate 91001. 1.ORI subsumes bacteria related to Orientalis strains, classical Mediaevalis strains are referred to 2.MED, and Antiqua isolates are split into two distinct groups, 1.ANT and 2.ANT, which were isolated in Africa and East Asia, respectively. A MLVA analysis suggested that 2.MED and 2.ANT represent sister clades (Achtman et al. 2004). Based on the analysis of several parameters like SNPs and the genome-specific inactivation of genes, it has been postulated that the Antiqua and CO-92 strains belong to one branch, and KIM and Nepal516 to the second one. According to this analysis, 1.ANT is closely related to the Orientalis strain CO-92, while 2.ANT (represented by the Asian Antiqua strain Nepal516) is more closely related to the Mediaevalis strain KIM (Chain et al. 2006). Figure 5 shows that GO4genome proposed a different topology: one split separated Y. pestis Mediaevalis KIM5, Y. pestis biovar Microtus 91001, and Y. pestis Orientalis CO-92; a second one, the two Antiqua strains Y. pestis Antiqua and Y. pestis Nepal516; and a third, Y. pseudotuberculosis and Y. enterocolitica. 
This result indicates that the genome organization of biovars Mediaevalis (including Microtus) and Orientalis (represented by CO-92) is most similar; the same holds for the two representatives of the Antiqua biovar. For strain 91001, evolution from an ancient Y. pestis strain in a different lineage has been postulated (Song et al. 2004). According to GO4genome, its genome organization most resembles that of CO-92 and KIM5.

Fig. 5
A whole-genome phylogeny for Yersinia strains. GO4genome was used to determine a distance matrix. SplitsTree4 was utilized for computation of a neighbor net and visualization. The net indicates that the genomes of the two Mediaevalis strains and of CO-92, as well as those of the two Antiqua strains, are most similar, respectively, when compared regarding gene function and their location
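
The workflow named in the caption (a pairwise distance matrix handed to SplitsTree4 for neighbor-net computation) can be illustrated with a minimal sketch that writes a symmetric distance matrix as a NEXUS file. The function name `to_nexus`, the strain labels, and all distance values below are hypothetical placeholders, not GO4genome output, and the exact FORMAT options accepted by SplitsTree4 should be checked against its manual.

```python
from typing import List, Sequence

def to_nexus(labels: Sequence[str], matrix: Sequence[Sequence[float]]) -> str:
    """Render a square, symmetric distance matrix as a NEXUS taxa + distances block."""
    n = len(labels)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square and match the number of labels")
    for i in range(n):
        for j in range(n):
            if abs(matrix[i][j] - matrix[j][i]) > 1e-9:
                raise ValueError("distance matrix must be symmetric")
    lines: List[str] = ["#nexus", "BEGIN taxa;", f"DIMENSIONS ntax={n};", "TAXLABELS"]
    lines += [f"  {name}" for name in labels]
    lines += [";", "END;", "BEGIN distances;", f"DIMENSIONS ntax={n};",
              "FORMAT triangle=both diagonal labels;", "MATRIX"]
    for name, row in zip(labels, matrix):
        # One row per taxon: label followed by its distances to all taxa.
        lines.append(f"  {name} " + " ".join(f"{d:.4f}" for d in row))
    lines += [";", "END;"]
    return "\n".join(lines) + "\n"

# Hypothetical labels and distances, for illustration only.
labels = ["Y_pestis_CO92", "Y_pestis_KIM5", "Y_pseudotuberculosis"]
distances = [
    [0.00, 0.10, 0.40],
    [0.10, 0.00, 0.45],
    [0.40, 0.45, 0.00],
]
nexus_text = to_nexus(labels, distances)
print(nexus_text)
```

The resulting text can be saved to a file and opened in a network viewer. Labels avoid spaces so that no quoting is needed.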

Conclusions That Can Be Drawn from Genome Organization

The analyses introduced above exemplify the application of GO4genome and indicate the types of problems that can be studied. In the following, we summarize some results. The crenarchaeon Sulfolobus solfataricus and the euryarchaeon Thermoplasma acidophilum inhabit the same ecological niche. There is evidence for a large amount of HGT between these species (Ruepp et al. 2000); many genes are closely related (e.g., trpA and trpB [Merkl 2007]). However, Fig. 4 clearly indicates that the genome composition of these species is quite dissimilar. For M. mazei, 8.1% of its genes constitute larger genomic islands with atypical codon usage; for M. acetivorans this figure is 10.8% (Merkl 2004). Figure 4 shows that, despite these islands, their overall genome composition is still highly similar. Both findings suggest that HGT restructures genome content only locally.

Shigellae do not have a single evolutionary origin; however, many of their characteristics indicate convergent evolution (Pupo et al. 2000). Figure 2 makes clear that convergent evolution can be seen on the level of genome organization. More specifically, genome composition separates the three S. flexneri strains from S. boydii, S. sonnei, and S. dysenteriae, which constitute a separate cluster. In the case of Y. pestis, the similarity of genome organization suggests convergent evolution of the Antiqua strains. The effect is detectable on the genome level; compare Fig. 5.

Discussion

What Is the Outcome of Classical Methods for the Cases Considered?

At first glance, it seems trivial to deduce the relationship of closely related prokaryotes. However, a comparison of the outcome of state-of-the-art methods makes clear that this is not always the case. Several cases are discussed below. The first example is the E. coli group. According to the analysis of tRNA genes (Withers et al. 2006) and 36 randomly chosen genomic regions (Elena et al. 2005), E. coli O157:H7 is a closer relative of E. coli K-12 than S. flexneri is. However, maximum likelihood analyses of core genomes and the ANI method identify S. flexneri as being more closely related to E. coli K-12 than to E. coli O157:H7 (Konstantinidis et al. 2006). These differences might be due to the specific fate of individual genes. As has been pointed out, not all γ-proteobacterial core genes bear a similar phylogenetic signal supporting the same tree topology (Susko et al. 2006).

Shigellae have long been known to be closely related to E. coli. Based on biotyping, the genus has been divided into the four species S. boydii, S. dysenteriae, S. flexneri, and S. sonnei. Based on the analysis of eight housekeeping genes, it has been postulated that shigellae do not have a single evolutionary origin, which indicates convergent evolution of phenotypic properties (Pupo et al. 2000). An analysis of 23 housekeeping genes (Yang et al. 2007) has confirmed the grouping of shigellae into three main clusters, C1, C2, and C3. Clusters C1 and C2 consisted of S. dysenteriae and S. boydii strains; most of the strains (like F2a used here) constituting C3 were S. flexneri. The S. boydii strain Sb277 (used here) belonged to C1. The S. sonnei strain Ss046 (used here) was a direct neighbor of C1. The S. dysenteriae strain Sd197 (used here) lay isolated; its closest neighbors were E. coli EDL933 and E. coli Sakai. In contrast, an analysis of four chromosomal genes which were particularly polymorphic grouped Sd197 close to C1 and Ss046 close to C2 and C3 (Yang et al. 2007).

Based on the analysis of SNPs, a new nomenclature has been proposed for yersiniae (Achtman et al. 2004); see above. It has been postulated that Antiqua and CO-92 belong to one branch, and KIM and Nepal516 to a second one (Chain et al. 2006). A DNA microarray analysis of 22 strains of Y. pestis indicated that the two biovar strains Antiqua and Mediaevalis showed the most divergence from the CO-92 strain, and KIM and Nepal516 were clustered together (Hinchliffe et al. 2003). An analysis of CRISPR elements suggested that the Orientalis lineage branched out of the Antiqua strain earlier than the Mediaevalis biovar; the relative position of African Antiqua strains could not be fixed (Vergnaud et al. 2007). In summary, the above examples indicate that the phylogenetic signals studied highlight different aspects of genome evolution. This observation is in agreement with recent findings deduced from several methods of whole-genome phylogeny (McCann et al. 2008).

What Distinguishes Whole-Genome Analysis from Traditional Methods?

Several aspects of genome organization are not covered by classical methods. In many cases, bacteriophages are involved in the transfer of genomic islands. For S. flexneri 2a, 314 IS elements have been identified, which is more than sevenfold the content of E. coli K-12 (Jin et al. 2002). A comparison of the Y. pseudotuberculosis genome with CO-92 and KIM10+ indicated that an extraordinary expansion of IS families has occurred since their divergence. It was deduced that the least common ancestor of CO-92 and KIM10+ carried 109 IS elements. Since their divergence from Y. pseudotuberculosis, KIM10+ and CO-92 have undergone 10 and 18 rearrangements, respectively (Chain et al. 2004). Thus, it is quite likely that the insertion elements and/or the subsequent rearrangements they have generated played an important role in the speciation of Y. pestis strains (Chain et al. 2004). Y. pestis is actively undergoing reductive evolution and there is some evidence for convergent evolution (Chain et al. 2006).

In addition to the acquisition of novel genes by means of HGT, genetic rearrangements alter the position, the orientation, or the coding strand with respect to the origin of replication. As a consequence, gene dosage may be affected, as has been demonstrated for inversions in the genome of E. coli (Hill and Gray 1988). Depending on position, the effects of such rearrangements differ drastically (Esnault et al. 2007). Compared to E. coli sequences, 13 translocations and inversions of size >5 kb have been identified in the genome of S. flexneri. It has been assumed that these rearrangements allow reoptimization of promoters in order to cope with selective pressure (Jin et al. 2002). The impact of rearrangements and their high frequency indicated above demand whole-genome analysis. In contrast to this approach, the analysis of a few genes or of SNPs covers a different aspect of phylogenomics, namely, the historical lineage of genes or genomes.
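
How an inversion changes both the position and the coding strand of the affected genes can be made concrete with a minimal sketch. It assumes the common signed-permutation model of a chromosome, in which each gene is a (name, strand) pair; the function name `invert_segment` and the toy genome are hypothetical and are not part of GO4genome.

```python
from typing import List, Tuple

Gene = Tuple[str, int]  # (gene name, strand: +1 or -1)

def invert_segment(genome: List[Gene], start: int, end: int) -> List[Gene]:
    """Return a copy of `genome` in which genes start..end-1 are inverted.

    An inversion reverses the order of the genes in the segment and
    flips the strand of each of them.
    """
    if not 0 <= start <= end <= len(genome):
        raise ValueError("segment out of range")
    segment = genome[start:end]
    inverted = [(name, -strand) for name, strand in reversed(segment)]
    return genome[:start] + inverted + genome[end:]

# Toy genome, for illustration only.
genome = [("a", 1), ("b", 1), ("c", -1), ("d", 1), ("e", 1)]
rearranged = invert_segment(genome, 1, 4)
print(rearranged)
# [('a', 1), ('d', -1), ('c', 1), ('b', -1), ('e', 1)]
```

Genes b and d swap their positions and every gene in the segment changes strand, while the gene content is untouched. This is why such events are invisible to gene-content comparisons.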

As has been shown, analysis of the common gene content has disadvantages as a measure for determination of phylogenies (Tamames 2001). In contrast, gene order conservation defines the course of evolution more precisely. In addition, its analysis does not depend on the presence of a certain set of genes. Along these lines, GO4genome supports a completely different aspect of “genome similarity,” supplementing sequence-based methods and those elucidating the evolution of genes and genomes. As our approach assesses genomic signals which are influenced by more and different parameters than those related to the fate of single molecules, a grouping of species that differs from the one obtained with classical markers is no surprise and does not reflect on the quality of either method. The networks resulting from GO4genome trace the evolutionary process of speciation based not on mutational events but on signal similarities in genome organization. As shown above for yersiniae, the genome organization of Y. pestis Antiqua and Y. pestis Antiqua Nepal516 is most similar; the same holds for KIM5, CO-92, and Mediaevalis 91001. Among Shigella, the genome organization of S. flexneri strains differs from that of S. boydii, S. dysenteriae, and S. sonnei, which form a cluster. Most likely, effects which shape genomes above the gene level are responsible for these similarities.

As is the case for many other algorithms, we cannot prove the reliability of our approach sensu stricto. However, the concordance of a large portion of the net topologies with well-established phylogenetic relations makes our findings highly plausible. For several of the inconsistencies, we have shown that independent findings point to them as well. In addition, it is unlikely that, just by chance, the genomes of (say) the Antiqua strains or of shigellae cluster in the pattern observed.

Limitations and Further Improvements

For prokaryotes, the organization of their genes in operons (Jacob and Monod 1961) and uberoperons (Lathe et al. 2000) is well established and it is known that the degree of genomic rearrangements increases constantly with the time of divergence (Suyama and Bork 2001). This holds even though there are discordant processes like HGT or varying rates of evolution or gene loss. However, these processes have been shown to add noise rather than a directional bias (Dutilh et al. 2004). In summary, these findings argue for analysis of genome organization. The above method is the first one utilizing the overall genome structure for determination of phylogenetic trees. So far, gene order has been exploited for gene pairs (Korbel et al. 2002) or rearrangements have been studied for a reduced set of genes in γ-proteobacterial genomes (Belda et al. 2005). The approach introduced with GO4genome eliminates some of the pitfalls of sequence-based phylogenies by comparing genes on the basis of function. For pairwise comparison of the genomes, it is not necessary to compare the respective sequences, which avoids false assignments. Due to the “fuzzy” scoring function, the selection of paralogues has only a minor effect on the identification of conserved genomic segments. As ontology is exploited, in situ replacements of genes maintaining the function of gene products have little impact on the phylogenetic distance. We believe that assessing HGT events in this way is at least an alternative worth considering. Microbial genomes may contain a substantial number of duplicated genes, which argues for filtering (cf. Fig. 1). The above findings show that the proposed processing is appropriate to identify relevant gene series which can serve as a surrogate for a cover.

Optimal applications for GO4genome are the study of serovars (see Fig. 3) or of closely related species (see Fig. 2).

Several improvements of GO4genome are conceivable. So far, the algorithm assesses gene function and location but not gene orientation. When comparing two genomes, the transcriptional orientation of each gene pair can be the same (positive polarity) or different (negative polarity). However, how to integrate this signal into Formula (6) is unclear. The algorithm considers the length and size of rearrangements but not their location, e.g., with respect to the origin of replication. To do this, it would be necessary to model gene dosage for each species.
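
The polarity signal mentioned above can be computed in a few lines. The sketch below assumes that a "gene pair" means the two copies of a corresponding gene in the two genomes and that strands are encoded as +1 or -1; the function name `polarity` and the example strand assignments are hypothetical. It only derives the signal and makes no attempt to integrate it into Formula (6).

```python
from typing import Dict

def polarity(strands_a: Dict[str, int], strands_b: Dict[str, int]) -> Dict[str, int]:
    """Map each gene shared by both genomes to +1 (same transcriptional
    orientation in both) or -1 (different orientation)."""
    shared = strands_a.keys() & strands_b.keys()
    return {
        gene: 1 if strands_a[gene] == strands_b[gene] else -1
        for gene in sorted(shared)
    }

# Hypothetical strand assignments, for illustration only.
genome_a = {"trpA": 1, "trpB": 1, "hisG": -1}
genome_b = {"trpA": 1, "trpB": -1, "recA": 1}

pair_polarity = polarity(genome_a, genome_b)
print(pair_polarity)
# {'trpA': 1, 'trpB': -1}
```

Genes present in only one genome (here hisG and recA) carry no polarity and are skipped.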

Above, we have focused on genomes consisting of a single chromosome. An analysis of several chromosomes is trivial; how to consider plasmids is unclear. Unfortunately, approaches exploiting gene order cannot be utilized for higher organisms: gene order is poorly conserved in eukaryotes (Huynen et al. 2001). The ultimate goal would be a comparison of the genome organization of all completely sequenced microbial genomes. Due to the modular concept of our approach, such an analysis is feasible.