Introduction

The nitrogen (N) cycle is one of the most important biogeochemical processes underpinning the existence of life on Earth. A key step in this cycle is to convert relatively inert atmospheric dinitrogen (N2) into a bioaccessible form such as ammonia (NH3) through a process referred to as biological nitrogen fixation (BNF). BNF is performed only by a specialized subset of Bacteria and Archaea that possess the necessary cellular machinery to enzymatically reduce N2 into NH3. Some of these bacteria (termed rhizobia or root nodule bacteria) have evolved non-obligatory symbiotic relationships with legumes whereby the bacteria receive a carbon source from the plant and in return supply fixed N to the host [1]. Harnessing this association can boost soil N-inputs and therefore production yields of legumes, or non-legumes grown in subsequent years, without the need for supplementation with industrially synthesized N-based fertilizers [2].

Some of the most widely cultivated pasture legumes are members of the genus Trifolium (clover). The natural distribution of these species spans three centers of diversity, with an estimated 28% of species in the Americas, 57% in Eurasia and 15% in sub-Saharan Africa [3]. Approximately 30 species of clover, predominantly of Eurasian origin, are widely grown as annual and perennial species in pasture systems in Mediterranean and temperate climatic zones [3]. Globally important perennial species of clover include T. repens (white clover), T. pratense (red clover), T. fragiferum (strawberry clover) and T. hybridum (alsike clover). While clovers are known to form N2-fixing symbiotic associations with Rhizobium leguminosarum bv. trifolii, symbiotic compatibility varies widely across strains and hosts, ranging from ineffective (non-N2-fixing) nodulation to fully effective N2-fixing partnerships.

Rhizobium leguminosarum bv. trifolii strain WSM1689 was isolated in 1995 from a nodule of the perennial clover Trifolium uniflorum collected on the edge of a valley 6 km from Eggares on the Greek island of Naxos. T. uniflorum is one of a small number of perennial Trifolium spp. found in the dry Mediterranean basin. While WSM1689 has been shown to be either ineffective on or unable to nodulate a range of annual and perennial Trifolium spp., it is a highly effective N2-fixing microsymbiont of T. uniflorum [4]. R. leguminosarum bv. trifolii WSM1689 therefore has a very narrow host range and represents a good isolate for studying the genetic basis of symbiotic specificity. The availability of these sequence data also complements the already published genomes of the clover-nodulating R. leguminosarum bv. trifolii WSM1325 [5] and WSM2304 [6]. Here we present a summary classification and a set of general features for R. leguminosarum bv. trifolii strain WSM1689, together with a description of the complete genome sequence and its annotation.

Classification and features

R. leguminosarum bv. trifolii strain WSM1689 is a motile, non-sporulating, non-encapsulated, Gram-negative rod in the order Rhizobiales of the class Alphaproteobacteria. The rod-shaped form varies in size with dimensions of approximately 0.25–0.5 µm in width and 2.0 µm in length (Figure 1 Left and 1 Center). It is fast growing, forming colonies within 3–4 days when grown on half strength Lupin Agar (½LA) [7], tryptone-yeast extract agar (TY) [8] or a modified yeast-mannitol agar (YMA) [9] at 28°C. Colonies on ½LA are opaque, slightly domed and moderately mucoid with smooth margins (Figure 1 Right). Minimum Information about the Genome Sequence (MIGS) is provided in Table 1.

Figure 1. Images of Rhizobium leguminosarum bv. trifolii strain WSM1689 using scanning (Left) and transmission (Center) electron microscopy and the appearance of colony morphology on ½LA (Right).

Table 1. Classification and general features of Rhizobium leguminosarum bv. trifolii strain WSM1689 according to the MIGS recommendations [10,11].

Figure 2 shows the phylogenetic neighborhood of R. leguminosarum bv. trifolii strain WSM1689 in a 16S rRNA gene sequence based tree. This strain shares 100% (1362/1362 bp) sequence identity to the 16S rRNA gene of R. leguminosarum bv. trifolii strain WSM1325 [5] and R. leguminosarum bv. trifolii strain WSM2304 [6].
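The identity figure quoted above is the number of matching positions divided by the number of aligned positions. The following is a minimal sketch of that calculation, assuming two sequences that have already been aligned to the same length; the function name percent_identity and the toy sequences are illustrative only and are not the 16S rRNA data or the software used in this study.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity over two pre-aligned sequences of equal length.

    Assumption: every column is compared as-is; gap handling is left out.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("alignment is empty")
    # Count columns where both sequences carry the same base (case-insensitive).
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    return 100.0 * matches / len(seq_a)


# Toy example (hypothetical sequences): 7 of 8 positions match.
print(percent_identity("ACGTACGT", "ACGTACGA"))  # 87.5

# The value reported in the text: 1362 matches over 1362 aligned positions.
print(100.0 * 1362 / 1362)  # 100.0
```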

Figure 2. Phylogenetic tree showing the relationship of Rhizobium leguminosarum bv. trifolii WSM1689 (shown in bold print) to other root-nodulating Rhizobium spp. in the order Rhizobiales based on aligned sequences of the 16S rRNA gene (1,180 bp internal region). All positions containing gaps and missing data were eliminated. All sites were informative and there were no gap-containing sites. Phylogenetic analyses were performed using MEGA, version 5 [25]. The tree was built using the Maximum-Likelihood method with the General Time Reversible model [26]. Bootstrap analysis [27] with 500 replicates was performed to assess the support of the clusters. Type strains are indicated with a superscript T. Brackets after the strain name contain a DNA database accession number and/or a GOLD ID (beginning with the prefix G) for a sequencing project registered in GOLD [28]. Published genomes are indicated with an asterisk.

Symbiotaxonomy

R. leguminosarum bv. trifolii WSM1689 is a highly effective microsymbiont of the perennial Eurasian clover Trifolium uniflorum (Table 2). In contrast, WSM1689 does not nodulate the perennial T. fragiferum and forms white ineffective (Fix−) nodules with other perennial and annual clovers of Eurasian origin. Moreover, WSM1689 is either Nod− or Fix− on clovers of North American or African origin. Therefore, WSM1689 is unusual in having an extremely narrow clover host range for the establishment of effective N2-fixing symbiosis.

Table 2. Compatibility of WSM1689 with both perennial and annual Trifolium genotypes for nodulation (Nod) and N2-Fixation (Fix). Data compiled from [4].

Genome sequencing and annotation

Genome project history

This organism was selected for sequencing on the basis of its environmental and agricultural relevance to issues in global carbon cycling, alternative energy production, and biogeochemical importance, and is part of the Community Sequencing Program at the U.S. Department of Energy, Joint Genome Institute (JGI) for projects of relevance to agency missions. The genome project is deposited in the Genomes OnLine Database [28] and a finished genome sequence in IMG/GEBA. Sequencing, finishing and annotation were performed by the JGI. A summary of the project information is shown in Table 3.

Table 3. Genome sequencing project information for Rhizobium leguminosarum bv. trifolii strain WSM1689.

Growth conditions and DNA isolation

Rhizobium leguminosarum bv. trifolii strain WSM1689 was grown to mid-logarithmic phase in TY rich medium on a gyratory shaker at 28°C [29]. DNA was isolated from 60 mL of cells using a CTAB (cetyl trimethyl ammonium bromide) bacterial genomic DNA isolation method [30].

Genome sequencing and assembly

The genome of Rhizobium leguminosarum bv. trifolii strain WSM1689 was sequenced at the Joint Genome Institute (JGI) using a combination of Illumina [31] and 454 technologies [32]. An Illumina GAii shotgun library generated 73,565,648 reads totaling 5,591 Mbp, and a paired-end 454 library with an average insert size of 12 Kbp generated 376,185 reads totaling 93.4 Mbp. All general aspects of library construction and sequencing performed at the JGI can be found at [30]. The initial draft assembly contained 100 contigs in 4 scaffolds. The 454 paired-end data were assembled with Newbler, version 2.6. The Newbler consensus sequences were computationally shredded into 2 Kbp overlapping fake reads (shreds). Illumina sequencing data were assembled with VELVET, version 1.1.05 [33], and the consensus sequence was computationally shredded into 1.5 Kbp overlapping fake reads (shreds). We integrated the 454 Newbler consensus shreds, the Illumina VELVET consensus shreds and the read pairs in the 454 paired-end library using parallel phrap, version SPS - 4.24 (High Performance Software, LLC). The software Consed [34–36] was used in the subsequent finishing process. Illumina data were used to correct potential base errors and increase consensus quality using the software Polisher developed at JGI (Alla Lapidus, unpublished). Possible mis-assemblies were corrected using gapResolution (Cliff Han, unpublished), Dupfinisher [37], or sequencing of cloned bridging PCR fragments with subcloning. Gaps between contigs were closed by editing in Consed, by PCR and by Bubble PCR (J-F Cheng, unpublished) primer walks. A total of 93 additional reactions were necessary to close gaps and to raise the quality of the finished sequence. The total genome size is 6.9 Mbp. The final assembly is based on 57.3 Mbp of 454 draft data, which provides an average 8.3× coverage of the genome, and 5,345 Mbp of Illumina draft data, which provides an average 774.6× coverage of the genome.
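Two pieces of arithmetic in the paragraph above can be illustrated with a short sketch: cutting a consensus sequence into overlapping fixed-length shreds, and deriving average coverage as sequenced bases divided by genome size. The code below is an illustration under stated assumptions and is not the JGI pipeline. The function names shred and mean_coverage are hypothetical, and the step between shred start positions is an assumed parameter because the amount of overlap is not given in the text.

```python
def shred(consensus: str, length: int, step: int) -> list[str]:
    """Cut a consensus sequence into overlapping pseudo-reads ("shreds").

    length: shred length (2 Kbp and 1.5 Kbp are the sizes named in the text).
    step:   distance between shred starts; step < length gives an overlap of
            length - step. The step value is an assumption, not a reported one.
    """
    if length <= 0 or step <= 0 or step > length:
        raise ValueError("require 0 < step <= length")
    if not consensus:
        return []
    if len(consensus) <= length:
        return [consensus]
    starts = list(range(0, len(consensus) - length + 1, step))
    # Add a final shred so the last bases of the consensus are always covered.
    if starts[-1] + length < len(consensus):
        starts.append(len(consensus) - length)
    return [consensus[s:s + length] for s in starts]


def mean_coverage(sequenced_mbp: float, genome_mbp: float) -> float:
    """Average fold coverage = total sequenced bases / genome size."""
    return sequenced_mbp / genome_mbp


# Toy consensus (letters stand in for bases), shred length 4, step 2.
print(shred("ABCDEFGHIJK", 4, 2))  # ['ABCD', 'CDEF', 'EFGH', 'GHIJ', 'HIJK']

# Coverage values reported in the text, using the rounded 6.9 Mbp genome size.
print(round(mean_coverage(57.3, 6.9), 1))  # 8.3
print(round(mean_coverage(5345, 6.9), 1))  # 774.6
```

Using the rounded genome size of 6.9 Mbp reproduces both reported coverage values.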

Genome annotation

Genes were identified using Prodigal [38] as part of the DOE-JGI genome annotation pipeline, followed by a round of manual curation using the JGI GenePRIMP pipeline [39]. The predicted CDSs were translated and used to search the National Center for Biotechnology Information (NCBI) nonredundant database and the UniProt, TIGRFam, Pfam, PRIAM, KEGG, COG, and InterPro databases. These data sources were combined to assert a product description for each predicted protein. Non-coding genes and miscellaneous features were predicted using tRNAscan-SE [40], RNAmmer [41], Rfam [42], TMHMM [43], and SignalP [44]. Additional gene prediction analyses and functional annotation were performed within the Integrated Microbial Genomes (IMG-ER) platform [45,46].

Genome properties

The genome is 6,903,379 nucleotides in length with 60.94% GC content (Table 4 and Figures 3a–3f) and comprises 6 replicons. Of a total of 6,798 genes, 6,709 were protein-encoding genes and 89 were RNA-only-encoding genes. Within the genome, 206 pseudogenes were also identified. The majority of genes (79.52%) were assigned a putative function, whilst the remaining genes were annotated as hypothetical. The distribution of genes into COG functional categories is presented in Table 5.
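The summary statistics above follow from simple counts. As a minimal sketch, GC content is the fraction of G and C among the unambiguous bases, and the gene totals can be checked for internal consistency. The function name gc_percent and the toy sequence are illustrative assumptions; the gene counts are those reported in the text.

```python
def gc_percent(sequence: str) -> float:
    """GC content as a percentage of unambiguous (A, C, G, T) bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / acgt


# Toy sequence (hypothetical, not WSM1689 data): 2 of 4 bases are G or C.
print(gc_percent("GCAT"))  # 50.0

# Gene counts reported in the text.
total_genes, protein_genes, rna_genes = 6798, 6709, 89
assert protein_genes + rna_genes == total_genes

# Share of genes that are protein coding or RNA only.
print(round(100 * protein_genes / total_genes, 2))  # 98.69
print(round(100 * rna_genes / total_genes, 2))      # 1.31
```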

Figure 3a. Graphical circular map of replicon WSM1689_Rleg3_Contig1814.1 of the Rhizobium leguminosarum bv. trifolii strain WSM1689 genome. From outside to the center: Genes on forward strand (color by COG categories as denoted by the IMG platform), Genes on reverse strand (color by COG categories), RNA genes (tRNAs green, sRNAs red, other RNAs black), GC content, GC skew.

Figure 3b. Graphical circular map of replicon WSM1689_Rleg3_Contig1813.2 of the Rhizobium leguminosarum bv. trifolii strain WSM1689 genome. From outside to the center: Genes on forward strand (color by COG categories as denoted by the IMG platform), Genes on reverse strand (color by COG categories), RNA genes (tRNAs green, sRNAs red, other RNAs black), GC content, GC skew.

Figure 3c. Graphical circular map of replicon WSM1689_Rleg3_Contig1812.3 of the Rhizobium leguminosarum bv. trifolii strain WSM1689 genome. From outside to the center: Genes on forward strand (color by COG categories as denoted by the IMG platform), Genes on reverse strand (color by COG categories), RNA genes (tRNAs green, sRNAs red, other RNAs black), GC content, GC skew.

Figure 3d. Graphical circular map of replicon WSM1689_Rleg3_Contig1810.5 of the Rhizobium leguminosarum bv. trifolii strain WSM1689 genome. From outside to the center: Genes on forward strand (color by COG categories as denoted by the IMG platform), Genes on reverse strand (color by COG categories), RNA genes (tRNAs green, sRNAs red, other RNAs black), GC content, GC skew.

Figure 3e. Graphical circular map of replicon WSM1689_Rleg3_Contig1811.4 of the Rhizobium leguminosarum bv. trifolii strain WSM1689 genome. From outside to the center: Genes on forward strand (color by COG categories as denoted by the IMG platform), Genes on reverse strand (color by COG categories), RNA genes (tRNAs green, sRNAs red, other RNAs black), GC content, GC skew.

Figure 3f. Graphical circular map of replicon WSM1689_Rleg3_Contig1809.6 of the Rhizobium leguminosarum bv. trifolii strain WSM1689 genome. From outside to the center: Genes on forward strand (color by COG categories as denoted by the IMG platform), Genes on reverse strand (color by COG categories), RNA genes (tRNAs green, sRNAs red, other RNAs black), GC content, GC skew.

Table 4. Genome Statistics for Rhizobium leguminosarum bv. trifolii strain WSM1689.
Table 5. Number of protein coding genes of Rhizobium leguminosarum bv. trifolii strain WSM1689 associated with the general COG functional categories.