Background

The epsilon (ε-) proteobacteria comprise one of the five Classes within the phylum Proteobacteria [14]. These bacteria inhabit a wide variety of ecological niches ranging from gastrointestinal tracts of animals to water reservoirs, sewage, oil-field communities and deep-sea hydrothermal vents [2, 5–10]. Recent studies show that ε-proteobacteria comprise a significant proportion of the microbial population in deep-sea hydrothermal vents where, because of their ability to carry out different types of metabolism using a variety of alternate electron donors (e.g. H2, formate, elemental sulfur, sulfide, thiosulfate) and acceptors (e.g. sulfite, elemental sulfur, nitrate), they play an important role in carbon, nitrogen and sulfur cycles [7, 9–13]. A great deal of interest in these bacteria stems from the fact that many of these species are host-associated (Helicobacter, Campylobacter, Wolinella) and comprise important human and animal pathogens [14–16]. Of these bacteria, Helicobacter pylori is the causative agent for gastric and peptic ulcers [17, 18] and infections with this and the related species H. hepaticus are important predisposing factors in gastric cancers in humans and liver cancers in rodents [16, 19, 20]. Campylobacter jejuni and C. coli are the most common causes of food-borne illnesses such as diarrhea worldwide [15, 21]. C. jejuni infection can also lead to the neuromuscular disease Guillain-Barré syndrome [15, 21, 22], which causes weakness and paralysis of muscles. In contrast to the pathogenic nature of Helicobacter and Campylobacter, Wolinella succinogenes is a commensal in the gastrointestinal tract of cattle and it is not known to cause any illness in either animals or humans [2, 5, 14, 23]. 
In addition to the host-associated species, many free-living members which include chemolithotrophic and autotrophic bacteria (e.g., Thiomicrospira denitrificans, Arcobacter, Caminibacter, Nautilia, Thiovulum) also form part of the ε-proteobacterial group [4, 6, 8, 10, 12, 24].

The ε-proteobacteria are presently distinguished from other bacteria based on their branching in the 16S rRNA trees [2, 4–6]. Although most of these bacteria assume a spiral shape sometime during their life cycle [5, 25] and they can also utilize a variety of electron donors and acceptors (noted above), these characteristics are not unique to this group [2, 4–6, 10]. Presently, there is no molecular or biochemical characteristic known that is unique to this group of bacteria. Within ε-proteobacteria, two main orders, Campylobacterales and Nautiliales, are presently recognized [8, 10, 12, 24]. The Campylobacterales is made up of three families, Campylobacteraceae, Helicobacteraceae and Hydrogenimonaceae, whereas the Nautiliales order is comprised of three genera (Nautilia, Lebetimonas and Caminibacter) [8, 10, 12, 24]. Except for the 16S rRNA, very little sequence information is available for species belonging to the Hydrogenimonaceae family and the Nautiliales order.

In the past few years, genomic sequences of several ε-proteobacterial species from the Campylobacterales order have become available. The completely sequenced genomes include those from: Helicobacter pylori 26695 [26], H. pylori J99 [27], H. hepaticus ATCC 51449 [28], Campylobacter jejuni NCTC 11168 [29], C. jejuni RM1221 [30], Wolinella succinogenes DSM 1740 [23] and Thiomicrospira denitrificans ATCC 33889 [31]. In addition, genomes of several Campylobacter species (viz. C. lari, C. coli, C. upsaliensis and C. fetus) and C. jejuni subsp. jejuni strains (viz. HB93-13, 84-25, CF93-6, 260.94, 11168 and 81-176) are now at assembly stage [30] and sequence information from them is available in the NCBI database. The availability of these sequences has opened new windows for discovering novel molecular characteristics that are unique to these bacteria and can be used for their diagnostics as well as for biochemical and functional studies. Earlier comparative genomic studies on ε-proteobacteria have examined a number of aspects of their gene/protein contents [14, 23, 26–30, 32–34]. Of these, the studies by Eppinger et al. [14] and Fouts et al. [30] are particularly detailed. In these works, genes/proteins that are unique to individual genomes were identified as well as genes that are commonly shared by, but not uniquely present in, a number of these bacteria (viz. H. pylori, H. hepaticus, C. jejuni and W. succinogenes). Pair-wise comparison of the gene contents of these bacteria, functional classification of their genomic inventory, synteny and co-linearity of genes in various genomes, and examples of gene losses as well as recombination were also reported [14, 30]. Additionally, Coenye and Vandamme [35] have carried out studies to identify genes that have been laterally transferred between ε-proteobacteria and other bacteria. 
However, thus far no comparative study has examined or identified genes/proteins that are uniquely found in ε-proteobacteria at different taxonomic levels. Such genes and proteins, because of their specificity, provide novel means for diagnostics and taxonomic studies [36–39] and for discovering important physiological characteristics that are unique to these bacteria.

In our recent work, we have used comparative genomics to identify a large number of signature proteins that are specific for either alpha proteobacteria [40], chlamydiae [38] or Actinobacteria [39]. In the present work, we have carried out systematic BLAST searches on all open reading frames (ORFs) in the genomes of Wolinella succinogenes DSM 1740 and Campylobacter jejuni RM1221 to identify whole genes/proteins (i.e. signature proteins) that are unique to ε-proteobacteria. These studies have led to identification of 49 genes/proteins that are uniquely present in various sequenced ε-proteobacteria (including Thiomicrospira), as well as many other proteins that are limited to certain subgroups within the Campylobacterales order. Additionally, we also describe a number of conserved indels in widely distributed proteins that are specific for either all available ε-proteobacteria or for certain subgroups among them. The identified signature proteins and indels comprise rare genetic changes that have been introduced at various stages during the evolution of Campylobacterales (ε-proteobacteria) and their species distribution patterns are supported by the branching order of these species in phylogenetic trees.

Results and discussion

These studies were undertaken to identify molecular characteristics that are uniquely shared by either all sequenced ε-proteobacteria species, or their subgroups, but which generally are not found in any other organism. Three different kinds of molecular signatures that are specific for ε-proteobacteria are described in the present work. The first of these consists of whole proteins or open reading frames (ORFs) that are uniquely found in ε-proteobacteria. The other two consist of rare genetic changes (RGCs): conserved inserts or deletions (indels) in widely distributed proteins that are specific for the ε-proteobacterial homologs, and a gene-fusion event within this group of bacteria. Brief descriptions of these molecular signatures and their evolutionary significance are given below.

Whole proteins or ORFs that are unique for the epsilon-proteobacteria (Campylobacterales order) and Helicobacteraceae family

The ε-proteobacteria-specific proteins were identified as described in the Methods section. Generally, a protein was considered to be ε-proteobacteria-specific if all significant alignments (or hits) in a PSI-BLAST search with the query protein were from ε-proteobacteria species. In a few cases, where the E values of 1 or 2 hits from other species also exhibited borderline significance, but there was a large increase in E value from the last ε-proteobacteria hit in the search to these other proteins, such proteins were also regarded as ε-proteobacteria-specific. In Table 1, we list some characteristics of 53 proteins that could be regarded as specific for most sequenced ε-proteobacteria based on these criteria. Forty-one of these 53 proteins were present in all sequenced ε-proteobacteria genomes and for them all significant alignments/hits were from this group. However, in three instances (viz. WS0216, WS0260 and WS1495) the E value for one ε-proteobacteria was just above the default threshold value (0.005) for significance. For three other proteins, WS0316, WS1874 and WS2146, 1–3 hits from other bacteria exhibited borderline significance, but there was a large jump in E values from the last ε-proteobacteria hit to these other proteins (see Table 1), indicating that these proteins are also ε-proteobacteria-specific. Eight other proteins in this Table (WS0865, WS1211, WS1235, WS1329, WS1640, WS1752, WS1771 and WS2059) are missing in 1–2 ε-proteobacteria species, which could be due to selective gene loss [33]. Of these 8 proteins, WS1211, WS1752 and WS2059 are present in almost all sequenced ε-proteobacteria except T. denitrificans. The phylogenetic position of T. denitrificans within ε-proteobacteria is presently not clear (discussed later). Hence, the absence of these proteins in T. denitrificans could be explained either by earlier divergence of this species in comparison to other sequenced ε-proteobacteria, or by gene loss.

Table 1 Proteins that are uniquely present in most epsilon proteobacteria (Campylobacterales)

For the protein WS0230 listed in Table 1, in addition to various ε-proteobacteria, homologs with very low E values (e-90 range) were also found in two δ-proteobacteria belonging to the Desulfovibrio genus. In phylogenetic trees based on 16S rRNA [2, 41], various proteins [42, 43], and in analyses based on conserved indels [44], δ-proteobacteria generally branch in close proximity to the ε-proteobacteria. Hence, the shared presence of the WS0230 homologs in the Desulfovibrio genus and ε-proteobacteria may reflect either a deep phylogenetic relationship that exists between these two groups [43–45], or it could result from lateral gene transfer [46]. Based on the available data we are unable to distinguish between these possibilities. However, it is interesting to note that a 1 aa insert in a conserved region of the RecA protein, which was previously indicated to be specific for ε-proteobacteria [44], and is present in all available ε-proteobacteria homologs, is also commonly present in Desulfovibrio and Lawsonia species (belonging to the Desulfovibrionaceae family) (results not shown).

Table 1 also lists the available information regarding possible cellular functions of these proteins. Most of these proteins are of unknown function. However, in a number of cases weak but significant similarity is observed to conserved domains found in other proteins in the databases [47], or to particular COG families [48]. Information of this kind, along with the genomic context of these ORFs, provides useful leads for exploring the cellular functions of these conserved hypothetical proteins [49–52]. Of the proteins that are found in all sequenced ε-proteobacteria, WS0266 and WS0802 were experimentally identified as plasminogen binding proteins [53]. It has been suggested that these proteins may enable these bacteria to coat their exterior surface with plasminogen and thus they could be involved in enhancing their virulence. The putative functions of several other proteins are indicated in Table 1 and they include a putative helicase (WS0086), a Cbb3 type cytochrome oxidase (WS0180), a protein related to the FixH family (WS0185) of Rhizobium, a protein (WS0316) containing the RDD domain, and two proteins (WS0476 and WS0480) which contain the molybdopterin_binding (MopB) domain found in NADH oxidoreductase I. Also found were two proteins implicated in flagellar function (WS0490 and WS0575) [23], a protein (WS0520) with a TonB domain, a protein (WS1874) containing a domain related to the DNA polymerase delta subunit, a protein (WS2146) showing some similarity to the Sua5 domain involved in binding to double stranded DNA, and a protein (WS0230) showing similarity to a deacylase domain. In addition, several proteins are predicted to be either periplasmic or membrane proteins. It should be emphasized that most of these functional predictions or annotations are based on weak similarity to conserved domains (CD) as identified by the CD search program implemented with the BLAST program [47]. 
Although this information is very useful, the actual functions of most of these proteins, which exhibit very little similarity to other molecules in the database, remain to be determined. Among the proteins listed in Table 1 that are missing in some ε-proteobacteria, WS1211 is a homolog of the C. jejuni invasion antigen (CiaB), which is recognized as an important factor in its pathogenicity [14, 54]. Of the proteins listed in Table 1, 10 proteins (WS133-WS134, WS184-WS185, WS447-WS448, WS1039-WS1040 and WS1495-WS1496) are present in clusters of two in the genome, and they could be involved in related functions [51, 52].

Several of the proteins listed in Table 1 (e.g., WS0086 and WS2123) exhibit a high degree of sequence conservation across various ε-proteobacteria species. A partial nucleotide sequence alignment for the WS0086 coding sequence for various ε-proteobacterial species is shown in Figure 1. A large number of positions in the alignment are completely conserved in various Campylobacterales species and there are several long stretches (boxed) showing a high degree of sequence conservation. PCR primers and other molecular probes based on these conserved regions could provide novel and specific means for identification of both new and existing Campylobacterales species, and possibly of other ε-proteobacteria.
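As an illustration of how candidate primer or probe regions of the kind boxed in Figure 1 might be located automatically, the following is a minimal sketch that scans a multiple sequence alignment for stretches of fully conserved, gap-free columns. The function name `conserved_runs` and the default minimum length of 18 positions are assumptions made for this example; they are not parameters reported in this work.

```python
from typing import List, Tuple


def conserved_runs(aligned: List[str], min_len: int = 18) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges (0-based, end exclusive) in which
    every aligned sequence carries the same non-gap character.

    min_len is a hypothetical cut-off for a usable primer/probe site.
    """
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    runs: List[Tuple[int, int]] = []
    start = None
    # Iterate one column past the end so that a run reaching the last
    # column is closed and reported as well.
    for col in range(width + 1):
        conserved = (
            col < width
            and aligned[0][col] != "-"
            and all(seq[col] == aligned[0][col] for seq in aligned)
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                runs.append((start, col))
            start = None
    return runs
```

For example, `conserved_runs(["ATGCCGTA", "ATGCAGTA", "ATGC-GTA"], min_len=3)` reports the two conserved blocks on either side of the variable fifth column.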

Figure 1

Partial nucleotide sequence alignment for an ε-proteobacterial specific protein WS0086. The initial part of this alignment, which is less conserved and some of which is also missing in C. lari, is not shown. The asterisks (*) denote residues that are completely conserved. A number of conserved regions that are suitable for designing PCR primers or other diagnostic probes are boxed.

The comparative analysis of W. succinogenes genome has also identified 11 proteins that are uniquely found in Wolinella and Helicobacter species (Table 2). Of these 11 proteins, the first 7 are present in all 4 of the sequenced species/strains from these genera, whereas the last 4 proteins are only found in W. succinogenes and H. hepaticus but missing in the two H. pylori strains. All of these proteins are of unknown function. The Wolinella and Helicobacter genera are part of the Helicobacteraceae family and these uniquely shared proteins provide potential molecular markers for this family.

Table 2 Proteins specific for the Wolinella and Helicobacter species (Helicobacteraceae family)

Our analysis also reveals that 99 proteins in the genome of W. succinogenes DSM 1740 show no significant similarity to any other protein in the databases [see Additional file 1]. Baar et al. [23] have previously indicated a much higher number (i.e. 490) of such proteins. However, since their analysis, genomes of several ε-proteobacteria as well as numerous other organisms have become available [28, 30, 31, 55]. Because of this, and our employment of more stringent criteria for identification of group-specific proteins, the number of such proteins is considerably smaller than indicated originally [23]. Sixteen of these proteins are present in seven clusters (WS0261-WS0262; WS0531-WS0532; WS1446-WS1447; WS1573-WS1674; WS1888-WS1889; WS2027-WS2028-WS2029; WS2032-WS2033-WS2034) in the W. succinogenes DSM 1740 genome.
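Gene clusters of the kind listed above can be recognized from locus tags alone if one assumes that consecutive tag numbers correspond to adjacent genes. The sketch below groups tags on that assumption; the function name `cluster_adjacent` is hypothetical, and the approach is only an illustration, not the procedure used to compile the list above.

```python
import re
from typing import List, Tuple


def _split_tag(tag: str) -> Tuple[str, int]:
    """Split a locus tag such as 'WS0261' into its prefix and number."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognized locus tag: {tag!r}")
    return match.group(1), int(match.group(2))


def cluster_adjacent(tags: List[str]) -> List[List[str]]:
    """Group locus tags with consecutive numbers (assumed to be adjacent
    genes) and return only groups containing two or more tags."""
    ordered = sorted(set(tags), key=_split_tag)
    clusters: List[List[str]] = []
    for tag in ordered:
        if clusters:
            prev_prefix, prev_num = _split_tag(clusters[-1][-1])
            prefix, num = _split_tag(tag)
            if prefix == prev_prefix and num == prev_num + 1:
                clusters[-1].append(tag)
                continue
        clusters.append([tag])
    return [cluster for cluster in clusters if len(cluster) > 1]
```

Applied to a list of genome-specific proteins, singletons are dropped and only runs of neighbouring ORFs are returned.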

Proteins specific for the Campylobacter genus

We have also performed BLAST searches on various proteins found in the genome of C. jejuni RM1221 to identify proteins that are unique to the Campylobacter species. Fouts et al. [30], who sequenced the genomes of several Campylobacter species/strains, have reported comparative studies on them. Their work identified several proteins that were specific for the C. jejuni RM1221 and C. jejuni NCTC 11168 strains (Supplementary Table S7 in their paper), but they did not look for proteins that were uniquely shared by either all or different Campylobacter species. Our analyses have identified 15 proteins (Table 3) that are uniquely present in all of the sequenced Campylobacter species viz. C. fetus, C. lari, C. upsaliensis, C. coli and C. jejuni (NCTC 11168, RM1221, HB93-13, 84-25, CF93-6, 260.94, 11168 and 81-176). Three additional proteins listed in Table 3, CJE0368, CJE1499 and CJE1574, are missing in only one of the Campylobacter species, which is likely due to gene loss. Eighteen other proteins (Table 4) are present in all of the Campylobacter species, except C. fetus. Among the sequenced Campylobacter species, C. fetus exhibits the deepest branching in various phylogenetic trees (see next section). Hence, the absence of these proteins in C. fetus could be explained by their introduction in a common ancestor of the other Campylobacter species after the branching of C. fetus. Ten other proteins (Table 5) are commonly present in C. upsaliensis, C. coli and C. jejuni only, indicating a closer relationship among these species. The genes for these proteins were likely introduced or evolved in a common ancestor of these three species. Likewise, the 28 other proteins listed in Table 6, which are only found in C. coli and C. jejuni (different strains), point to a specific relationship between these species to the exclusion of all others. Most of these proteins are of unknown function. 
However, in the few cases where similarity to a conserved domain present in other proteins has been identified by BLAST searches, this information is noted in the Tables.

Table 3 Proteins specific for all sequenced Campylobacter species
Table 4 Campylobacter-specific proteins that are missing in C. fetus
Table 5 Proteins uniquely found in C. jejuni, C. coli and C. upsaliensis
Table 6 Proteins unique to C. jejuni and C. coli

These analyses have also identified a large number of proteins that are specific for the C. jejuni species (Table 7). The first 5 proteins listed in this table are present in all sequenced C. jejuni strains (NCTC 11168, RM1221, HB93-13, 84-25, CF93-6, 260.94, 11168 and 81-176), whereas the remainder are missing or have been lost from a few of the strains.

Table 7 Proteins specific for Campylobacter jejuni

Conserved indels and other rare genetic changes specific for epsilon proteobacteria

Conserved indels in protein sequences provide another useful kind of molecular signature for taxonomic and diagnostic studies. In our recent work, conserved indels that are distinctive characteristics of many different groups of bacteria (e.g., Chlamydiae, Proteobacteria, alpha proteobacteria, Actinobacteria, Cyanobacteria, Deinococcus-Thermus, Aquificae, etc.) have been identified [44, 56–60]. To identify conserved indels that may be specific for ε-proteobacteria, the sequence alignments of various proteins constructed in earlier work were examined. These studies have led to identification of 4 conserved indels that are specific for this group. The characteristics of these indels and of the proteins in which they are found are briefly described below.

In Figure 2, we present sequence information for two conserved indels that are uniquely present in various sequenced ε-proteobacterial homologs, but which are not found in the corresponding proteins from any other organism. The first of these indels is a 3 aa insert in the B protein of the Uvr ABC system (Fig. 2A), which plays a key role in the nucleotide excision repair process [61]. The second indel consists of a 2 aa deletion in the enzyme phenylalanyl-tRNA synthetase (Fig. 2B), which is required for protein synthesis. Both these proteins are widely distributed in bacteria and sequence information for only representative species from other bacteria is presented. The indels in both these proteins are flanked by highly conserved regions and the unique presence of these indels in all available ε-proteobacteria homologs strongly indicates that they are distinctive molecular characteristics of these bacteria. Two additional conserved indels that are specific for only certain ε-proteobacteria are shown in Figure 3. The top panel in this Figure shows a 1 aa insert in the FtsH protease that is uniquely present in all sequenced ε-proteobacteria, except T. denitrificans. The absence of this indel in various other bacteria as well as in T. denitrificans indicates that this indel is an insert that was introduced in a common ancestor of Helicobacter, Campylobacter and Wolinella, after the branching of T. denitrificans. The lower panel in Fig. 3 shows a highly conserved insert in the β'-subunit of RNA polymerase (RpoC) that is uniquely present in various Campylobacter species, except C. fetus. RpoC homologs are present in all sequenced genomes and the identified insert is not found in any other ε-proteobacteria or other organism. This insert was likely introduced in a common ancestor of the Campylobacter species after the branching of C. fetus.

Figure 2

Partial sequence alignments of the B protein from the excinuclease ABC complex (A) and phenylalanyl-tRNA synthetase (B) showing two conserved indels that are specific for ε-Proteobacteria and not found in other organisms. The dashes (-) in the alignment show identity with the amino acid on the top line. The accession numbers of the sequences (second column) and the position of the sequence in the C. jejuni homolog (on top) are indicated. Sequence information for only representative species is shown.

Figure 3

Partial sequence alignments of the FtsH protease (A) and RNA polymerase β' subunit (B) showing two conserved indels that are specific for the indicated subgroups of ε-Proteobacteria. The dashes (-) denote identity with the amino acid on the top line. Sequence information for only representative species is shown.

In addition to these conserved indels, Zakharova et al. [62] have identified a rare genetic event that causes fusion of two different genes within certain groups of ε-proteobacteria. The two largest and highly conserved subunits of RNA polymerase (RpoB and RpoC, each approximately 1400 aa) are encoded by two distinct genes in various bacteria [62]. However, a rare genetic event has led to the fusion of these genes in Helicobacter and Wolinella species, such that RpoB and RpoC are now made as a single large polypeptide (≈ 2900 aa) (Fig. 4). In contrast, in Campylobacter and T. denitrificans, as in other bacteria, separate genes encode these proteins. This rare genetic event provides evidence of a specific relationship between Helicobacter and Wolinella species, which are part of the Helicobacteraceae family.

Figure 4

Diagrammatic representation of the arrangements of two largest subunits of RNA polymerase, i.e. β subunit (RpoB) and β' subunit (RpoC) in different bacteria. In contrast to other bacteria where these proteins are made as distinct polypeptides, in Helicobacter and Wolinella a rare genetic event has led to fusion/joining of the genes for these proteins so that they are now made as a single large polypeptide.

Evolutionary significance of the signature proteins and conserved indels

It is important to understand at what point during the evolution of ε-proteobacteria, the above-described molecular characteristics evolved or were introduced. To determine their evolutionary significance, phylogenetic trees were constructed for the sequenced ε-proteobacteria species based on 16S rRNA and a concatenated dataset of sequences for 9 highly conserved proteins (viz. RpoB, RpoC, Hsp70, Hsp60, elongation factor (EF)-Tu, EF-G, Gyrase A, Gyrase B and alanyl-tRNA synthetase). In the 16S rRNA tree, the ε-proteobacterial species under consideration formed two clades (Fig. 5A). One clade consisted of various Campylobacter species whereas the other clade included Helicobacter, Wolinella and T. denitrificans. In the latter clade, T. denitrificans formed a deep branching outgroup of the Helicobacter and Wolinella species, but a specific association of T. denitrificans to these species was not supported by the bootstrap score of the node (<50%) (Fig. 5A) [8, 12]. In contrast to the rRNA tree, in the tree based on concatenated protein sequences, all of the internal nodes were reliably resolved. In this tree, T. denitrificans formed a deep branching lineage showing no specific relationship to either the Helicobacter/Wolinella clade or to the Campylobacter species (Fig. 5B). A similar deep branching of T. denitrificans in comparison to other sequenced ε-proteobacteria is observed in phylogenetic trees based on Hsp70, RpoC, Gyrase A, Gyrase B and EF-Tu protein sequences (results not shown).
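The two data-handling steps behind a concatenated-protein tree, joining per-protein alignments into one dataset and resampling alignment columns for bootstrapping, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, each alignment is assumed to be a mapping from species name to an aligned sequence, and the tree-building program itself (which is not specified here) is outside the scope of the sketch.

```python
import random
from typing import Dict, List


def concatenate(alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Join several per-protein alignments (species -> aligned sequence)
    end to end, requiring the same species set in every alignment."""
    if not alignments:
        raise ValueError("need at least one alignment")
    species = set(alignments[0])
    for aln in alignments:
        if set(aln) != species:
            raise ValueError("every alignment must cover the same species")
        if len({len(seq) for seq in aln.values()}) != 1:
            raise ValueError("sequences within an alignment differ in length")
    return {sp: "".join(aln[sp] for aln in alignments) for sp in sorted(species)}


def bootstrap_replicate(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Build one bootstrap replicate by sampling alignment columns
    with replacement; all species share the same sampled columns."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {sp: "".join(seq[c] for c in columns) for sp, seq in alignment.items()}
```

Each replicate has the same number of positions as the original dataset; a tree is then inferred from every replicate and the fraction of replicates supporting a node gives its bootstrap score.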

Figure 5

Phylogenetic trees based on (A) 16S rRNA and (B) concatenated sequences for 9 proteins (AlaRS, Gyrase A, Gyrase B, EF-Tu, EF-G, Hsp60, Hsp70, RpoB and RpoC) containing 7919 aligned positions. The sequences were bootstrapped either 100 (A) or 500 times (B) and bootstrap scores for all nodes above 50% are shown. (C) A model depicting the evolutionary stages where different Campylobacterales- (or ε-proteobacteria) specific proteins and other RGCs were introduced.

Using the above trees as reference points, the evolutionary stages where different ε-proteobacteria-specific genes/proteins or other molecular signatures likely evolved are depicted in Fig. 5C. The genes for the first 49 proteins listed in Table 1 as well as the conserved indels in PheRS and the excinuclease B protein, which are unique to almost all sequenced ε-proteobacteria, were likely introduced in a common ancestor of the Campylobacterales or ε-proteobacteria. The genes for the last three proteins listed in Table 1 (viz. WS1211, WS1752 and WS2059) that are absent in T. denitrificans but present in all (or most) other ε-proteobacteria were likely introduced in a common ancestor of Helicobacter, Wolinella and Campylobacter after the divergence of T. denitrificans. The insert in the FtsH protease was also likely introduced at this stage. The proteins listed in Table 2 were introduced in a common ancestor of the Wolinella and Helicobacter genera, and it is expected that some of them will constitute distinctive characteristics of the Helicobacteraceae family. The rare genetic event leading to the fusion of the rpoB and rpoC genes also occurred at a similar stage. The proteins listed in Tables 3 to 7, which are unique to either all sequenced Campylobacter species or various species within this genus, were introduced at different stages in the evolution of this group (Fig. 5C). The observed species distribution patterns of these proteins strongly support the branching pattern of Campylobacter species in the phylogenetic trees (Figs. 5A and 5B). The inference from these proteins and the phylogenetic trees that C. fetus is one of the deepest branching species within the Campylobacter genus is also strongly supported by the large insert in RpoC (Fig. 3B), which is present in all Campylobacter species except C. fetus.
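The reasoning used above, placing the origin of a signature at the common ancestor of the smallest group that contains every species carrying it and treating absences within that group as possible gene losses, can be written as a short sketch. The clade names and memberships in `CLADES` are labels chosen for this example from the groupings discussed in the text, and `place_signature` is a hypothetical helper, not a program used in this work.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Nested groups as discussed in the text; names are labels for this sketch.
_JC = frozenset({"C. jejuni", "C. coli"})
_JCU = _JC | {"C. upsaliensis"}
_CAMPY_NO_FETUS = _JCU | {"C. lari"}
_CAMPY = _CAMPY_NO_FETUS | {"C. fetus"}
_HELICO = frozenset({"W. succinogenes", "H. hepaticus", "H. pylori"})
_HWC = _CAMPY | _HELICO
_ALL = _HWC | {"T. denitrificans"}

CLADES: Dict[str, FrozenSet[str]] = {
    "C. jejuni + C. coli": _JC,
    "C. jejuni + C. coli + C. upsaliensis": _JCU,
    "Campylobacter except C. fetus": _CAMPY_NO_FETUS,
    "Campylobacter": _CAMPY,
    "Helicobacteraceae": _HELICO,
    "Helicobacter + Wolinella + Campylobacter": _HWC,
    "all sequenced epsilon-proteobacteria": _ALL,
}


def place_signature(present: Iterable[str]) -> Tuple[str, List[str]]:
    """Return the smallest clade containing every species in which the
    signature is present, plus clade members that lack it (candidate losses)."""
    found = frozenset(present)
    if not found:
        raise ValueError("signature must be present in at least one species")
    candidates = [
        (len(members), name) for name, members in CLADES.items() if found <= members
    ]
    if not candidates:
        raise ValueError("species set is not covered by any clade in CLADES")
    _, name = min(candidates)
    return name, sorted(CLADES[name] - found)
```

For a protein found only in W. succinogenes and H. hepaticus, the sketch returns the Helicobacteraceae clade with H. pylori flagged as a possible loss, mirroring the interpretation given for the last four proteins of Table 2.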

Conclusion

The comparative genomics of ε-proteobacteria reported here have led to identification of a large number of molecular signatures (e.g., whole proteins, conserved indels and a gene-fusion event) that are distinctive characteristics of these bacteria. Our analyses indicate that these characteristics have been introduced at various stages in the evolution of ε-proteobacteria, but once introduced, they were generally stably retained in various descendants of these lineages with minimal gene loss or lateral gene transfer to other bacteria. Sequence information for these proteins or molecular signatures is presently available only from the Campylobacterales species and no information is available from the Nautiliales order, which comprises the other main group within ε-proteobacteria. However, genome sequencing of several ε-proteobacteria (e.g. Nautilia, Caminibacter, Arcobacter, Sulfurovum, Nitratiruptor) covering all of the main groups is currently in progress (noted in ref. [10]). Based upon our work on signature sequences for other groups of bacteria [56–59], we expect that many of the signatures identified in the present work (Table 1) will also be found in different ε-proteobacteria, whereas several others will prove to be specific for only the Campylobacterales order. The primary sequences of many of these genes/proteins are highly conserved and they provide novel diagnostic tools for these bacteria by means of PCR amplification and fluorescence in situ hybridization methods. Monoclonal and polyclonal antibodies based upon these proteins provide another means for their detection. Additionally, these Campylobacterales or ε-proteobacteria specific proteins also provide potential targets for developing therapeutics and vaccines that are specific for these bacteria. 
The identified signature proteins and RGCs also provide novel and definitive molecular means for circumscribing a number of taxonomic groups within Campylobacterales (ε-proteobacteria) and for identifying species belonging to these groups.

The cellular functions of most of the ε-proteobacteria-specific proteins are not known. Although a number of these proteins exhibit weak sequence similarity to conserved domains in other proteins, their actual functions may be quite different, and determining them constitutes an important task for the future. Likewise, it is also of much interest to understand the functional significance of the conserved indels in various proteins (viz. RpoC, PheRS, FtsH, excinuclease B) that are specific for different taxonomic groups/clades of ε-proteobacteria. Since these indels, which are located in highly conserved regions, are retained by all (available) members of these clades, it is highly likely that they are functionally important (and essential) for these bacteria. Thus, it is of much importance to understand how the functions of these proteins are modified by these indels and the physiological significance of these modifications for these bacteria. Further studies on these ε-proteobacteria specific proteins and indels thus may lead to the discovery of novel biochemical and physiological characteristics that are uniquely shared by these bacteria.

Methods

Identification of proteins that are specific for epsilon proteobacteria

To identify proteins that are specific for ε-proteobacteria, all proteins in the genome of W. succinogenes DSM 1740 [23] were analyzed. This genome was chosen for a number of reasons. First, of the sequenced ε-proteobacteria genomes, the W. succinogenes genome is among the largest (2.11 Mb) with 2043 ORFs [23]. Hence, one expects that minimal gene loss has occurred in this bacterium and that it should contain the maximal number of genes that may be present in other ε-proteobacteria. Second, phylogenetic and comparative studies have indicated that W. succinogenes forms an outgroup to various Helicobacter species and thus lies in an intermediate position between members of the Helicobacteraceae and Campylobacteraceae families [6, 14]. Thus, BLAST searches on proteins from this genome should enable us to identify proteins that are unique to the Helicobacteraceae family as well as those shared with other taxonomic groups of ε-proteobacteria. To identify proteins that are specific for the Campylobacter species, the genome of C. jejuni RM1221 was analyzed. The BLASTp searches were initially performed on each individual protein or ORF in these genomes against all available sequences in the NCBI sequence database, to identify all related genes/proteins in other organisms [63, 64]. These searches were performed using the default parameters as set by the BLAST program, except that the low complexity filter was turned off. The expected values (E-values) of different hits from these searches were inspected to identify putative ε-proteobacteria-specific proteins [38, 40]. The proteins that were of interest to us generally involved a large increase in E-values from the last ε-proteobacteria hit in the BLAST search to the first hit from any other organism. Further, the E values of these latter hits were expected to be in a range higher than 10⁻⁴, which indicates a weak level of similarity that could occur by chance. 
However, higher E-values are sometimes acceptable for smaller proteins, as the magnitude of the E-value depends upon the length of the query sequence [63]. All promising proteins identified by the above criteria were further analyzed using the position-specific iterated (PSI) BLAST program [63]. This program creates a position-specific scoring matrix from statistically significant alignments produced by the BLASTp program and then searches the database using this matrix. The PSI-BLAST program is more sensitive than the BLASTp program in identifying weak but biologically relevant sequence similarity [63]. The output of the PSI-BLAST program divides the hits into two categories: sequences producing significant alignments and those whose E-values are worse than the threshold (default value set at 0.005). For most of the proteins that are indicated to be specific for different subgroups within ε-proteobacteria, all significant alignments were from the indicated groups. In a few cases, an isolated hit from outside the group had an E-value slightly below the (arbitrarily set) threshold value, but there was a large jump in E-value from the last ε-proteobacteria hit; such proteins were also regarded as specific for the indicated groups. All of the identified group-specific proteins were also examined for the presence of any conserved domain [47], and this information was tabulated along with the genome identification number of the protein, its accession number, sequence length, etc. In the description of various proteins in the text, the "WS" and "CJE" parts of the descriptors indicate the identification numbers of the proteins in the genomes of W. succinogenes DSM 1740 and C. jejuni RM1221, respectively.
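The E-value criteria described above can be illustrated with a minimal Python sketch. It is not the procedure actually used in this work, where the hits were inspected manually. The function name, the input format (pairs of taxonomic group label and E-value) and the `min_jump_log10` parameter are assumptions made for illustration; the text does not quantify what counts as a "large" jump, so the default of five orders of magnitude is arbitrary. The allowance for smaller proteins is not modeled.

```python
import math

def is_group_specific(hits, group, outside_cutoff=1e-4, min_jump_log10=5.0):
    """Apply the two E-value criteria to one query protein (illustrative only).

    hits: list of (group_label, e_value) pairs for one BLASTp search.
    group: label of the group of interest, e.g. "epsilon".
    outside_cutoff: first outside hit must have an E-value above this (10^-4 in the text).
    min_jump_log10: hypothetical threshold for a "large" jump, in orders of magnitude.
    """
    ranked = sorted(hits, key=lambda h: h[1])  # best (lowest) E-value first
    first_out_idx = next(
        (i for i, (g, _) in enumerate(ranked) if g != group), None
    )
    if first_out_idx is None:
        # No hit outside the group at all.
        return any(g == group for g, _ in ranked)
    if first_out_idx == 0:
        # The best hit already comes from another group.
        return False
    last_in = ranked[first_out_idx - 1][1]   # last in-group hit before the first outsider
    first_out = ranked[first_out_idx][1]
    if first_out <= outside_cutoff:
        return False
    floor = 1e-300  # BLAST can report 0.0; clamp so log10 is defined
    jump = math.log10(max(first_out, floor)) - math.log10(max(last_in, floor))
    return jump >= min_jump_log10
```

For example, a query whose hits are `[("epsilon", 1e-80), ("epsilon", 1e-60), ("other", 0.5)]` passes both criteria, whereas one with an outside hit at 1e-50 does not.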

Identification of conserved indels that are specific for epsilon proteobacteria

Multiple sequence alignments for a large number of proteins have been created in our earlier work [44, 56, 60]. To search for conserved indels that might be specific for ε-proteobacteria, these alignments were visually inspected to identify any indel that was uniquely present in ε-proteobacterial species and that was flanked by conserved sequences. Indels that were not flanked by conserved regions were not considered. The specificity of these indels for ε-proteobacteria was evaluated by carrying out detailed BLAST searches on short sequence segments (usually 60–100 aa) containing the indel and the flanking conserved regions. The purpose of these BLAST searches was to obtain sequence information from all available bacterial homologs in order to determine the presence of the identified indels in various species. The sequence information for these indels was compiled into signature files such as those presented in Figures 2 and 3.
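The two conditions applied during visual inspection (an indel shared exclusively by the group, and conserved flanking sequence) can be expressed as a small Python sketch. This is only an illustration of the logic, not the method used here, which relied on manual inspection. The function names, the input format (a dictionary of equal-length aligned sequences with "-" for gaps), the strict requirement of identical flanking residues and the flank width of five columns are all assumptions.

```python
def region_state(seq, start, end):
    """Classify columns [start, end) of one aligned sequence."""
    region = seq[start:end]
    if all(c == "-" for c in region):
        return "gap"
    if "-" not in region:
        return "residues"
    return "mixed"

def indel_is_group_specific(alignment, group, start, end):
    """True if every group member shares one state (gap or residues)
    in the region and every other sequence shows the opposite state."""
    inside = {region_state(s, start, end) for n, s in alignment.items() if n in group}
    outside = {region_state(s, start, end) for n, s in alignment.items() if n not in group}
    return (
        len(inside) == 1
        and len(outside) == 1
        and "mixed" not in (inside | outside)
        and inside != outside
    )

def flank_identity(alignment, start, end, width=5):
    """Fraction of flanking columns (width on each side) that are ungapped
    and identical in all sequences; a deliberately strict conservation measure."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    cols = list(range(max(0, start - width), start)) + list(range(end, min(length, end + width)))
    if not cols:
        return 0.0
    conserved = sum(
        1 for i in cols
        if seqs[0][i] != "-" and all(s[i] == seqs[0][i] for s in seqs)
    )
    return conserved / len(cols)

# Toy alignment with made-up sequences and labels, for illustration only.
toy = {
    "Hpy": "MKVLAGGTRDE",
    "Cje": "MKVLASGTRDE",
    "Eco": "MKVLA--TRDE",
    "Bsu": "MKVLA--TRDE",
}
specific = indel_is_group_specific(toy, {"Hpy", "Cje"}, 5, 7)
flanks = flank_identity(toy, 5, 7)
```

In practice a candidate flagged in this way would still need the BLAST-based check on the 60–100 aa segment described above, since an alignment contains only a sample of the available homologs.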

Phylogenetic analysis

Phylogenetic trees for the sequenced ε-proteobacteria species were constructed based on 16S rRNA sequences as well as a number of conserved proteins (viz. RNA polymerase β subunit (RpoB), RNA polymerase β′ subunit (RpoC), DNA gyrase A subunit (GyrA), DNA gyrase B subunit (GyrB), Hsp70, Hsp60, alanyl tRNA synthetase (AlaRS), elongation factor-G (EF-G) and elongation factor-Tu (EF-Tu) proteins). The 16S rRNA and protein sequences were downloaded from the Ribosomal Database Project-II site [65] and NCBI databases, respectively, and aligned using the CLUSTALx program [66]. A bootstrapped neighbor-joining tree based on the rRNA sequences was constructed using the Jukes and Cantor method [67]. The sequences for the various proteins were concatenated into a large dataset containing 7919 aligned positions (RpoB (1440), RpoC (1559), GyrA (880), GyrB (814), Hsp70 (661), Hsp60 (552), AlaRS (912), EF-G (698) and EF-Tu (403)), and a bootstrapped neighbor-joining tree based on this dataset was constructed using Kimura's method [68]. All gaps in the sequences were omitted during the phylogenetic analyses. The trees were constructed using the PHYLIP [69] and TREECON [70] programs, and they were rooted using chlamydiae species, which form a deep-branching group in comparison to ε-proteobacteria [41–43, 45].
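As an illustration of the distance calculation underlying the rRNA tree, the following Python sketch removes gapped alignment columns and then computes the standard Jukes and Cantor distance, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites. The analyses here were carried out with PHYLIP and TREECON; this sketch does not reproduce those programs. The function names are hypothetical, and it assumes that "gaps omitted" means dropping every column containing a gap in any sequence.

```python
import math

def strip_gap_columns(seqs):
    """Drop every alignment column in which at least one sequence has a gap."""
    keep = [i for i in range(len(seqs[0])) if all(s[i] != "-" for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]

def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance between two gap-free aligned nucleotide sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        # The correction is undefined once sites are saturated.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy sequences, made up for illustration only.
aligned = ["ACGTACGT-A", "ACGTACGA-A", "ACG-ACGTTA"]
clean = strip_gap_columns(aligned)
d01 = jukes_cantor_distance(clean[0], clean[1])
d02 = jukes_cantor_distance(clean[0], clean[2])
```

A matrix of such pairwise distances is the input from which a neighbor-joining tree is built; bootstrapping repeats the calculation on resampled alignment columns.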