Background

Helicobacter pylori infections occur in approximately 50% of the human population and are associated with several inflammatory gastroduodenal diseases [1], including two types of gastric cancers: gastric adenocarcinoma [2] and gastric extra-nodal marginal zone B-cell MALT (mucosa-associated lymphoid tissue) lymphoma, first described by Isaacson et al. [3]. Evolution of this bacterial infection towards malignancy occurs in only approximately 1% of infected individuals, suggesting that both bacterial and host susceptibility factors are involved [4].

Since the discovery of H. pylori, several studies have focused on elucidating H. pylori pathogenicity mechanisms (microbial factors) that are associated with disease outcomes [5]. The cag-pathogenicity island (cag PAI) has been recognized as a major pro-inflammatory actor, but its association with MALT lymphoma strains has yet to be clearly shown [6]. The VacA vacuolating cytotoxin, thought to cause detectable alterations in gastric epithelial cells and immune cells, is also one of the most studied H. pylori virulence factors [7]. On the basis of its immunosuppressive properties demonstrated in vitro, VacA has also been suggested to play a role in H. pylori persistence [8]. Adhesion of H. pylori to gastric epithelial cells is another bacterial trait contributing to the chronic state of the infection. BabA [9], SabA [10], HopZ [11], HomB [12] and 30 outer-membrane-like paralogs recognized as adhesins or potential adhesins are encoded by the H. pylori genome [13]. Several studies have highlighted their contribution to pathogen fitness in human populations [14, 15]. Over the last twenty years, genes encoding these virulence factors have served as genotyping markers to establish correlations between these markers, alone or in combination, and clinical outcomes of H. pylori infections [16].

Few studies have been conducted in relation to gastric MALT lymphoma-associated strains. Koehler et al. reported that the vacA m2 allele predominated in MALT lymphoma-associated isolates [17]. In previous studies [18, 19] including an identical collection of H. pylori gastric MALT lymphoma strains to that used here, the authors confirmed this finding and suggested that certain combinations of genomic markers may have a predictive value for determining whether gastric MALT lymphoma develops. All these data suggest the potential role for bacterial determinism in the clinical outcome of MALT lymphoma.

So far, comparative genomics involving sequenced H. pylori genomes have been limited to five clinical isolates obtained in the West and associated with gastritis (strain 26695 [20]), peptic ulcers (strains J99 [GenBank:AE001439.1], P12 [EMBL:CP001217, EMBL:CP001218]), atrophic gastritis (HPAG1 [21]), or no known disease (strains G27 [22] and Shi470 [RefSeq:NC_010698]). However, no genome sequence of an H. pylori strain isolated from MALT lymphoma is currently available. Comparative genomics based on DNA-array analyses, first conducted by Salama et al. on 15 Caucasian isolates [23], led to the elucidation of the H. pylori core genome comprising the pool of ubiquitous H. pylori genes and strain-specific genes (non-ubiquitous). Gressmann et al. studied gene gain and loss during evolution by comparing the genomes of 56 globally representative strains of H. pylori; they reported that 25% of the genes were non-ubiquitous [24]. Through comparative genomics based on the analysis of 24 clinical isolates from various geographical origins (Western, Asian, African countries) using whole genome DNA arrays, we identified 213 non-ubiquitous or strain-specific genes [25]. In this study, we describe the distribution of these 213 non-ubiquitous genes (Additional file 1) within genomes from a large geographically homogeneous French collection of 120 well-characterized H. pylori strains associated with chronic gastritis, duodenal ulcer, intestinal metaplasia or gastric MALT lymphoma. A hierarchical clustering analysis of the DNA hybridization values identified a homogeneous phylogenetic subpopulation of strains containing all of the cag PAI minus MALT lymphoma isolates. The B38 isolate was selected as a representative of this MALT lymphoma-specific cluster. Its genome sequence was completed, fully annotated, and compared with previously sequenced and published H. pylori genomes.

Results and Discussion

Non-ubiquitous gene distribution in relation to associated diseases

Hybridization results for the 120 studied DNAs, each used as a probe against the home-made macroarrays derived from the reference strain 26695, are presented in Additional file 1 (data based on the binary presence/absence analyses) and Figure 1 (data based on the multidimensional analysis of continuous values, see Methods). Both presentations illustrate the distribution of each of the 254 genes (213 non-ubiquitous, and 41 ubiquitous, used for normalization) with respect to associated diseases. Each strain hybridization profile (Figure 1) is represented by a series of vertically aligned bar charts, whereas the horizontal lines represent each of the 254 genes. Each strain exhibited a unique profile. The most striking features were related to the distribution of the cag PAI genes: almost all H. pylori strains associated with metaplasia harbored a complete cag PAI, a result consistent with findings by Nilsson et al. [26]. However, a complete cag PAI was present in 70% of duodenal ulcer strains, and in 50% of chronic gastritis and of MALT lymphoma strains, confirming previously published findings for isolates collected in the West [27].

Figure 1

Hybridization reactions on a DNA macroarray membrane containing 254 PCR products that are representative of H. pylori strain 26695 (41 ubiquitous genes + 213 non-ubiquitous or strain-specific genes). Bacterial DNAs from 120 isolates involved in various diseases, including chronic gastritis (yellow), intestinal metaplasia (pink), duodenal ulcer (blue) and gastric MZBL (green), were tested by hybridization. Isolates are listed on the horizontal axis, and the genes tested, on the vertical axis. Clustering (Genesis software) was carried out using the continuous values from 120 heterologous hybridization experiments, where each value corresponds to log(26695) - log(heterologous strain) for each tested gene (see Methods). Colors of the line range from blue, if the gene is present, to red, if absent. The range of intermediate colors reflects the degree of hybridization and thus homology, but also the redundancy of the tested genes. This figure represents the clustering based on the complete set of 254 genes.

Hierarchical clustering of the continuous values derived from the hybridization experiments of 120 French clinical isolates presenting different disease characteristics was performed (Figure 1). This allowed us to visualize a branch clustering almost exclusively isolates associated with MALT lymphoma. Furthermore, principal component analysis allowed us to identify a combination of 48 genes (Additional file 1), which proved to be the most informative during multidimensional analysis. We then performed hierarchical clustering based on the values of these 48 genes (Figure 2). Two main branches were detected, one consisting of a distinct cluster of 20 isolates, all totally deprived of the cag PAI. Eighteen of the isolates were associated with MALT lymphoma and two with gastritis. Interestingly, none of the peptic ulcer or metaplasia isolates clustered in this branch. The second branch splits into two main clusters, one corresponding to isolates that totally or partially lack cag PAI genes mostly associated with gastritis and the other clustering isolates associated with other diseases.

Figure 2

Hybridization reactions on a DNA macroarray membrane: clustering based on the 48 most discriminatory genes identified as key combinations of variables (genes/axes) from Principal Component Analysis. These 48 genes are labeled in Additional file 1.

To clarify the genetic determinism of the MALT lymphoma strains, we selected one strain that was representative of the MALT lymphoma cag PAI minus branch and determined its genome sequence. We selected strain B38, which was isolated from a 62-year-old man suffering from MALT lymphoma. It fulfilled various requirements: i) it belonged to the hpEurope phylogenetic branch according to MLST analysis (Suerbaum, personal communication), a property that was consistent with the five Helicobacter genome sequences previously published (26695, J99, HPAG1, P12, and G27); ii) it was genetically transformable; iii) it was plasmid free, and iv) it was capable of colonizing the mouse gastric mucosa. Its vacA status was s2m2 [18].

Main features of the B38 genome

The genome of the B38 strain consists of a circular chromosome containing 1,576,758 base pairs (bp) and an average GC content of 39.2% (Figure 3). It is the smallest H. pylori genome sequenced to date (Table 1). The B38 genome sequence was first automatically and then manually annotated using the MaGe system [28] (http://www.genoscope.cns.fr/agc/mage) and was then compared with the other sequenced H. pylori genomes. It contains 1,528 CDSs with a coding density (85.0%) similar to that found in the other Helicobacter sequenced strains. Among the 1,528 CDSs, 1,393 were predicted to be protein-coding genes (complete CDSs) with an average length of 971 bp; 135 correspond to partial CDSs, of which 133 are pseudogenes (i.e. 133 fragments representing 62 genes) and two are remnant genes (corresponding to truncated genes for which we cannot find the missing sections in close proximity) (Table 1).

Table 1 Summary of comparative features of Helicobacter genomes
Figure 3

Genome map of Helicobacter pylori strain B38. From outside to inside: -GC skew (window 2500, step 500) in blue. -Total CDSs (green) with pseudogenes/partial genes (purple). -CDSs coding for hypothetical restriction/modification systems (purple), phage proteins (orange), or insertion sequences (ISHp609) (green). -Total CDSs according to the matrix defined for gene identification (matrix n°1 in red, matrix n°2 in black, matrix n°3 in green). -RNA (rRNA in green, tRNA in purple and misc_RNA in red). -Rule. -GC% (window 5000, step 2000) in yellow. Red arrow indicates the position of the origin of replication.

Of the 1,528 annotated CDSs, a function was assigned to 989 CDSs (64.7%). For 784 of them (79.3%), a function was experimentally demonstrated either in the Helicobacter species (188, 12.3%) or in another organism (596, 39%). Two hundred and five CDSs (20.7% of 989) received a function based on the presence of a conserved amino acid motif, a structural feature, or limited homology. A total of 378 CDSs have homologs in previously reported sequences of the genus Helicobacter (43.6% of 378), in the epsilon proteobacteria (35.2% of 378), or in other distant bacteria (21.2% of 378). Protein function classification based on the Clusters of Orthologous Groups (COG) database allowed us to place 1,189 of the 1,528 CDSs (77.81%) in at least one of the COG functional groups (Table 2): 454 were assigned to cellular processes and signaling systems, 342 to information storage and processing, while 595 were involved in metabolism. The B38 genome exhibits the highest percentage of CDSs associated with a COG group (77.97% vs 73.38% for 26695, 76.48% for J99, 76.15% for HPAG1, and 73.49% for Shi470), with the number of CDSs involved in defense mechanisms slightly higher than in the other sequenced Helicobacter strains.
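The percentages in this paragraph rest on two different denominators: all 1,528 CDSs for some, and the 989 CDSs with an assigned function for others. The short Python sketch below recomputes them from the counts given in the text. The variable names and the helper `pct` are illustrative only and are not part of the annotation pipeline.

```python
# Counts taken from the paragraph above; names are illustrative only.
TOTAL_CDS = 1528                 # all annotated CDSs
WITH_FUNCTION = 989              # CDSs with an assigned function
EXPERIMENTAL = 784               # function demonstrated experimentally
EXPERIMENTAL_HELICOBACTER = 188  # ... in Helicobacter species
EXPERIMENTAL_OTHER = 596         # ... in another organism
MOTIF_OR_HOMOLOGY = 205          # function inferred from motif/structure/homology
IN_COG = 1189                    # CDSs placed in at least one COG group


def pct(part, whole, digits=1):
    """Return `part` as a percentage of `whole`, rounded to `digits` decimals."""
    return round(100.0 * part / whole, digits)


# Each entry records which denominator the reported percentage uses.
summary = [
    ("function assigned / all CDSs", pct(WITH_FUNCTION, TOTAL_CDS)),
    ("experimental / function assigned", pct(EXPERIMENTAL, WITH_FUNCTION)),
    ("experimental in Helicobacter / all CDSs", pct(EXPERIMENTAL_HELICOBACTER, TOTAL_CDS)),
    ("experimental in other organism / all CDSs", pct(EXPERIMENTAL_OTHER, TOTAL_CDS)),
    ("motif or homology / function assigned", pct(MOTIF_OR_HOMOLOGY, WITH_FUNCTION)),
    ("in a COG group / all CDSs", pct(IN_COG, TOTAL_CDS, digits=2)),
]

for label, value in summary:
    print(f"{label}: {value}%")
```

The two experimental subsets add up to the experimental total (188 + 596 = 784), and the experimental and inferred sets add up to the 989 CDSs with an assigned function.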

Table 2 Automatic distribution of protein functions, based on the COG classification, between Helicobacter strains

There are a significant number of restriction/modification systems present in H. pylori; their composition and activity have been shown to be strain-specific [29]. In the B38 strain, 63 CDSs were involved in restriction/modification systems. Among them, 30 elements were fragmented into pseudogenes corresponding to 12 potential genes, and three elements appeared to be partial genes (Additional file 2). Thus, the proportion of potentially active genes (52%) appeared to be higher in B38 than in strains J99 and 26695, in which only 30% of type II R-M systems were reported to be functional [30].

The B38 genome harbors five complete copies of the four-gene insertion sequence ISHp609. This insertion sequence was frequently found in H. pylori strains from Europe, the Americas, India and Africa, but was almost always absent in strains from East Asia [31]. Three of the four genes (orf1, orf2, ORFA) showed 100% identity across the five B38 ISHp609 copies, whereas ORFB from one of the five B38 ISHp609 copies (HELPY1334) exhibited a single mutation. Among the sequenced genomes (Table 1), a single and complete copy of this element was found in strain HPAG1, but it differed slightly from that found in B38 (6, 8, and 9 mutations are present in orf1, ORFA, and ORFB of HPAG1, respectively). This consistency among the five copies of ISHp609 in B38 indicates that the element has been acquired very recently, and that it is probably an active element that is capable of transposition, a property never experimentally demonstrated for a transposable element in H. pylori.

Another property associated with the B38 genome relates to the complete absence of four of the 45 genes encoding outer membrane proteins (OMPs) from the four conserved OMP families (Hop, Hor, Hof and Hom) (Additional file 3). B38 lacks babB, babC, sabB, and homB, four OMPs known to play a major role in adhesion to gastric epithelial cells and possibly in long-term persistence of strains in the human gastric mucosa when associated with peptic ulcer diseases or gastric metaplasia [32]. Compared with the other sequenced genomes, B38 thus lacks a high number of adhesin genes.

Comparative genomics and genome evolution

We then analyzed the genomic rearrangements through pair-wise genomic synteny comparisons between B38 and the eight published Helicobacteriaceae genomes. For five of the isolates (namely, 26695, J99, G27, P12, HPAG1), we confirmed the previously reported relative colinearity of the H. pylori genomes. This colinearity is mainly interrupted by insertion elements, the cag PAI, and genes encoding hypothetical proteins [33]. Unexpectedly, however, conserved synteny highlighted an almost complete colinearity, never described so far, between B38 and Shi470 (Figure 4). Shi470 is a clinical isolate from the gastric antrum of an Amerindian resident of a remote Amazonian village in Peru, and was thought to be related to strains from East Asia [RefSeq:NC_010698]. This unexpected absence of major genomic rearrangements between the two genomes prompted us to compare the genomes of these two isolates more closely, as a way of better understanding H. pylori genome evolution. B38 lacks 174 Shi470 genes, of which 70 genes cluster in three insertion blocks: one corresponds to the well characterized cag PAI; another to a block of 33 CDSs, mainly remnants from a conjugative plasmid (presence of TraG, VirB11, topoisomerase I, ComB3, homologs of conjugal plasmid transfer system); and the third corresponds to a block that includes 7 CDSs encoding hypothetical proteins, as well as one CDS encoding an exodeoxyribonuclease subunit which is unique to the Shi470 isolate.

Figure 4

Synteny lineplot pair-wise analyses between B38 and the H. pylori strains 26695, J99, HPAG1, Shi470, P12, G27, Helicobacter hepaticus, or Helicobacter acinonychis.

Conversely, loss of synteny was also due to the presence of 110 CDSs in B38 that were not present in Shi470. Forty-three of these CDSs appeared as clusters within eight loci. Twenty corresponded to ISHp609 (5 complete and conserved copies of ISHp609 each comprising orf1, orf2, ORFA and ORFB) [31], which interrupts HELPY0571, HELPY0700 (both encoding restriction/modification systems), HELPY0838 (encoding a putative Rad50 ATPase), HELPY1330 (encoding a putative glycosyl-transferase), and HELPY1529 (a HAC prophage II protein homolog). In addition to these five ISHp609 insertions, loss of synteny was also due to the presence of CDSs in four other loci: i) a cluster of seven genes (HELPY1520 to HELPY1525 and HELPY1527, HELPY1528 to HELPY1533) encoding HacII prophage-like proteins similar to those found in H. acinonychis strain Sheeba [34]; however, the size of the prophage is much larger (32 CDSs) in this species, suggesting that the prophage in B38 has been partially deleted, possibly following the insertion of one copy of ISHp609; ii) a cluster of six genes encoding hypothetical proteins of unknown function (HELPY0051 to HELPY0056); iii) a cluster of three CDSs that are absent in Shi470, HPAG1, J99, P12, and G27, but present in strain 26695, of which two encode alginate-O-acetylation proteins (HELPY0497-498); iv) a cluster of seven CDSs that encode a putative helicase (HELPY0989) and a putative serine kinase (HELPY0990), two functional proteins not found in any of the other sequenced strains.

H. pylori core genome and strain-specific genes

BLAST score ratio analyses and comparisons between the B38 strain and the six other sequenced genomes, which were analyzed and revised through the MaGe system (Table 1), allowed us to establish that the core of the H. pylori genome consists of 1,275 CDSs. This number is slightly higher than that recently published by McClain and colleagues, who identified 1,237 genes, as it takes into account additional CDSs detected by the MaGe system [35]. It is lower than that calculated from data presented in Additional file 1 (1,358 genes) based on the macroarray hybridization analysis of 120 isolates. The macroarray approach overestimated the number of ubiquitous CDSs, as all small CDSs (<350 bp) from the 26695 strain genome were excluded from the analysis, and thus were systematically counted as ubiquitous CDSs.

To identify strain-specific genes present in the B38 strain but absent from the other sequenced strains, we studied the putative orthologous relationship between two genomes, i.e. gene pairs that satisfy Bi-directional Best Hit (BBH) criteria. Criteria included a minimum of 30% sequence identity and 80% of the length of the smallest protein (Additional file 4). Only 16 CDSs were found to be unique to the B38 strain: nine seemed to be complete and thus putatively functional; six were shown to encode the putative HacII prophage-like proteins (HELPY1521-1522-1523-1524-1525-1527); three were found to encode hypothetical proteins (HELPY0409, HELPY0645 and HELPY0996), and seven corresponded to fragments of genes (partial genes) coding for either conserved hypothetical proteins, prophage-like sequences or for a restriction enzyme. Using the same methodology, we looked for genes that were present in the various H. pylori strains and absent in B38 (Additional file 5). In pair-wise comparisons, the number of CDSs absent in B38 was between 105 and 175. The only genes that were found to be exclusively absent in B38 corresponded to those of the cag PAI (Additional file 5), the well-known cluster of genes involved in the induction of a strong inflammatory response.
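The BBH test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MaGe implementation: the hit record layout, the use of the alignment score to rank hits, and the reading of "80% of the length of the smallest protein" as "the alignment covers at least 80% of the shorter protein" are all assumptions, and the gene names in the toy data are hypothetical.

```python
def passes_criteria(hit, lengths, min_identity=30.0, min_coverage=0.80):
    """Apply the identity and length criteria to one alignment hit.

    Assumption: the alignment must span at least 80% of the shorter protein.
    """
    shorter = min(lengths[hit["query"]], lengths[hit["subject"]])
    return hit["identity"] >= min_identity and hit["aln_len"] >= min_coverage * shorter


def best_hits(hits, lengths):
    """Map each query to its highest-scoring subject among hits passing the criteria."""
    best = {}
    for hit in hits:
        if not passes_criteria(hit, lengths):
            continue
        current = best.get(hit["query"])
        if current is None or hit["score"] > current["score"]:
            best[hit["query"]] = hit
    return {query: hit["subject"] for query, hit in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a, lengths):
    """Return gene pairs (a, b) that are each other's best hit."""
    forward = best_hits(a_vs_b, lengths)
    reverse = best_hits(b_vs_a, lengths)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


def strain_specific(genes_a, pairs):
    """Genes of genome A with no BBH partner in genome B."""
    matched = {a for a, _ in pairs}
    return sorted(g for g in genes_a if g not in matched)


# Toy data with hypothetical gene names: protein lengths and alignment hits.
lengths = {"a1": 100, "a2": 200, "a3": 150, "b1": 100, "b2": 210}
a_vs_b = [
    {"query": "a1", "subject": "b1", "identity": 95.0, "aln_len": 98, "score": 200},
    {"query": "a1", "subject": "b2", "identity": 35.0, "aln_len": 85, "score": 60},
    {"query": "a2", "subject": "b2", "identity": 88.0, "aln_len": 190, "score": 380},
    {"query": "a3", "subject": "b1", "identity": 25.0, "aln_len": 120, "score": 40},
]
b_vs_a = [
    {"query": "b1", "subject": "a1", "identity": 95.0, "aln_len": 98, "score": 200},
    {"query": "b2", "subject": "a2", "identity": 88.0, "aln_len": 190, "score": 380},
]

pairs = bidirectional_best_hits(a_vs_b, b_vs_a, lengths)
unique_to_a = strain_specific(["a1", "a2", "a3"], pairs)
print(sorted(pairs), unique_to_a)
```

In the toy data, `a3` has no partner because its only hit falls below 30% identity, so it is reported as specific to genome A; repeating the comparison against every other genome and intersecting the results would give the genes unique to one strain.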

Specific properties associated with the genomes of strains belonging to the MALT lymphoma PAI minus cluster

Of the 19 strains belonging to the MALT lymphoma PAI minus cluster, all 19 contained the vacA m2 allele; 16 exhibited an s2m2 genotype, indicating that they encode a non-functional cytotoxin, and three exhibited an s1m2 genotype [18]. We then investigated whether the properties found to be unique to strain B38 are shared by the other strains of the MALT lymphoma PAI minus cluster. The presence of the HacII-like prophage was assessed by hybridization using internal fragments of HELPY1521, HELPY1525, and HELPY1526 as probes. Four of the 19 strains (21%, including B38) of the MALT lymphoma PAI minus cluster contained HacII prophage-like sequences. By contrast, 1/24 (4%) cag PAI-containing strains isolated from patients with MALT lymphoma, 2/33 (6%) strains from patients suffering from gastritis and 2/27 (7.4%) strains from those with duodenal ulcers contained HacII prophage-like sequences. Furthermore, the two adjacent genes HELPY0989 and HELPY0990, which encode a helicase and a serine kinase, respectively, and which were not previously found as functional proteins in the other sequenced genomes, were found in three of the 19 strains (16%) of the B38 cluster. These two genes were not detected in the other MALT lymphoma strains (cag PAI positive), nor within the 22 isolates associated with gastritis and peptic ulcers. Finally, three clustered conservative mutations in glmM (HELPY0072 - Ala332, Leu333), leading to the absence of amplification of the 294-bp internal fragment of the phosphoglucosamine mutase-encoding gene [36], were observed in five of the 19 MALT lymphoma PAI minus isolates (26%). However, these mutations were not found in any of the other clinical isolates of this study, nor were they found in more than 400 H. pylori isolates associated with gastritis, peptic ulcers or metaplasia that were tested with identical oligonucleotides (personal data). These conservative mutations may be indicative of a selective pressure to maintain them, together with a property encoded by a gene present in close proximity to glmM, a property that has yet to be identified. Thus, although none of the unique properties of B38 were shared by all MALT strains of the cluster, characterizing a cagPAI minus isolate containing either glmM mutations or HELPY0989-0990 genes may be predictive of MALT lymphoma, as these two characteristics were found exclusively among the strains of this cluster.

Conclusion

The study was initiated with the aim of gaining insight into the existence of bacterial determinism for gastric extra-nodal marginal zone B-cell MALT lymphoma. DNA hybridization against the whole genome of 120 clinical isolates revealed a cluster of 19 H. pylori strains, all completely deprived of cag PAI sequences, originating from patients with MALT lymphoma. We sequenced the genome of strain B38, a representative of this cluster, and describe the first genome sequence of a cag PAI minus H. pylori strain. The absence of the cag PAI, together with that of several non-ubiquitous genes, makes the B38 genome the smallest H. pylori genome described to date. The cagPAI minus B38 strain lacks a functional cytotoxin (vacA s2m2) as well as genes encoding the major adhesion factors (absence of babB, babC, sabB, and homB); thus, compared with well-known pro-inflammatory H. pylori isolates, it appears to be deprived of all known pathogenic determinants, but is nonetheless associated with gastric neoplasia. Further investigation is required to fully understand the difference in fitness between these strains with low pro-inflammatory profiles and the human host factors that may play a significant role in the development of gastric MALT lymphoma.

Methods

H. pylori strains and growth

We examined 120 H. pylori strains isolated from patients from different areas of France enrolled in 3 multi-center studies carried out by 1) the Groupe d'Etude Français des Helicobacter (G.E.F.H.), 2) the Groupe d'Etude Français des Lymphomes Digestifs (G.E.L.D.) [37] and the Fédération Française de Cancérologie Digestive (F.F.C.D.) [38], and 3) the Groupe d'Etude des Lymphomes de l'Adulte (G.E.L.A.). Patient inclusion criteria were age (>55 years) and a diagnosis of chronic gastritis (n = 33), duodenal ulcer without intestinal metaplasia (n = 27), or intestinal metaplasia without ulcer (n = 17). We identified 43 strains from patients with gastric MALT lymphoma. H. pylori was isolated from one biopsy specimen following biopsy homogenization and culture under microaerophilic conditions (5-6% O2, 8-10% CO2, 80-85% N2) on blood agar medium (BA; Oxoid blood agar base N°2) supplemented with 10% horse blood, as reported previously [39]. One colony was selected at random from each primary culture; it was then sub-cultured and used to prepare chromosomal DNA. This DNA was extracted from 48-hour-old confluent cells using the QIAamp Tissue kit (Qiagen, Chatsworth, CA) according to the manufacturer's recommendations.

In-house DNA macroarray membrane preparation

A total of 254 PCR products were amplified in four 96-well microtiter plates, corresponding to 41 ubiquitous and 213 non-ubiquitous genes from the genome of strain 26695 as previously described [39]. Briefly, amplification reactions were performed in 2 × 100 μl reaction volumes, in which 2 μl of DNA corresponding to the recombinant plasmid containing the full-length CDS (CoDing Sequence) inserted into the pILL570-derivative vector was used as template. Each PCR product was sequenced to confirm the identity of the gene, and was then spotted in triplicate onto a nylon membrane (Qfilter, Genetix 22.2 × 22.2 cm, N+) using a Qpix robot (Genetix). Denatured 26695 genomic DNA was spotted in triplicate at the four corners of the membrane (positive controls) and seven squares were left empty as negative controls. Following spot deposition, membranes were fixed for 15 minutes in 0.5 M NaOH, 1.5 M NaCl, washed briefly in distilled water, and stored wet at -20°C until use [39].

Aliquots of 250 μl of DNA were labeled by random priming with 2 μl of 33P-dCTP. Labeling was performed for 3 hours at room temperature. Unincorporated radionucleotides were removed by purification on Quick Spin Sephadex G-25 columns (Roche Diagnostics). Immediately before being used for hybridization experiments, the sonicated, labeled, and purified chromosomal DNA was heat-denatured and cooled on ice. Hybridization was conducted in 5 ml prewarmed (65°C) hybridization mixtures containing the heat-denatured probe, with overnight incubation. Membranes were then washed and exposed for 25 hours to a phosphoimager screen (Molecular Dynamics).

Screens were scanned on a Storm 860 machine (Molecular Dynamics). Image analysis and quantification of hybridization intensities for each spot were performed using the Xdots Reader program (COSE) and determined in pixels [39]. The intensity of the background surrounding each spot was subtracted from that of the spot. Twenty-one homologous hybridizations were performed. The average intensity of the 41 ubiquitous genes was calculated for each reference array. This number served to allocate a reference array to each heterologous hybridization (the averages of the ubiquitous spots from the heterologous and the homologous reference hybridizations were not significantly different, Student's test) and to calculate the ratio used for normalization. Following normalization, the data were analyzed by attributing a binary score (presence/absence - Additional file 1) or by multidimensional analysis based on continuous intensity values (Figure 1 and Figure 2). To define the cutoff ratio for the presence/absence of a gene, we analyzed the results for the sequenced H. pylori J99 DNA hybridized with H. pylori 26695; the threshold for the presence of a gene was defined as >0.25. The multidimensional analyses (Genesis software) for the hierarchical clustering as well as for the Principal Component Analysis were performed using the 254 continuous values from the 120 heterologous hybridization experiments, each corresponding to (log10 of the normalized intensity value of strain 26695) minus (log10 of the normalized intensity value of the heterologous strain), i.e. log(26695) - log(heterologous strain).
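The scoring steps above can be illustrated with a short Python sketch. It is a simplified reading of the procedure rather than the code actually used: it assumes that each array is scaled by the mean intensity of its ubiquitous spots, that the 0.25 cutoff applies to the heterologous/reference ratio of normalized intensities, and that very low intensities are floored before taking logarithms. Gene names and intensities are invented for the example.

```python
import math

PRESENCE_CUTOFF = 0.25  # threshold stated in the text


def normalize(intensities, ubiquitous):
    """Scale background-subtracted intensities by the mean of the ubiquitous spots.

    Assumption: this scaling stands in for the normalization ratio described above.
    """
    mean_ubiquitous = sum(intensities[g] for g in ubiquitous) / len(ubiquitous)
    return {gene: value / mean_ubiquitous for gene, value in intensities.items()}


def compare(reference, heterologous, ubiquitous, floor=1e-6):
    """Return, per gene, the ratio, the binary call and the log10 difference."""
    ref = normalize(reference, ubiquitous)
    het = normalize(heterologous, ubiquitous)
    results = {}
    for gene in ref:
        r = max(ref[gene], floor)  # floor avoids log10(0) for empty spots
        h = max(het[gene], floor)
        ratio = h / r
        results[gene] = {
            "ratio": ratio,
            "present": ratio > PRESENCE_CUTOFF,
            # continuous value used for clustering: log(26695) - log(heterologous)
            "log_diff": math.log10(r) - math.log10(h),
        }
    return results


# Invented intensities (pixels) for two ubiquitous spots and two test genes.
reference = {"ubi1": 1000.0, "ubi2": 1000.0, "geneA": 800.0, "geneB": 900.0}
heterologous = {"ubi1": 500.0, "ubi2": 500.0, "geneA": 400.0, "geneB": 45.0}

scores = compare(reference, heterologous, ["ubi1", "ubi2"])
for gene, score in scores.items():
    print(gene, score["present"], round(score["log_diff"], 3))
```

With these invented values, `geneA` hybridizes as strongly as in the reference once the weaker overall signal is normalized out, whereas `geneB` falls to one tenth of the reference and is called absent.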

Sequencing and annotation of the B38 genome

Genomic DNA was randomly sheared by nebulization (HydroShear, GeneMachines) and the ends were enzymatically repaired. Sma I fragments (1.5-4 kb) were inserted into plasmid vector pBAM3/Sma I (derived from pBluescript KS and constructed by R. Heilig). Large (35-45 kb) DNA fragments generated from partial BamH I-restriction were inserted into the cosmid vector pHC79/BamH I.

Plasmid DNA was prepared with the TempliPhi DNA sequencing template amplification kit (GE Healthcare-Bio-Sciences). Cosmid DNA was purified with the Montage BAC Miniprep96 Kit (Millipore). Sequencing reactions were performed from both ends of DNA templates using ABI PRISM BigDye Terminator cycle sequencing ready reactions kits and were run on a 3700 or a 3730 xl Genetic Analyzer (Applied Biosystems).

Sequence data base calling was carried out using Phred [40]. Sequences not meeting our production quality criteria (at least 100 bases called with a quality over 20) were discarded. Sequences were screened against plasmid vector and E. coli sequences. The traces were assembled using Phrap and Consed [41]. Whole genome shotgun sequencing was performed to ensure approximately 11-fold coverage. Autofinish [42] was used to design primers for improving regions of low quality sequence and for primer walking along templates spanning the gaps between contigs. Several strategies were used to orientate contigs and to enable directed PCR-based approaches to span the gaps between contigs. These strategies included linking isolates and a Blast-based approach, which identified contigs with hits to the H. pylori strain 26695 genome. Various combined PCR techniques were used to amplify genomic or cosmid DNA, to close the gaps between the final contigs. Outward-directed primers were designed for each of the contig ends; the primer sequences were subsequently checked and confirmed to be unique to the genome. This combined PCR process required approximately 200 PCR reactions pairing each of the primers. In addition, two cosmid isolates, each containing one rDNA operon copy, were completely sequenced by sub-cloning into a pSMART-LC vector (Lucigen Corp.). The error rate was less than 1 error per 10,000 bp in the final assembly. The complete genome sequence was obtained from 40,153 sequences, resulting in 14-fold coverage.
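The read-quality criterion and the coverage figures can be illustrated as follows. This Python sketch is an assumption-laden illustration, not the production pipeline: it reads "a quality over 20" as a Phred score strictly greater than 20, and it uses the usual approximation coverage = number of reads × mean read length / genome size. Function names are hypothetical.

```python
GENOME_SIZE = 1_576_758  # bp, B38 chromosome (from the text)
N_READS = 40_153         # sequences in the final assembly (from the text)


def passes_quality(quals, min_quality=20, min_bases=100):
    """Keep a read if at least `min_bases` bases have a Phred score above `min_quality`."""
    return sum(1 for q in quals if q > min_quality) >= min_bases


def fold_coverage(n_reads, mean_read_length, genome_size):
    """Approximate fold coverage from read count and mean usable read length."""
    return n_reads * mean_read_length / genome_size


def implied_mean_read_length(n_reads, coverage, genome_size):
    """Mean usable read length implied by a given fold coverage."""
    return coverage * genome_size / n_reads


good_read = [30] * 100             # 100 bases above Q20: kept
poor_read = [30] * 99 + [20] * 50  # only 99 bases above Q20: discarded
print(passes_quality(good_read), passes_quality(poor_read))

# Under the approximation above, 14-fold coverage from 40,153 reads
# corresponds to a mean usable read length of roughly 550 bp.
mean_len = implied_mean_read_length(N_READS, 14, GENOME_SIZE)
print(round(mean_len))
```

Because the reported coverage is rounded, the implied read length is only indicative.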

AMIGene software was used to predict which CDSs were likely to encode proteins [43]. The set of predicted genes underwent automatic functional annotation using the set of tools listed in Vallenet et al. [28]. All these data (syntactic and functional annotations, results of comparative analysis) are stored in a relational database, called PyloriScope. Manual validation of the automatic annotation was performed using the MaGe (Magnifying Genomes, http://www.genoscope.cns.fr) web-based interface, which allows graphic visualization of the annotations enhanced by the synchronized representation of synteny groups in other genomes chosen for comparison.

Accession Numbers

The EMBL Nucleotide Sequence Database http://www.ebi.ac.uk/embl accession number for the H. pylori strain B38 chromosome is [EMBL:FM991728].

All data and comparative genomics concerning the H. pylori B38 genome are stored in PyloriScope (http://www.genoscope.cns.fr/agc/mage), a relational database that is available to the public.