Background

Mycobacteria are Gram-positive, acid-fast, pleomorphic, non-motile rods belonging to the order Actinomycetales. Mycobacterium avium complex organisms consist of the human and animal pathogens M. avium subsp. avium, M. avium subsp. paratuberculosis, and M. avium subsp. silvaticum [1]. DNA-DNA hybridization studies long ago established a genetic similarity between M. avium subspecies avium (M. avium) and M. avium subspecies paratuberculosis (M. paratuberculosis) [2–4]. Now that whole genome sequencing technologies are available, investigators can begin to examine genetic relatedness in greater detail through direct nucleotide-nucleotide comparisons. These comparisons are particularly important in instances where two genetically similar bacteria have few or no specific diagnostic tests to distinguish them.

The literature reports genetic similarity between M. paratuberculosis and M. avium at between 72% and 95% [2, 4] depending on the region analyzed. However, despite the reported similarities, these mycobacteria are quite different phenotypically. M. paratuberculosis is an intracellular pathogen that infects ruminant animals, most notably cattle and sheep. The site of infection is the gastrointestinal tract, where it causes a chronic inflammatory ailment termed Johne's disease [5]. In contrast, M. avium is common in the environment, causes tuberculosis in birds, and disseminated infections in HIV patients [6]. Growth of M. paratuberculosis is characterized by its slow rate (doubling time of 22–26 hours, compared to 10–12 hours for M. avium) and requirement of mycobactin in culture media [5]. With the absence of a well-defined genetic system for M. paratuberculosis, a comparative genomic approach holds great potential in addressing the genetic basis for many of these phenotypic differences.

The genus Mycobacterium contains species that range from fast-growing saprophytes such as M. smegmatis and M. fortuitum to slow-growing pathogens such as M. leprae, M. tuberculosis and M. paratuberculosis. Although the chromosomal origin of replication has been studied in some mycobacteria [7, 8], the genetic organization of the origin of replication in M. paratuberculosis was previously unknown. Knowledge of the gene organization and sequence of this region is particularly important because chromosomal replication may be regulated by a common mechanism that could directly affect rate of growth.

Several features of the oriC region are highly conserved among bacteria. The sequence immediately flanking the dnaA gene is considered the origin of chromosomal replication, or oriC region [9, 10]. This region contains several genes that encode proteins required for basic cellular functions, including the protein subunit of RNase P (RnpA), ribosomal protein L34 (RpmH), the replication initiator protein (DnaA), the beta subunit of DNA polymerase III (DnaN), the recombination repair protein RecF, and the DNA gyrase proteins GyrA and GyrB. The relative gene order in this region is also highly conserved in many bacteria, especially the Gram-positives [11]. Although intergenic sequences in this region are conserved only among closely related organisms, the DnaA box is found in the non-coding regions flanking dnaA in most bacteria studied [12]. DnaA boxes are conserved nucleotide sequences (TTGTCCACA) where the DnaA protein binds to DNA, triggering events that ultimately lead to replication initiation and DNA synthesis [9].

In an effort to understand the genetic basis for growth rate and other phenotypic differences between M. paratuberculosis and M. avium, we have analyzed the genetic similarity of these genomes using two strategies. First, the putative oriC region of M. paratuberculosis was amplified, sequenced and compared with M. avium and other bacteria. Second, we examined nucleotide identity outside the oriC region using DNA homology matrix analysis as well as several hundred M. paratuberculosis sequences from a random shotgun library compared with M. avium sequences present in the unfinished microbial genomes database. Our results show that these subspecies not only have a conserved gene order surrounding the origin of chromosomal replication, but also have high synteny and nucleotide identity throughout both genomes. In addition, this preliminary comparative survey of the genomes of M. avium and M. paratuberculosis shows even greater similarity (97%) than the literature suggests (72% to 95%) [2].

Results

Identification of predicted ORFs encoding replication-related proteins

An ~11-kb contiguous genomic fragment from M. paratuberculosis was amplified and sequenced using 15 primer pairs designed from M. avium genomic sequence in the putative oriC region (Fig. 1). This strategy enabled the successful amplification of all 15 minimally overlapping fragments of ~800 bp in length for this region of the M. paratuberculosis chromosome. A putative replication origin was identified by GC skew analysis [14]. A strong inflection point in the GC plot marks this origin (Fig. 1). Eleven ORFs were identified using the gene prediction software Artemis [15] (release 3; The Sanger Centre http://www.sanger.ac.uk/Software/Artemis/). Similarity searches were conducted locally using the BLASTP algorithm through the Artemis interface. Seven of these ORFs have high identity to proteins essential for basic cellular processes, including replication, in other mycobacterial species (Table 1). The function of GidB is unknown, but it may have a role in cell division [11]. RNase P, which consists of the protein subunit RnpA and a catalytic RNA subunit, is essential for generating mature tRNAs by cleaving the 5'-terminal leader sequences of precursor tRNAs [16]. rpmH encodes ribosomal protein L34, and DnaA is the initiator protein for chromosome replication. The β subunit of DNA polymerase III is encoded by dnaN. The recF gene product is involved in recombination, DNA repair, and induction of the SOS response, and may also have a role in replication [17]. Bacterial DNA gyrase, a tetramer consisting of A and B subunits, catalyzes the ATP-dependent negative supercoiling of covalently closed circular DNA [18]. The remaining predicted ORFs in this region have high similarity to hypothetical proteins from M. tuberculosis (Table 1).
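The GC skew calculation used to locate the putative origin can be illustrated with a minimal sketch. It assumes the common definition skew = (G - C)/(G + C) computed in non-overlapping windows, with the 500-nt window taken from the Fig. 1 legend; the function names and the toy sequence are hypothetical, and this is not the program used in the study [14].

```python
def gc_skew(seq, window=500, step=500):
    """Return (window_start, skew) pairs, with skew = (G - C) / (G + C)."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows without any G or C are given a skew of zero.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        skews.append((start, skew))
    return skews


def sign_changes(skews):
    """Window starts at which the skew switches sign (candidate inflection points)."""
    changes = []
    for (_, previous), (start, current) in zip(skews, skews[1:]):
        if previous * current < 0:
            changes.append(start)
    return changes


# Toy sequence (hypothetical): C-rich first half, G-rich second half.
toy = "GCCC" * 250 + "GGGC" * 250
print(sign_changes(gc_skew(toy)))
```

On real data the skew is noisy, so a sign change would normally be confirmed by plotting the windowed or cumulative skew rather than taken at face value.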

Figure 1

Amplification strategy and organization of the M. paratuberculosis chromosomal origin of replication. The locations of primer pairs used for amplification and sequencing are marked with facing arrows above the kilobase (kb) scale. The GC skew is shown beneath the kb scale and has a window size of 500. OriC, right at the point of the GC inflection, designates the origin of replication. An open reading frame map of the ~11 kb fragment is represented by shaded boxes and the two divergent arrows immediately above identify the direction of transcription. The degree of substitution in comparison to the corresponding M. avium gene is indicated below the gene name. π (pi) is the overall substitution rate, ds is the synonymous substitution rate, and dn is the non-synonymous substitution rate. GidB, glucose inhibited division protein B. RnpA, RNase P protein component. RpmH, ribosomal protein L34. DnaA, replication initiator. DnaN, DNA polymerase III β subunit. GyrB, DNA gyrase subunit B.

Table 1 Sequence analysis of predicted ORFs in the M. paratuberculosis oriC region.

Sequence homology and conserved gene order in the oriC region of mycobacteria and other Gram-positive bacteria

Alignment of the region surrounding oriC for several mycobacteria and other Gram-positive bacteria provides some interesting comparisons (Fig. 2). The M. paratuberculosis oriC region conforms to the conserved gene order that is present in other mycobacteria as well as the closely related Streptomyces coelicolor. Even the more distantly related Bacillus subtilis shows some degree of synteny in this region. The fast-growing M. smegmatis contains a gnd sequence between dnaN and recF, which is absent in the slow-growing mycobacteria (Fig. 2). However, there appear to be no notable differences between M. avium and M. paratuberculosis at this level. The M. smegmatis coding sequence, gnd, has similarity to the 6-phosphogluconate dehydrogenase genes in E. coli, but the mycobacterial protein is predicted to be about 200 amino acids shorter than the E. coli homolog. The lengths of the non-coding intergenic regions rpmH-dnaA and dnaA-dnaN are well conserved among the bacteria shown in Fig. 2. In many bacteria where a functional oriC has been identified, this gene order is conserved and oriC is adjacent to the dnaA gene [9, 10, 19].

Figure 2

Comparative gene order in the oriC region of mycobacteria and other Gram-positive bacteria. The relative gene order in this region of M. paratuberculosis conforms to the highly conserved order found in other Gram-positive bacteria. Numbers indicate the length of the ORF or intergenic region. Arrows show the direction of transcription.

The amino acid sequence of each gene product was compared with the corresponding sequence in M. paratuberculosis for all species in this study (Table 2). The data show that while gene order is conserved, the percent identity declines in comparisons with mycobacteria other than M. avium. This percent identity declines even further in comparisons with non-mycobacterial sequences such as those of S. coelicolor and Corynebacterium glutamicum (Table 2).

Table 2 Comparison of amino acid identity in the oriC region with the corresponding M. paratuberculosis sequence.

Conserved functional motifs in the M. paratuberculosis putative oriC

Fuzznuc (EMBOSS; http://www.hgmp.mrc.ac.uk/Software/EMBOSS/index.html) was used to identify potential DnaA boxes in the M. paratuberculosis oriC region. The Gram-positive organisms in this study harbor 10–30 DnaA boxes (with 1–3 mismatches from the consensus sequence TTGTCCACA) flanking the dnaA sequence [8, 20–23], and 35 were found surrounding the M. paratuberculosis dnaA gene (Fig. 3). In addition, a hexameric sequence thought to be recognized by ATP-DnaA (AGATCT) was found in the 3' non-coding sequence adjacent to dnaA (Fig. 3b). The additional DnaA boxes in M. paratuberculosis may be necessary to open the DNA helix of this GC-rich organism (69% GC content).
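The kind of mismatch-tolerant pattern search performed here can be sketched as follows. This is a simplified illustration and not Fuzznuc itself: the function names are hypothetical, both strands are scanned by also searching for the reverse complement of the consensus, and exact matches (0 mismatches) are reported alongside boxes with 1–3 mismatches.

```python
CONSENSUS = "TTGTCCACA"  # DnaA box consensus given in the text
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_boxes(seq, pattern=CONSENSUS, max_mismatch=3):
    """Return sorted (position, strand, mismatches) for every window that
    differs from the pattern at no more than max_mismatch positions."""
    seq = seq.upper()
    n = len(pattern)
    hits = []
    for strand, pat in (("+", pattern), ("-", revcomp(pattern))):
        for i in range(len(seq) - n + 1):
            mismatches = sum(1 for a, b in zip(seq[i:i + n], pat) if a != b)
            if mismatches <= max_mismatch:
                hits.append((i, strand, mismatches))
    return sorted(hits)


# Toy sequence (hypothetical): one exact box on each strand.
toy = "GGG" + "TTGTCCACA" + "GGGG" + "TGTGGACAA" + "GG"
for position, strand, mismatches in find_boxes(toy, max_mismatch=0):
    print(position, strand, mismatches)
```

With three mismatches allowed in a 9-nt pattern, chance matches become frequent in a GC-rich genome, so candidate boxes would still need to be judged by their position relative to dnaA.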

Figure 3

Non-coding sequences flanking M. paratuberculosis dnaA harbor 35 DnaA boxes. Nucleotide sequences of the rpmH-dnaA intergenic region (A) and dnaA-dnaN intergenic region (B) are shown. Sequences matching the DnaA box consensus (TTGTCCACA) with 1–3 mismatches are marked with an arrow. In (B), an A/T-rich region is underlined and the potential ATP-DnaA recognition site is boxed.

The dnaA gene is divided into four functional domains based on analysis of several dnaA mutants [24]. These domains consist of (1) an area near the N-terminus thought to be involved in the ability of the DnaA protein to aggregate, (2) an ATP-binding domain, (3) a domain that maps to a region near the C-terminus and is involved in DNA binding, and (4) a final domain of unknown function that may bind DnaB. The conserved ATP-binding site that is found in domain III in other bacteria was also located in M. paratuberculosis (Fig. 3b). An AT-rich stretch of 19 nucleotides (74% A+T), which in other bacteria serves as the site of local unwinding of DNA after DnaA-DNA interaction, was located in non-coding sequence adjacent to dnaA (Fig. 3b). The non-coding sequences flanking dnaA are slightly AT-rich in general relative to the rest of the genome sequence (38%–40% A+T vs. ~33% in the entire sequence), consistent with findings in other Gram-positive bacteria.

The vast majority of M. paratuberculosis K-10 genomic sequences have considerable nucleotide similarity to sequences from the human pathogenic isolate M. avium 104

As a basis for all nucleotide comparisons between M. avium and M. paratuberculosis in this study, an alignment of the 16S rRNA gene was performed. That analysis revealed 100% nucleotide identity over the entire 1,472-bp gene (data not shown). Likewise, the oriC region in M. paratuberculosis was found to share a high level of nucleotide identity (~98%) with M. avium. Calculation of the rates of total nucleotide diversity (π), synonymous substitution per synonymous site (ds) and non-synonymous substitution per non-synonymous site (dn) revealed patterns of variation within the range observed from sequence data outside the oriC region. These calculations showed a high degree of similarity between the two sequences and a predominance of synonymous over non-synonymous substitutions (Fig. 1). The patterns of nucleotide substitution varied considerably between genes in this region of the genome. For instance, there was complete nucleotide identity in the rpmH and recF genes and only 94% identity in the gene rnpA. To verify that these observed differences were real and not the result of sequencing errors in the as yet unfinished M. avium genome, we confirmed the data by resequencing the entire 11 kb region from an isolate of M. avium and obtained identical results (not shown).

We next determined whether the nucleotide identities would remain consistently high when M. paratuberculosis sequences outside the oriC region were compared with M. avium. Sequencing of the M. paratuberculosis K-10 cattle isolate is nearing completion in our laboratories, and TIGR (http://www.tigr.org) is in the finishing stages of sequencing M. avium isolate 104. Beginning with nucleotide number 1 in the dnaA coding region of each genome, a comparison of 2 million bases of M. paratuberculosis with 2 million bases from M. avium by Pustell DNA matrix analysis [25] indicates that genomic similarity continues outside the oriC region (Fig. 4). When evaluating similarities between two sequences of this size, a matrix comparison is the method of first choice. In addition, the matrix method displays matching regions in the context of the sequence as a whole, making it easy to determine if the regions are repeated or inverted. For example, Fig. 4 shows a large 56.6 kb genomic inversion of the region surrounding nucleotide 350,000. The DNA identity matrix also identified sequences that were present in one genome but absent in the other, as shown by the broken diagonal lines (Fig. 4). These data show remarkable similarity over large regions in both mycobacterial genomes.

Figure 4

DNA matrix analysis of a contiguous 2 million nucleotide section of the M. avium (y-axis) and M. paratuberculosis (x-axis) genomes. Four 500,000 nucleotide matrices are shown with the nucleotide segments indicated above each plot. A long unbroken diagonal line from the upper left corner to the lower right corner indicates that the sequences are collinear. The diagonal line (in blue) that runs from the lower left to the upper right at the 350,000 nucleotide region indicates that one sequence is the reverse complement of the other. The arrows (in red) show sequences present in M. avium but absent in M. paratuberculosis and the arrowhead (in green) shows a sequence represented only in M. paratuberculosis. The initial nucleotide in the dnaA coding sequence was defined as number one in both genomes for this analysis. The parameters for this DNA identity matrix include: a window size of 30, a minimum percent score of 80, and a hash value of 4.
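The idea behind such an identity matrix can be sketched in a few lines. The example below is a simplified illustration, not the Pustell algorithm as implemented in MacVector: it seeds candidate positions with exact k-mer matches (k taken from the hash value of 4) and keeps a dot wherever a 30-nt window reaches at least 80% identity. Function names and the toy sequence are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_index(seq, k):
    """Map every k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def dot_matrix(a, b, window=30, min_percent=80, hash_size=4):
    """Return (i, j) pairs where the window starting at a[i] and the window
    starting at b[j] share an exact leading k-mer and score at least
    min_percent identity over the whole window."""
    a, b = a.upper(), b.upper()
    index = kmer_index(b, hash_size)
    needed = window * min_percent / 100
    dots = []
    for i in range(len(a) - window + 1):
        for j in index.get(a[i:i + hash_size], ()):
            if j + window > len(b):
                continue
            matches = sum(x == y for x, y in zip(a[i:i + window], b[j:j + window]))
            if matches >= needed:
                dots.append((i, j))
    return dots


# Toy sequence (hypothetical): compared with itself it yields the main diagonal.
toy = "ATGCGTACGTTAGCCTAGGATCCGATTACGGCTAACGTTCGAGCTTAGCATCGGATCCAAGT"
diagonal = [(i, j) for i, j in dot_matrix(toy, toy) if i == j]
print(len(diagonal))
```

Collinear regions appear as runs of dots with a constant i - j offset. An inversion such as the one near nucleotide 350,000 would instead show up when one sequence is compared with `revcomp` of the other.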

Finally, we analyzed 548 recombinant clones from a randomly sheared M. paratuberculosis small insert library in order to obtain specific rates of nucleotide substitutions. Sequences from these clones represented over 350,000 bp of unique (non-overlapping) M. paratuberculosis genomic DNA and comprised 7% of the estimated 5 Mb genome sequence. From this analysis, we estimated the rates of total, synonymous and non-synonymous substitutions for 200 fragments that were aligned in-frame and then analyzed with the program NAGV2 [26] using the methods of Nei and Gojobori [27]. The results of these analyses show that the average nucleotide diversity between the two subspecies is 2.59% ± 0.06% (range, 0% to 18.8%; median, 1.85% ± 0.05%). The results also show that the average rate of synonymous substitution per synonymous site is 3.38% ± 1.32% (range, 0% to 19.5%; median, 3.5% ± 1.5%). In contrast, the average rate of non-synonymous substitution per non-synonymous site is 1.89% ± 0.05% (range, 0% to 12.9%; median, 1.3% ± 0.05%). These results not only indicate that the two subspecies have a high degree of nucleotide identity (>97%), but also suggest that the patterns of substitution have favored synonymous substitutions, as would be expected under purifying (negative) selection.
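For readers unfamiliar with the calculation, the un-weighted Nei and Gojobori approach [27] can be sketched as below. This is a minimal illustration and not the NAGV2 program: it assumes the standard genetic code, two in-frame sequences of equal length without gaps, treats changes to stop codons as non-synonymous when counting sites, and ignores mutational pathways that pass through a stop codon. All function names are hypothetical.

```python
from itertools import permutations, product
from math import log

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon):
    """Number of synonymous sites (0 to 3) in one codon."""
    s = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON[mutant] == CODON[codon]:
                s += 1 / 3
    return s


def codon_diffs(c1, c2):
    """(synonymous, non-synonymous) differences averaged over all pathways."""
    positions = [i for i in range(3) if c1[i] != c2[i]]
    if not positions:
        return 0.0, 0.0
    syn_total = non_total = 0.0
    paths = 0
    for order in permutations(positions):
        current, syn, non, valid = c1, 0, 0, True
        for pos in order:
            nxt = current[:pos] + c2[pos] + current[pos + 1:]
            if CODON[nxt] == "*":  # assumption: skip pathways through stops
                valid = False
                break
            if CODON[nxt] == CODON[current]:
                syn += 1
            else:
                non += 1
            current = nxt
        if valid:
            syn_total += syn
            non_total += non
            paths += 1
    if paths == 0:  # assumption: count every change as non-synonymous
        return 0.0, float(len(positions))
    return syn_total / paths, non_total / paths


def jukes_cantor(p):
    """Correct a proportion of differences for multiple hits (valid for p < 0.75)."""
    return -0.75 * log(1 - 4 * p / 3)


def nei_gojobori(seq1, seq2):
    """Return (ds, dn) for two aligned, in-frame coding sequences."""
    S = N = sd = nd = 0.0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if c1 not in CODON or c2 not in CODON or "*" in (CODON[c1], CODON[c2]):
            continue  # skip ambiguous bases and stop codons
        s = (syn_sites(c1) + syn_sites(c2)) / 2
        S += s
        N += 3 - s
        d_syn, d_non = codon_diffs(c1, c2)
        sd += d_syn
        nd += d_non
    ps = sd / S if S else 0.0
    pn = nd / N if N else 0.0
    return jukes_cantor(ps), jukes_cantor(pn)


# Toy alignment (hypothetical): a single synonymous GGT -> GGC change.
ds, dn = nei_gojobori("GGTGGTGGT", "GGTGGCGGT")
print(ds, dn)
```

A ds value larger than dn, as reported for the 200 fragments analyzed here, is the pattern expected when most amino acid replacements are removed by selection.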

Discussion

With the genome sequencing projects of M. paratuberculosis and M. avium nearing completion, we have been able to compare large amounts of sequence data for the first time. Our results show substantial nucleotide identity, above even that reported previously in the literature [2–4]. Paradoxically, the overall nucleotide identity between these phenotypically distinct mycobacteria appears similar to that observed between two phenotypically identical Helicobacter pylori isolates (≥98% nucleotide identity) [28].

The high nucleotide identity shared between M. paratuberculosis and M. avium directly conflicts with their divergent phenotypic characteristics. Because of strong similarity in the oriC region, alternative hypotheses should be tested to explain the growth rate differences between M. avium and M. paratuberculosis. Genomic rearrangements and the presence of unique genes identified by matrix analysis in this study are two such possibilities that could account for some of the phenotypic differences. We have recently reported on M. paratuberculosis coding sequences that are absent in M. avium [29]. From an analysis of 48% of the M. paratuberculosis genome, only 27 predicted coding sequences were found to be absent in M. avium. Therefore, an estimated total of 50–60 M. paratuberculosis coding sequences might be absent in M. avium following a whole genome analysis. This extremely low number of unique M. paratuberculosis genes is in stark contrast to E. coli where the MG1655 isolate contains 528 genes not found in the EDL933 isolate [30]. Further analysis of this limited number of unique coding sequences will be critical in developing specific diagnostic reagents. Finally, a detailed analysis of coding sequences unique to each respective mycobacterial genome and their genetic regulatory networks will be necessary to understand the molecular basis for growth rate and other phenotypic differences.
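The estimate of 50–60 unique coding sequences follows from a simple proportional extrapolation, sketched here. It assumes that unique coding sequences are spread evenly across the genome, which the source does not state; the function name is hypothetical.

```python
def extrapolate_unique_cds(observed, fraction_surveyed):
    """Scale a count observed in part of a genome up to the whole genome,
    assuming the counted features are evenly distributed."""
    if not 0 < fraction_surveyed <= 1:
        raise ValueError("fraction_surveyed must be in (0, 1]")
    return observed / fraction_surveyed


# 27 coding sequences absent from M. avium in 48% of the genome surveyed.
print(extrapolate_unique_cds(27, 0.48))
```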

Other potential explanations include the presence of global regulators, insertion sequences, transcription-translation rates, genomic rearrangements and ribosomal RNA operons. Each respective genome possesses insertion elements (IS900, IS1311) at unique loci that could distinctly affect growth difference or other phenotype by insertional mutation. Foley-Thomas et al. [31] compared the expression of the luciferase gene in M. paratuberculosis with the fast-growing M. smegmatis and concluded that the rates of transcription and translation may not account for the slow growth of M. paratuberculosis.

We present evidence for at least one large-scale genomic rearrangement between these two subspecies. This rearrangement consists of a 56.6 kb inversion that contains approximately 61 predicted coding sequences (Bannantine and Kapur, unpublished). Genomic rearrangements such as that described could have a profound effect on phenotype. The presence of multiple copies of ribosomal RNA operons within a genome can be directly attributed to faster growth rate. The increased gene dosage results in more ribosomes and therefore increased protein translational capacity. However, only one rRNA operon is present in each subspecies and this is also true for the fast growers Mycobacterium abscessus and Mycobacterium chelonae [32]. These fast growing mycobacteria have multiple promoters that increase the transcriptional rate of the rRNA operon to overcome gene dosage limitations [32]. The rRNA operon promoter structures have not been mapped by primer extension for either M. paratuberculosis or M. avium, but if M. avium had multiple functional rRNA operon promoters, that may account for the growth rate differences.

The genetic organization of the origin of replication has been characterized in several Gram-positive bacteria including B. subtilis, S. coelicolor, M. tuberculosis, M. avium, M. leprae, and M. smegmatis [8]. The results of our investigation of the oriC region of M. paratuberculosis show that each of the 15 primer pairs, designed from M. avium sequence data, resulted in the successful amplification and subsequent sequencing of an ~11 kb region of the M. paratuberculosis genome. The sequenced region encodes 11 putative proteins, several of which show a high level of identity to proteins that are known or predicted to be involved in DNA replication. However, we found a cluster of substitutions in a region of rnpA (data not shown). It is noteworthy that in this region of the gene, each of the nucleotide substitutions results in an amino acid replacement. While mutations in this region of the gene are known to result in dramatic differences in the ability of bacteria to respond to environmental stresses [33], the functional significance of these differences between M. avium and M. paratuberculosis is at present unknown. While these sequencing efforts have revealed a conserved gene order in the oriC of Gram-positive bacteria [11], the nucleotide and amino acid identity between M. paratuberculosis and M. avium in this region is much higher than that with other mycobacteria and other Gram-positive bacteria (see Table 2). It is well recognized that the characterization of gene organization in the oriC region, as well as the complete genome sequence, will provide a springboard for addressing questions such as the nature of the slow growth rate of M. paratuberculosis as compared to the genetically related rapidly growing mycobacteria. Progress on these research fronts will improve our chances of understanding and controlling infections caused by M. paratuberculosis and related pathogens.

The conservation of functional sequence motifs in the oriC of other Gram-positive organisms has provided clues to the mechanism of bacterial replication. For instance, DnaA monomers bind to specific, non-palindromic 9-nucleotide sequences called DnaA boxes, and this interaction is thought to initiate replication. The oriC of Gram-positive bacteria typically contains 10 – 30 of these DnaA boxes, often found in non-coding regions flanking the dnaA gene. The interaction of DnaA with DnaA boxes promotes the local unwinding of a nearby AT-rich region, providing an entry site for the DnaB/DnaC helicase complex. The dnaA gene itself is divided into four domains that differ in the extent of sequence homology [34]. Domain IV is responsible for DnaA box recognition and domain III is a highly conserved region containing the ATP-binding site [13, 35]. Domain I participates in cooperative DnaA protein-DNA interactions [36].

The genetic relatedness of M. paratuberculosis with other mycobacterial subspecies has been the root cause of the lack of development of M. paratuberculosis-specific diagnostic tests. By comparing the genome sequences of both M. paratuberculosis and M. avium, specific diagnostic tests may be developed and a better understanding of the molecular differences that contribute to unique phenotypes will be obtained. Finally, knowledge of the complete genome sequence of M. paratuberculosis is expected to facilitate the identification of diagnostic sequences in this economically significant veterinary pathogen.

Conclusion

With the genomes of M. paratuberculosis and M. avium nearly completed, investigators will be able to analyze the similarities and differences between these genomes in considerable detail. Through a comparative genomic analysis of over 2 million nucleotides, we have shown that the two subspecies, avium and paratuberculosis, are highly similar at the gene and nucleotide level. This is in stark contrast to the phenotypic differences that each displays.

Methods

Strains and growth media

A cattle isolate (K-10) of M. paratuberculosis [31] has been chosen for genome sequencing studies. The organism was grown in Middlebrook 7H9 broth supplemented with OADC (Difco Laboratories, Detroit, MI), Tween 80, and mycobactin J (Allied Monitor, Fayette, MO) as described by Bannantine et al. [37]. M. avium strain 104 was grown in Middlebrook 7H9 broth. DNA was extracted using the Qiagen QIAamp Tissue Kit (Chatsworth, CA).

Primer design and amplifications

Primers were designed with the web-interfaced program Primer3 (http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi), based on available M. avium strain 104 genomic sequence data (http://www.tigr.org), for the amplification of 11 genes in a contiguous ~11 kb M. paratuberculosis fragment surrounding the putative origin of replication (oriC). By this strategy, a total of 15 primer pairs were constructed for the amplification of 15 minimally overlapping fragments of ~800 bp in length for this region of the M. paratuberculosis genome. Amplification reactions included the high-fidelity DNA polymerase Pfu (Stratagene, La Jolla, CA) and an annealing temperature of 58°C.
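The tiling arithmetic behind this design can be illustrated with a short sketch. The actual primer coordinates are not given here, so the evenly spaced layout, the exact region length of 11,000 bp, and the function name are assumptions made only for illustration.

```python
def tile_amplicons(region_length=11000, n_fragments=15, fragment_length=800):
    """Return (start, end) coordinates of evenly spaced, overlapping fragments
    that together span the region (0-based start, exclusive end)."""
    if n_fragments * fragment_length < region_length:
        raise ValueError("fragments are too short to cover the region")
    span = region_length - fragment_length
    tiles = []
    for i in range(n_fragments):
        start = i * span // (n_fragments - 1)
        tiles.append((start, start + fragment_length))
    return tiles


tiles = tile_amplicons()
# Overlap between each fragment and the next one.
overlaps = [a_end - b_start for (_, a_end), (b_start, _) in zip(tiles, tiles[1:])]
print(tiles[0], tiles[-1], min(overlaps), max(overlaps))
```

Under these assumptions, neighbouring fragments overlap by roughly 70 bp, which is in line with the description of the fragments as minimally overlapping.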

Library construction

A random 2.2-kb insert library of M. paratuberculosis K-10 was constructed as follows. Total M. paratuberculosis genomic DNA was isolated and randomly sheared using a nebulizer and compressed nitrogen according to protocols developed by Bruce Roe's laboratory (http://www.genome.ou.edu). The resulting DNA fragments were separated by gel electrophoresis and fragments in the range of 2.1–2.2 kb were purified. After the ends of the fragments were polished using Klenow (New England Biolabs, Beverly, MA), they were cloned into SmaI-digested, CIAP-treated pUC18 vector. The resulting library was >90% recombinant and contained more than 50,000 independent recombinant clones.

DNA Sequencing and Analysis

The DMSO protocol (ABI Automated DNA Sequencing Chemistry Guide, ABI, Foster City, CA) was used for the sequencing reactions, and data were collected using ABI 377 automated DNA sequencers at the Advanced Genetic Analysis Center at the University of Minnesota. The data were analyzed using the DNAStar (Madison, WI) package and Artemis [15]. Rates of synonymous and non-synonymous substitution were calculated by the un-weighted method of Nei and Gojobori [27]. Pustell DNA matrix analysis [25] was performed using MacVector version 7.1 software.

Nucleotide Sequence Accession Number

The GenBank accession number for the M. paratuberculosis 11-kb oriC region is AF222789. The M. paratuberculosis random sequences can be accessed via the M. paratuberculosis genome project website: http://www.cbc.umn.edu/ResearchProjects/AGAC/Mptb/Mptbhome.html.