Background

A large number of genomes of different strains and closely related species of pathogens have been sequenced, and many others are in progress. Detailed analysis of these genomic sequences can help decipher and establish genotype-to-phenotype relationships. Organisms evolve through a series of molecular changes that are reflected in their genomic sequences, and some of these changes are selected because they favor survival in a specific ecological niche. Characterizing sequence alterations in closely related organisms can help us understand genome evolution at the molecular level over short time spans, for example the emergence of new endemic strains within a few decades. It is therefore important to catalog all the sequence differences between any two organisms so that they can serve as a basis for designing experiments that link phenotype to genotype. For pathogenic organisms, such a database can be useful in identifying species- and strain-specific markers that can form the basis for designing diagnostic reagents.

A number of molecular mechanisms responsible for genomic changes have been described [1]. These contribute to single nucleotide polymorphisms (SNPs), variable numbers of tandem repeats, insertions/deletions with or without the involvement of transposable elements, and recombination. Many of these have been used as markers for the identification of strains and the diagnosis of pathogens [2–4]. M. tuberculosis is a major cause of morbidity and mortality throughout the world. Genomic variations in this organism have been used to type pathogenic strains on a limited scale [5, 6]. A comprehensive database of all the genomic variations of M. tuberculosis is not currently available, though some attempts have been made in this direction. For example, MTBreg covers variations that are detected using spoligotyping, while MycoDB [7], MycoperonDB [8] and GenoMycDB [9] have some features that allow limited comparison between two genomes (see Availability and requirements for the addresses of these resources). In this report we describe a comprehensive database of genomic differences among strains and species of Mycobacteria belonging to the M. tuberculosis complex. The variations were identified using ABWGC, a comparative genomic tool that we described previously [10]. We hope that this database will be useful to clinicians and basic scientists interested in understanding Mycobacterial diseases.

Construction and content

The database contains pre-computed data derived with ABWGC [10] from the full genome sequences of M. tuberculosis strains H37Rv, CDC1551, H37Ra and F11, Mycobacterium bovis AF2122/97 and M. bovis BCG str. Pasteur 1173P2. The variations are categorized as SNPs, insertions, divergent regions (based on lack of sequence identity) and tandem repeats. All computations were carried out in a pair-wise fashion. In some cases, such as SNPs, the results differ depending on which genome of a pair was used as the query. The database therefore contains two sets of data for each pair, one for each genome used as the query sequence. An insertion in one genome can be considered a deletion in the other, so the database contains only insertions. Insertions that are due to known insertion elements or phage sequences are flagged so that the information can be used to devise better methods for diagnosis and strain identification. Tandem repeats were identified using ABWGC and verified with "Tandem Repeats Finder" [11].
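The pair-wise design can be illustrated with a short sketch. It is not part of MGDD or ABWGC: it only enumerates the ordered (query, subject) pairs implied by six genomes and shows how a stored insertion can be read as a deletion from the other genome's point of view. The genome labels and the record fields (`present_in`, `absent_from`, `length`) are assumptions made for the example.

```python
from itertools import permutations

# Short labels for the six genomes (the labels themselves are illustrative).
GENOMES = [
    "Mtb_H37Rv",
    "Mtb_CDC1551",
    "Mtb_H37Ra",
    "Mtb_F11",
    "Mbovis_AF2122/97",
    "Mbovis_BCG_Pasteur_1173P2",
]

# Every genome is used as the query against every other genome, so the
# comparisons are ordered pairs: (A, B) and (B, A) are distinct entries.
ordered_pairs = list(permutations(GENOMES, 2))


def as_deletion(insertion):
    """Re-express an insertion record as a deletion in the other genome.

    The record layout is hypothetical: 'present_in' is the genome that
    carries the extra sequence and 'absent_from' is the genome lacking it.
    """
    return {
        "type": "deletion",
        "genome": insertion["absent_from"],
        "relative_to": insertion["present_in"],
        "length": insertion["length"],
    }
```

With six genomes this gives 6 × 5 = 30 ordered comparisons, which is why storing insertions alone is sufficient as long as both directions of each pair are available.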

MGDD is implemented using a three-tier architecture. The web-based application is served by the Apache web server, which is connected to a MySQL database through an application layer written in Perl-CGI.
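The role of the application layer can be sketched as follows. This is a minimal illustration and not the MGDD implementation: it uses Python's built-in sqlite3 module as a stand-in for MySQL and Perl-CGI, and the table name `snp`, its columns, and the three rows of data are all invented for the example, since the actual schema is not described here.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with a hypothetical SNP table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snp ("
        " query_genome TEXT, subject_genome TEXT,"
        " position INTEGER, query_base TEXT, subject_base TEXT)"
    )
    # Invented rows for illustration only; these are not MGDD data.
    conn.executemany(
        "INSERT INTO snp VALUES (?, ?, ?, ?, ?)",
        [
            ("H37Rv", "CDC1551", 1200, "g", "t"),
            ("H37Rv", "CDC1551", 5400, "a", "g"),
            ("H37Rv", "CDC1551", 9100, "g", "t"),
        ],
    )
    return conn


def find_snps(conn, query_genome, subject_genome, change, start=None, end=None):
    """Translate form fields into a parameterized SQL query.

    'change' is a two-letter code such as 'gt' (query base, subject base).
    'start' and 'end' optionally restrict the search to a coordinate range.
    """
    sql = (
        "SELECT position, query_base, subject_base FROM snp "
        "WHERE query_genome = ? AND subject_genome = ? "
        "AND query_base = ? AND subject_base = ?"
    )
    params = [query_genome, subject_genome, change[0], change[1]]
    if start is not None:
        sql += " AND position >= ?"
        params.append(start)
    if end is not None:
        sql += " AND position <= ?"
        params.append(end)
    sql += " ORDER BY position"
    return conn.execute(sql, params).fetchall()
```

A call such as `find_snps(open_demo_db(), "H37Rv", "CDC1551", "gt", start=5000)` returns only the matching rows at or beyond position 5000. Passing user input as bound parameters rather than concatenating it into the SQL string is the usual way for a middle tier to keep the web form and the database separate.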

Information can be retrieved from MGDD by selecting a specific query with the "search option" provided in the MGDD browser.

Utility and discussion

MGDD has a web interface for the retrieval of genomic diversity information. A search is initiated by first selecting genomes from the "Query" and "Subject" scroll-down menus. Currently there is information about six organisms, and these can be selected in a pair-wise manner (Fig. 1). These organisms are:

M. tuberculosis CDC1551 (NC_002755.2)

M. tuberculosis F11 (NC_009565.1)

M. tuberculosis H37Ra (NC_009525.1)

M. tuberculosis H37Rv (NC_000962.2)

M. bovis AF2122/97 (NC_002945.3)

M. bovis BCG str. Pasteur 1173P2 (NC_008769.1)

Figure 1 A typical output of a query (SNP). The transition selected was 'gt' and M. tuberculosis H37Rv was compared with M. tuberculosis CDC1551.

Each pair of organisms can be analyzed in two different ways, by choosing each one in turn as the query and the other as the target genome. We recommend that a pair be analyzed in both ways in order to get a comprehensive list of variations, particularly indels. After the organisms have been selected, the type of variation needs to be selected from the search page. Currently four options are available, and one of them must be chosen:

SNP, Insertion, Repeat expansion, Divergent regions

After the selection is submitted, a detailed query page appears. For example, if SNP is selected, the new page asks the user to choose one of the 20 possible transitions from a menu, and the search can be restricted by specifying genomic coordinates or a gene name (Fig. 1). The output shows all the indicated SNPs in the selected region along with the annotation of the genes that contain them (Fig. 1). For insertions, divergent regions and repeat expansions, the query page also offers the option of selecting output on the basis of size in nucleotides. There are four options at present: <10, 10–50, 50–100 and >100. Since only one query can be selected at a time, an error message is displayed if more than one query is selected.
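The size-based selection can be sketched as a simple binning function. The sketch assumes that the four size classes are meant as under 10, 10–50, 50–100 and over 100 nucleotides. Because the stated ranges share their end points, the boundary handling below is also an assumption, as are the function names and the `length` field of the records.

```python
def size_class(length):
    """Assign a variant length (in nucleotides) to one of four size classes.

    Boundary handling is an assumption: 10 goes to '10-50', 50 goes to
    '50-100', and 100 stays in '50-100'.
    """
    if length < 10:
        return "<10"
    if length < 50:
        return "10-50"
    if length <= 100:
        return "50-100"
    return ">100"


def filter_by_size(records, wanted_class):
    """Keep only the records whose 'length' falls in the requested class."""
    return [r for r in records if size_class(r["length"]) == wanted_class]
```

For example, `filter_by_size([{"length": 7}, {"length": 64}], "50-100")` keeps only the second record.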

Table 1 summarizes the total data present in the database. However, the distribution of these changes differs among strains and species. For example, the number of SNPs between the two M. tuberculosis strains H37Ra and H37Rv is 588, whereas that between the two M. bovis strains is 1271 (Table 2). In general, far fewer variants were observed between these two M. tuberculosis strains than between the two M. bovis strains. This is consistent with the fact that M. tuberculosis strains H37Ra and H37Rv were recently derived from H37 [12]. These differences are a result of the evolutionary history of the organisms and can be useful for mapping potential mutation hotspots in these organisms.

Table 1 Statistics and data composition of MGDD
Table 2 Genomic variants in M. tuberculosis and M. bovis strains

Conclusion

In this report we describe MGDD, a database of genomic variants computed from fully sequenced organisms belonging to the M. tuberculosis complex. It contains data pertaining to SNPs, insertions, repeat expansions and regions that show sequence divergence. Since MGDD is modular, information on new genomes can be incorporated as and when their sequences become available. The search tool is simple and user-friendly and allows one to locate a specific variation in any part of the genome or in a gene.

Availability and requirements

The web server can be accessed at http://mirna.jnu.ac.in/mgdd/.

MTBreg: http://www.doe-mbi.ucla.edu/Services/MTBreg/

MycoDB: http://xbase.bham.ac.uk/mycodb/about.pl

MycoperonDB: http://www.cdfd.org.in/mycoperondb/index.html

GenoMycDB: http://157.86.176.108/~catanho/genomycdb/