Importing statistical measures into Artemis enhances gene identification in the Leishmania genome project

Aggarwal, Gautam; Worthey, EA; McDonagh, Paul D; Myler, Peter J

doi:10.1186/1471-2105-4-23

Research article
Open access
Published: 07 June 2003

Volume 4, article number 23, (2003)

BMC Bioinformatics

Gautam Aggarwal¹,
EA Worthey¹,
Paul D McDonagh² &
…
Peter J Myler¹,³

Abstract

Background

Seattle Biomedical Research Institute (SBRI) as part of the Leishmania Genome Network (LGN) is sequencing chromosomes of the trypanosomatid protozoan species Leishmania major. At SBRI, chromosomal sequence is annotated using a combination of trained and untrained non-consensus gene-prediction algorithms with ARTEMIS, an annotation platform with rich and user-friendly interfaces.

Results

Here we describe a methodology used to import results from three different protein-coding gene-prediction algorithms (GLIMMER, TESTCODE and GENESCAN) into the ARTEMIS sequence viewer and annotation tool. Comparison of these methods, along with the CODON USAGE algorithm built into ARTEMIS, shows the importance of combining methods to more accurately annotate the L. major genomic sequence.

Conclusion

An improvised and powerful tool for gene prediction has been developed by importing data from widely used algorithms into an existing annotation platform. This approach is especially fruitful in the Leishmania genome project, where there is a large proportion of novel genes requiring manual annotation.

Background

At Seattle Biomedical Research Institute (SBRI), we are involved, as part of the Leishmania Genome Network (LGN), in the sequencing and annotation of the trypanosomatid protozoan species L. major Friedlin (LmjF). Following DNA sequence determination, putative protein-coding regions within the sequence are predicted and functionally classified. Although trypanosomatids are eukaryotes, their gene structure is more similar to that of prokaryotes; they have essentially no introns and small intergenic regions. Two small LmjF chromosomes (chr1 and chr3) have been completely sequenced and annotated. The 79 protein-coding genes predicted from chr1 are organized in two large divergent polycistronic gene clusters of 29 and 50 genes, on the "bottom" and "top" DNA strands, respectively [1], while chr3 contains two convergent polycistronic clusters of 65 and 29 genes, with a single divergent gene at one telomere and a single tRNA between the two large clusters [2].

Presently, a large number of methods exist for in silico prediction of coding regions [3–7]. These computational methods use a range of underlying statistical properties of the coding regions and can be generally classified as consensus (signal sensors) and non-consensus (content sensors) [8, 9]. The non-consensus methods can be further classified as trained, which require unbiased sets of coding regions, and untrained, which use statistical properties to discriminate between coding and non-coding regions. Although non-consensus methods have been very successful in identifying genes in most sequencing projects, none currently has 100% specificity and sensitivity. In the absence of such a method, using a combination of methods is the next best option [10–13]. Since LmjF genes do not contain introns, and the signal sequences for trans-splicing and polyadenylation are poorly defined, consensus methods have little utility for Leishmania gene prediction. In addition, ~70% of the genes have no significant homology to existing genes in sequence databases, so extrinsic content-sensing methods are of limited use, leaving only intrinsic content-sensing methods for gene prediction. Given that the number of experimentally confirmed gene predictions in Leishmania is currently small, and many methods use similar statistical approaches [4], the choice of two trained methods (GLIMMER [14] and CODON USAGE [15]) and two untrained methods (TESTCODE [16] and GENESCAN [17]), which rely on unrelated statistical measures, should provide substantial power for gene prediction in LmjF.

The freely available JAVA-based software package ARTEMIS [18] was designed specifically as an annotation platform and has a user-friendly graphical interface. It simplifies time-consuming processes such as inter-file format conversion and BLAST analysis [19], and provides a convenient environment for viewing the gene structure and organization of large DNA segments. Here we describe a method for importing data from GLIMMER, TESTCODE and GENESCAN into ARTEMIS to enhance gene prediction and annotation.

Results and Discussion

We have developed a partially automated process for prediction and annotation of LmjF protein-coding genes in which the gene predictions from GLIMMER and the statistical outputs from TESTCODE and GENESCAN are imported into ARTEMIS (see additional file 1), where they can be viewed graphically alongside the CODON USAGE statistics already built into ARTEMIS. Figure 1 shows a panel containing results from each of the four gene-prediction methods for a typical LmjF sequence. The predictions from GLIMMER are imported as CDS features and displayed as colored rectangles in the panel showing ORFs (the vertical bars are the stop codons) in all six reading frames. The window scans from TESTCODE, GENESCAN and CODON USAGE are displayed graphically in panels above the GLIMMER predictions. The thresholds used to indicate likely protein-coding ORFs for TESTCODE and GENESCAN are 9.7 and 4.0, respectively. This allows visual comparison of the four gene-prediction methods and manual alteration of the GLIMMER-predicted CDS features if necessary. The reliance on multiple gene-prediction methods increases confidence in the predictions.
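As a rough illustration of how predicted genes can be handed to an annotation viewer as CDS features, the sketch below writes predictions as EMBL-style feature-table lines. The `Prediction` container, the function name and the `/note` text are hypothetical and chosen for this example only; the parser actually used is the Perl script in additional file 1, which is not reproduced here.

```python
from typing import Iterable, NamedTuple


class Prediction(NamedTuple):
    """One predicted gene (hypothetical container; 1-based inclusive coordinates)."""
    start: int
    end: int
    strand: str  # "+" or "-"


def to_embl_features(predictions: Iterable[Prediction], source: str = "GLIMMER") -> str:
    """Format predictions as EMBL feature-table lines with the CDS key."""
    lines = []
    for p in predictions:
        low, high = sorted((p.start, p.end))
        location = f"{low}..{high}"
        if p.strand == "-":
            # Reverse-strand features are written with the complement() operator.
            location = f"complement({location})"
        # "FT" + 3 spaces, then the key padded so the location starts in column 22.
        lines.append("FT   " + "CDS".ljust(16) + location)
        # Qualifier lines start in column 22 as well ("FT" + 19 spaces).
        lines.append("FT" + " " * 19 + f'/note="predicted by {source}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    demo = [Prediction(120, 1400, "+"), Prediction(2100, 3050, "-")]
    print(to_embl_features(demo), end="")
```

Keeping the strand in the location string, rather than in a qualifier, lets the viewer place each feature in the correct reading-frame row without further processing.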

In Table 1, we show a comparison of the results of automated gene prediction using the four different programs with the manual annotations for three completely sequenced chromosomes (chr1, chr3 and chr4) from LmjF. The False Positive rate for each individual method was quite high, with GLIMMER being significantly worse than the others. Most of the False Positives were due to prediction of genes on the wrong coding strand. All methods, with the exception of TESTCODE, showed a low number of False Negatives. The poor performance of TESTCODE was largely due to the use of a high cut-off value (9.7) for the average Fickett statistic of the whole ORF, rather than of smaller windows. Thus, individually, each of the automated programs had a high Error Discovery Rate (incorrect predictions as a fraction of expected predictions; Table 1), ranging from 0.77 for GENESCAN to 1.96 for GLIMMER.
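The Error Discovery Rate defined above can be computed from simple counts. The sketch below assumes that "incorrect predictions" means false positives plus false negatives, and that "expected predictions" is the number of manually annotated genes; the function name and the counts in the usage example are invented and are not taken from Table 1.

```python
def error_discovery_rate(false_positives: int, false_negatives: int, expected: int) -> float:
    """Incorrect predictions (assumed FP + FN) divided by expected predictions."""
    if expected <= 0:
        raise ValueError("expected must be a positive count")
    return (false_positives + false_negatives) / expected


if __name__ == "__main__":
    # Hypothetical counts: 150 false positives and 10 false negatives
    # against 100 manually annotated genes.
    print(error_discovery_rate(150, 10, 100))
```

Under this reading the rate is not bounded by 1: a program that calls many spurious ORFs can make more incorrect predictions than there are expected genes, which is consistent with values such as 1.96.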

Table 1 Automated gene prediction^a in Leishmania major

Combining the programs improved the Error Discovery Rate, especially in terms of false positives (Table 2). When only ORFs predicted by all four programs were considered, the false positive rate was <1%, but the false negative rate was almost 50%. Including ORFs predicted by only three of the four programs lowered the false negative rate dramatically, to 10%, but the false positive rate rose to >10%. Further relaxation of stringency (two of four programs) resulted in a substantial increase in false positives (78%), with only a modest decrease in false negatives (~5%). Thus, the Error Discovery Rate is lowest (21%) when the consensus prediction of three out of four programs is used. The use of two trained (GLIMMER and CODON USAGE) and two non-trained (TESTCODE and GENESCAN) algorithms reduced both false positives and false negatives.
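The "k of four programs" comparison can be sketched as a simple vote count. In the example below, the ORF identifiers, the function names and the choice to express both error rates relative to the number of annotated genes are assumptions made for illustration; the data are invented and do not reproduce Table 2.

```python
from collections import Counter
from typing import Dict, Hashable, Set, Tuple


def consensus_calls(calls: Dict[str, Set[Hashable]], min_votes: int) -> Set[Hashable]:
    """Return the ORFs predicted by at least `min_votes` programs."""
    votes = Counter(orf for orfs in calls.values() for orf in orfs)
    return {orf for orf, n in votes.items() if n >= min_votes}


def fp_fn_rates(predicted: Set[Hashable], annotated: Set[Hashable]) -> Tuple[float, float]:
    """False positive and false negative counts, each divided by the annotated total."""
    false_pos = len(predicted - annotated)
    false_neg = len(annotated - predicted)
    return false_pos / len(annotated), false_neg / len(annotated)


if __name__ == "__main__":
    # Invented ORF identifiers; "x" is a spurious ORF, "d" is missed by every program.
    calls = {
        "GLIMMER": {"a", "b", "c", "x"},
        "CODON USAGE": {"a", "b", "c"},
        "TESTCODE": {"a", "x"},
        "GENESCAN": {"a", "b", "x"},
    }
    annotated = {"a", "b", "c", "d"}
    for k in (4, 3, 2):
        fp, fn = fp_fn_rates(consensus_calls(calls, k), annotated)
        print(k, fp, fn)
```

Raising `min_votes` trades false positives for false negatives, which is the trade-off described above for the four-, three- and two-program consensus sets.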

Table 2 Automated gene prediction by combination of different methods.

Conclusions

The semi-automated comparative analysis clearly shows that some degree of manual annotation is still necessary in projects where there is a large proportion of novel genes. Manual annotation is time consuming and labor intensive. The ARTEMIS desktop environment, with importation of trained and non-trained non-consensus gene-prediction algorithms, facilitates easy comparison of the results and allows the user to make more-informed decisions when calling protein-coding genes. Thus, this improvised and powerful software, developed using existing gene identification methods and an existing annotation platform, is extremely helpful for whole-genome sequencing projects.

Methods

GLIMMER 2.0 (http://www.tigr.org/software/glimmer/) [14] was trained using predicted protein-coding genes from LmjF chr1 [1] (manual annotations based on TESTCODE and CODON USAGE) and chr4 (manual annotations using HEXAMER and CODON USAGE: A. Ivens, personal communication) using the default settings. The trained GLIMMER was run on LmjF sequence using the default settings with a minimum gene length of 75 amino acids, and the output was parsed into an EMBL-formatted feature table file. These data were imported into ARTEMIS 4.0 (installed on Intel-based Linux or Windows 2000 machines) using the "Read Features Into" option of the "File" menu. This allows the GLIMMER-predicted genes to be displayed as CDS features. The TESTCODE [16], GENESCAN (http://202.41.10.146/public_htmlnew/gs.htm) [17] and CODON USAGE [15] algorithms were re-coded in C++ and the statistical results collected in text files with a single value for each sliding window (100 nt windows, sliding in one-nt increments). The TESTCODE and GENESCAN data were imported into ARTEMIS (http://www.sanger.ac.uk/Software/Artemis/) [18] using the "Add User Plot" option of the "Display" menu, and displayed graphically. This procedure can be used to import results from other sliding-window methods. The CODON USAGE bias statistic, which is coded as part of ARTEMIS, is calculated for the three reading frames of each DNA strand and displayed in different colors, using the "Add Usage Plot" option of the "Display" menu to import Leishmania CODON USAGE tables. Figure 1 shows a panel containing results from each of the four gene-prediction methods for a typical LmjF sequence.
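The sliding-window text files described above can be sketched as follows. This is a minimal illustration, not the C++ code used here: GC fraction stands in for the real statistics (it is neither TESTCODE nor GENESCAN), the function names and the four-decimal output format are assumptions, and how a viewer aligns each value to a base position should be checked against its own documentation.

```python
from typing import Callable, List


def gc_fraction(window: str) -> float:
    """Stand-in statistic: fraction of G or C bases in the window."""
    window = window.upper()
    return sum(base in "GC" for base in window) / len(window)


def sliding_window_scores(
    sequence: str,
    statistic: Callable[[str], float] = gc_fraction,
    window: int = 100,
    step: int = 1,
) -> List[float]:
    """Score every full window of `window` nt, advancing by `step` nt."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(sequence) - window
    return [statistic(sequence[i:i + window]) for i in range(0, last_start + 1, step)]


def write_plot_file(scores: List[float], path: str) -> None:
    """Write one value per line, one line per window."""
    with open(path, "w") as handle:
        for score in scores:
            handle.write(f"{score:.4f}\n")


if __name__ == "__main__":
    # Tiny invented sequence and a 4-nt window so the output is easy to inspect.
    print(sliding_window_scores("GGCCAATT", window=4, step=1))
```

Passing the statistic as a function keeps the windowing and file-writing code unchanged when a different sliding-window method is substituted.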

For automated GENESCAN, TESTCODE and CODON USAGE predictions, genes were called only for those ORFs larger than 100 amino acids with mean scores (over the entire ORF) above thresholds of 4.0, 9.7, and 0, respectively. For overlapping ORFs (on the same or opposite strands), the one with the highest signal was used.
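The calling rule above (length filter, mean-score threshold, then resolution of overlaps by highest signal) can be sketched as below. The `Orf` class and function names are hypothetical, the ORF length is taken as nucleotides divided by three, and overlaps are resolved greedily from the highest-scoring ORF downward; the text does not specify these details, so they are assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Orf:
    """Hypothetical ORF record with 1-based inclusive nucleotide coordinates."""
    start: int
    end: int
    strand: str
    mean_score: float  # mean of the statistic over the entire ORF

    @property
    def length_aa(self) -> int:
        return (self.end - self.start + 1) // 3


def overlaps(a: Orf, b: Orf) -> bool:
    """True if the two ORFs share any position, regardless of strand."""
    return a.start <= b.end and b.start <= a.end


def call_genes(orfs: Sequence[Orf], threshold: float, min_aa: int = 100) -> List[Orf]:
    """Keep ORFs longer than `min_aa` scoring above `threshold`, best-scoring first."""
    candidates = [o for o in orfs if o.length_aa > min_aa and o.mean_score > threshold]
    kept: List[Orf] = []
    for orf in sorted(candidates, key=lambda o: o.mean_score, reverse=True):
        # Greedy resolution: an ORF is dropped if it overlaps a better-scoring one.
        if not any(overlaps(orf, k) for k in kept):
            kept.append(orf)
    return sorted(kept, key=lambda o: o.start)


if __name__ == "__main__":
    # Invented ORFs; 4.0 is the GENESCAN threshold quoted in the text.
    orfs = [
        Orf(1, 600, "+", 5.2),
        Orf(400, 1200, "-", 6.1),
        Orf(1500, 1700, "+", 9.0),  # too short
        Orf(2000, 2900, "+", 3.5),  # below threshold
    ]
    for gene in call_genes(orfs, threshold=4.0):
        print(gene)
```

In this example only the ORF at 400 to 1200 survives: it outscores the overlapping ORF on the opposite strand, and the other two fail the length and score filters.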

References

1. Myler PJ, Audleman L, deVos T, Hixson G, Kiser P, Lemley C, Magness C, Rickell E, Sisk E, Sunkin S, et al.: Leishmania major Friedlin chromosome 1 has an unusual distribution of protein-coding genes. Proc Natl Acad Sci U S A 1999, 96: 2902–2906. 10.1073/pnas.96.6.2902
2. Worthey E, Aggarwal G, Cawthra J, Fazelinia G, Fu G, Hassebrock M, Hixson G, Ivens AC, Kiser P, Marsolini F, et al.: Leishmania major chromosome 3 contains two long "convergent" polycistronic gene clusters separated by a tRNA gene. Nucl Acids Res, in press.
3. Claverie JM: Computational methods for the identification of genes in vertebrate genomic sequences. Hum Mol Genet 1997, 6: 1735–1744. 10.1093/hmg/6.10.1735
4. Fickett JW: The gene identification problem: an overview for developers. Computers Chem 1996, 20: 103–118. 10.1016/S0097-8485(96)80012-X
5. Guigo R: DNA composition, codon usage and exon prediction. In Genetics Databases (Edited by: Bishop M). San Diego: Academic Press, Inc 1999, 53–80.
6. Jones J, Field JK, Risk JM: A comparative guide to gene prediction tools for the bioinformatics amateur. Int J Oncol 2002, 20: 697–705.
7. Mathe C, Sagot MF, Schiex T, Rouze P: Current methods of gene prediction, their strengths and weaknesses. Nucl Acids Res 2002, 30: 4103–4117. 10.1093/nar/gkf543
8. Stormo GD: Gene-finding approaches for eukaryotes. Genome Res 2000, 10: 394–397. 10.1101/gr.10.4.394
9. Burge CB, Karlin S: Finding the genes in genomic DNA. Curr Opin Struct Biol 1998, 8: 346–354. 10.1016/S0959-440X(98)80069-9
10. Aggarwal G, Ramaswamy R: Ab initio gene identification: prokaryote genome annotation with Genescan and Glimmer. J Biosci 2002, 27: 7–14.
11. Yada T, Takagi T, Totoki Y, Sakaki Y, Takaeda Y: DIGIT: a novel gene finding program by combining gene-finders. Pac Symp Biocomput 2003, 375–387.
12. Pavlović V, Garg A, Kasif S: A Bayesian framework for combining gene predictions. Bioinformatics 2002, 18: 19–27. 10.1093/bioinformatics/18.1.19
13. Howe KL, Chothia T, Durbin R: GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Res 2002, 12: 1418–1427. 10.1101/gr.149502
14. Delcher AL, Harmon D, Kasif S, White O, Salzberg SL: Improved microbial gene identification with GLIMMER. Nucl Acids Res 1999, 27: 4636–4641. 10.1093/nar/27.23.4636
15. Staden R, McLachlan AD: Codon preference and its use in identifying protein coding regions in long DNA sequences. Nucl Acids Res 1982, 10: 141–156.
16. Fickett JW: Recognition of protein coding regions in DNA sequences. Nucl Acids Res 1982, 10: 5303–5318.
17. Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R: Prediction of probable genes by Fourier analysis of genomic sequences. Comput Appl Biosci 1997, 13: 263–270.
18. Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream M-A, Barrell B: Artemis: sequence visualisation and annotation. Bioinformatics 2000, 16: 944–945. 10.1093/bioinformatics/16.10.944
19. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410. 10.1006/jmbi.1990.9999

Acknowledgements

The authors thank Kim Rutherford (Wellcome Trust Sanger Institute) for the help and useful discussion. This work was supported by NIH grant AI40599.

Author information

Authors and Affiliations

Seattle Biomedical Research Institute, 4 Nickerson Street, Seattle, WA, 98109, USA
Gautam Aggarwal, EA Worthey & Peter J Myler
Immunex Corporation, 51 University Street, Seattle, WA, 98101, USA
Paul D McDonagh
Departments of Pathobiology and Medical Education and Biomedical Informatics, University of Washington, Seattle, WA, 98195, USA
Peter J Myler

Corresponding author

Correspondence to Peter J Myler.

Additional information

Authors' contributions

GA re-coded the TESTCODE, GENESCAN and CODON USAGE algorithms in C++ for the UNIX environment and performed the automated combined prediction analysis. PDM coded the wrapper for parsing the GLIMMER predictions. All authors read and approved the final manuscript.

Electronic supplementary material

12859_2003_73_MOESM1_ESM.zip

Additional File 1: A zip file containing one Perl script (glimmer_atremis.pl), two executable files (testcode_unix and testcode_win.exe) and a readme.txt file describing usage and other information relevant to the programs. (ZIP 51 KB)

About this article

Cite this article

Aggarwal, G., Worthey, E., McDonagh, P.D. et al. Importing statistical measures into Artemis enhances gene identification in the Leishmania genome project. BMC Bioinformatics 4, 23 (2003). https://doi.org/10.1186/1471-2105-4-23

Received: 19 February 2003
Accepted: 07 June 2003
Published: 07 June 2003
DOI: https://doi.org/10.1186/1471-2105-4-23
