Abstract
With several rice genome projects approaching completion, gene prediction by computer algorithms has become an urgent task. Two test sets were constructed by mapping the 28,469 newly published full-length KOME rice cDNAs to the RGP BAC clone sequences of Oryza sativa ssp. japonica: a single-gene set of 550 sequences and a multi-gene set of 62 sequences containing 271 genes. These data sets were used to evaluate five ab initio gene prediction programs: RiceHMM, GlimmerR, GeneMark, FGENESH and BGF. The predictions were compared at the nucleotide, exon and whole-gene-structure levels using commonly accepted measures and several new ones. The test results show that performance improves with the programs' chronological order, i.e. more recent programs tend to perform better. At the same time, the complementarity of the programs suggests that further improvement is possible and that better performance could be reached by combining several gene finders.
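To make the three comparison levels concrete, the sketch below computes sensitivity and specificity at the nucleotide and exon levels and an exact-match check at the gene level. It assumes the conventional definitions from the gene-finding literature (sensitivity = TP / annotated, specificity = TP / predicted); the abstract does not spell out its formulas, and the new measures it mentions are not reproduced here. The function names and the coordinate convention (1-based, inclusive exon intervals) are illustrative assumptions, not taken from the paper.

```python
from typing import List, Set, Tuple

# Assumption: an exon is a (start, end) pair, 1-based and inclusive.
Exon = Tuple[int, int]


def coding_positions(exons: List[Exon]) -> Set[int]:
    """Return the set of nucleotide positions covered by the exons."""
    positions: Set[int] = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def sn_sp(true_pos: int, n_annotated: int, n_predicted: int) -> Tuple[float, float]:
    """Sensitivity = TP / annotated; specificity = TP / predicted.

    Note: 'specificity' here follows gene-finding usage, i.e. what is
    elsewhere called precision. Empty denominators yield 0.0.
    """
    sn = true_pos / n_annotated if n_annotated else 0.0
    sp = true_pos / n_predicted if n_predicted else 0.0
    return sn, sp


def nucleotide_level(annotated: List[Exon], predicted: List[Exon]) -> Tuple[float, float]:
    """Compare coding nucleotides position by position."""
    a, p = coding_positions(annotated), coding_positions(predicted)
    return sn_sp(len(a & p), len(a), len(p))


def exon_level(annotated: List[Exon], predicted: List[Exon]) -> Tuple[float, float]:
    """An exon counts as correct only if both boundaries match exactly."""
    a, p = set(annotated), set(predicted)
    return sn_sp(len(a & p), len(a), len(p))


def gene_level_correct(annotated: List[Exon], predicted: List[Exon]) -> bool:
    """A gene counts as correct only if every exon is predicted exactly."""
    return sorted(annotated) == sorted(predicted)


if __name__ == "__main__":
    # Hypothetical gene: the second predicted exon overshoots by 20 bases.
    annotated = [(101, 200), (301, 400), (501, 600)]
    predicted = [(101, 200), (301, 420), (501, 600)]
    print(nucleotide_level(annotated, predicted))
    print(exon_level(annotated, predicted))
    print(gene_level_correct(annotated, predicted))
```

The example illustrates why the levels are reported separately: a single misplaced splice site barely changes the nucleotide-level scores, yet it costs one exon and the whole gene at the stricter levels.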
Li, H., Liu, JS., Xu, Z. et al. Test Data Sets and Evaluation of Gene Prediction Programs on the Rice Genome. J Comput Sci Technol 20, 446–453 (2005). https://doi.org/10.1007/s11390-005-0446-x
DOI: https://doi.org/10.1007/s11390-005-0446-x