Test Data Sets and Evaluation of Gene Prediction Programs on the Rice Genome

Abstract

With several rice genome projects approaching completion, gene prediction by computer algorithms has become an urgent task. Two test sets were constructed by mapping the newly published 28,469 full-length KOME rice cDNAs to the RGP BAC clone sequences of Oryza sativa ssp. japonica: a single-gene set of 550 sequences and a multi-gene set of 62 sequences containing 271 genes. These data sets were used to evaluate five ab initio gene prediction programs: RiceHMM, GlimmerR, GeneMark, FGENESH and BGF. The predictions were compared at the nucleotide, exon and whole-gene-structure levels using commonly accepted measures and several new measures. The test results show performance improving in the chronological order of the programs. At the same time, the complementarity of the programs hints at the possibility of further improvement and at the feasibility of reaching better performance by combining several gene finders.
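As an illustration of the commonly accepted measures mentioned above, the following is a minimal sketch of nucleotide-level and exon-level sensitivity and specificity. It assumes the convention usual in gene-finding evaluations, where sensitivity is TP/(TP+FN) and "specificity" is TP/(TP+FP). The function names, the inclusive 1-based coordinate format, and the example coordinates are hypothetical and are not taken from the paper, and the new measures introduced by the authors are not reproduced here.

```python
def nucleotide_accuracy(true_exons, pred_exons):
    """Nucleotide-level (sensitivity, specificity) for coding intervals.

    Exons are (start, end) tuples with inclusive coordinates (assumed format).
    """
    true_pos = {p for s, e in true_exons for p in range(s, e + 1)}
    pred_pos = {p for s, e in pred_exons for p in range(s, e + 1)}
    tp = len(true_pos & pred_pos)  # coding bases predicted as coding
    sn = tp / len(true_pos) if true_pos else 0.0  # TP / (TP + FN)
    sp = tp / len(pred_pos) if pred_pos else 0.0  # TP / (TP + FP)
    return sn, sp


def exon_accuracy(true_exons, pred_exons):
    """Exon-level (sensitivity, specificity): both boundaries must match."""
    true_set, pred_set = set(true_exons), set(pred_exons)
    exact = len(true_set & pred_set)  # exons predicted exactly
    sn = exact / len(true_set) if true_set else 0.0
    sp = exact / len(pred_set) if pred_set else 0.0
    return sn, sp


if __name__ == "__main__":
    # Hypothetical annotation and prediction for a single two-exon gene.
    annotated = [(100, 200), (300, 400)]
    predicted = [(100, 200), (310, 400)]
    print("nucleotide level:", nucleotide_accuracy(annotated, predicted))
    print("exon level:", exon_accuracy(annotated, predicted))
```

In this toy case the second predicted exon misses its start by ten bases, so nucleotide-level accuracy stays high while exon-level accuracy drops to one exact match out of two, which is why evaluations of this kind report several levels side by side.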

Author information

Corresponding author

Correspondence to Bai-Lin Hao.

About this article

Cite this article

Li, H., Liu, J., Xu, Z. et al. Test Data Sets and Evaluation of Gene Prediction Programs on the Rice Genome. J Comput Sci Technol 20, 446–453 (2005). https://doi.org/10.1007/s11390-005-0446-x

Keywords

  • gene prediction
  • rice genome
  • test sets
  • accuracy measures
  • hidden Markov models
  • dynamic programming