Bayesian Detection of Coding Regions in DNA/RNA Sequences Through Event Factoring

  • Renatha Oliva Capua
  • Helena Cristina da Gama Leitão
  • Jorge Stolfi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4756)


We describe a Bayesian inference method for identifying protein-coding regions (active or residual) in DNA or RNA sequences. Its main feature is the way it computes the conditional and a priori probabilities required by Bayes’s formula: each event (a possible annotation of a nucleotide string) is factored into a concatenation of shorter events that are believed to be independent. This factoring yields fast but reliable estimates of these parameters from readily available databases, whereas estimating probabilities for unfactored events would require databases and tables of astronomical size. Tests with natural and artificial genomes gave promising results.
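As a rough illustration of the factoring idea, the sketch below applies Bayes’s formula to a short nucleotide string, approximating the likelihood of the whole string under each annotation by a product of per-codon probabilities. It is not the authors' implementation: splitting into codons, the function name `posterior`, the labels, and all probability values are assumptions made up for this example, since the abstract does not state how events are actually factored.

```python
import math


def posterior(segment, likelihoods, priors):
    """Posterior probability of each label for `segment`, by Bayes's formula.

    likelihoods[label][codon] is Pr(codon | label); priors[label] is Pr(label).
    The likelihood of the whole segment is factored into a product over its
    codons, treated as independent (an illustrative assumption).
    """
    usable = len(segment) - len(segment) % 3  # ignore a trailing partial codon
    codons = [segment[i:i + 3] for i in range(0, usable, 3)]
    log_joint = {}
    for label, prior in priors.items():
        table = likelihoods[label]  # raises KeyError for codons not in the table
        log_joint[label] = math.log(prior) + sum(math.log(table[c]) for c in codons)
    # Normalize in log space to avoid underflow on long segments.
    top = max(log_joint.values())
    weights = {label: math.exp(v - top) for label, v in log_joint.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Toy tables with invented values, covering only the codons used below.
toy_likelihoods = {
    "coding": {"ATG": 0.6, "GCC": 0.3, "TAA": 0.1},
    "noncoding": {"ATG": 0.2, "GCC": 0.2, "TAA": 0.6},
}
toy_priors = {"coding": 0.5, "noncoding": 0.5}

post = posterior("ATGGCC", toy_likelihoods, toy_priors)
```

The point of the factoring is visible in the table sizes: a per-codon table needs at most 64 entries per label, while a table indexed by whole strings of length n would need 4^n entries.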


coding regions ab-initio DNA tagging Bayesian inference 



Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Renatha Oliva Capua (1)
  • Helena Cristina da Gama Leitão (1)
  • Jorge Stolfi (2)
  1. Institute of Computing, Federal Fluminense University (UFF), Rua Passo da Pátria, 156, Bloco E – 24210-240 Niterói, RJ, Brazil
  2. Institute of Computing, University of Campinas (UNICAMP), Cx. Postal 6176 – 13081-970 Campinas, SP, Brazil
