Single Species Gene Finding

Comparative Gene Finding

Part of the book series: Computational Biology ((COBO,volume 11))

Abstract

Gene finding software typically consists of a main algorithm that serves as an umbrella for a large number of rather complex submodels. The submodels represent various features of a gene, such as exons, introns, and splice sites. Each submodel scores the probability, or likelihood, that a given sequence region constitutes the corresponding gene feature, and these scores are then passed up to the main algorithm. The main algorithm integrates the scores and parses the input sequence into a set of gene predictions. This chapter covers five of the mathematical models most commonly used as main algorithms in single species gene finding: hidden Markov models, generalized hidden Markov models, interpolated Markov models, neural networks, and decision trees. Each model is described in algorithmic detail, and each model section concludes with an example of a gene finder that uses the model in question.
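
As a rough illustration of a main algorithm that integrates per-position scores and parses a sequence into labeled regions, the following minimal sketch runs the Viterbi algorithm on a toy two-state hidden Markov model. It is not taken from the chapter: the state names, the probabilities in `START`, `TRANS`, and `EMIT`, and the function name `viterbi` are all hypothetical values chosen for illustration, and a real gene finder would use many more states and trained parameters.

```python
import math

# Hypothetical toy parameters, chosen only for illustration (not from the chapter).
STATES = ("noncoding", "coding")
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
# Emission probabilities play the role of a very simple content submodel:
# the toy "coding" state favors G/C, the "noncoding" state favors A/T.
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    # score[s] holds the best log-probability of any path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Pick the predecessor state that maximizes the path score into s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state to recover the full parse.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("ATATATGCGCGCGCGCATATAT"))
```

Working in log space avoids numerical underflow on long sequences, which is why the sketch adds logarithms instead of multiplying probabilities.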

Author information

Corresponding author

Correspondence to Marina Axelson-Fisk.

Copyright information

© 2010 Springer-Verlag London

About this chapter

Cite this chapter

Axelson-Fisk, M. (2010). Single Species Gene Finding. In: Comparative Gene Finding. Computational Biology, vol 11. Springer, London. https://doi.org/10.1007/978-1-84996-104-2_2

  • DOI: https://doi.org/10.1007/978-1-84996-104-2_2

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84996-103-5

  • Online ISBN: 978-1-84996-104-2

  • eBook Packages: Computer Science, Computer Science (R0)
