Single Species Gene Finding

Axelson-Fisk, Marina

doi:10.1007/978-1-84996-104-2_2

Part of the book series: Computational Biology (COBO, volume 11)

Abstract

Gene finding software typically consists of a main algorithm that serves as an umbrella for a large number of rather complex submodels. The submodels represent various features of a gene, such as exons, introns, and splice sites. Each submodel scores the probability, or likelihood, that a given sequence region constitutes the corresponding gene feature, and these scores are passed up to the main algorithm. The main algorithm integrates the scores and parses the input sequence into a set of gene predictions. This chapter covers five of the mathematical models most commonly used as main algorithms in single species gene finding: hidden Markov models, generalized hidden Markov models, interpolated Markov models, neural networks, and decision trees. Each model is described in algorithmic detail, and each model section closes with an example of a gene finder that uses the model in question.
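As a rough illustration of the "score, then parse" idea for the simplest of the five models, the sketch below runs Viterbi decoding on a two-state hidden Markov model. It is not taken from the chapter: the state names (`N` for noncoding, `C` for coding), the function name `viterbi`, and every probability in `START`, `TRANS`, and `EMIT` are invented for illustration. A real gene finder would use many more states and parameters estimated from training data.

```python
import math

# Hypothetical two-state model: 'N' (noncoding) and 'C' (coding).
# All probabilities below are invented for illustration only.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {
    "N": {"N": 0.9, "C": 0.1},
    "C": {"N": 0.1, "C": 0.9},
}
EMIT = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich background (assumed)
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},  # GC-rich "coding" (assumed)
}


def viterbi(seq):
    """Return the most probable state path for seq as a string of state labels."""
    if not seq:
        return ""
    # score[s] holds the best log-probability of any path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full parse.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    # Each output character labels the corresponding input base.
    print(viterbi("AAAAAAAAAAGGGGGGGGGG"))
```

Here the emission tables play the role of the submodels (they score how well each base fits each feature), and the dynamic programming loop plays the role of the main algorithm that combines those scores into a single parse of the sequence.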

Author information

Authors and Affiliations

Dept. Mathematical Sciences, Chalmers University of Technology, Eklandgatan 86, 412 96, Göteborg, Sweden
Dr. Marina Axelson-Fisk

Corresponding author

Correspondence to Marina Axelson-Fisk.

About this chapter

Cite this chapter

Axelson-Fisk, M. (2010). Single Species Gene Finding. In: Comparative Gene Finding. Computational Biology, vol 11. Springer, London. https://doi.org/10.1007/978-1-84996-104-2_2

DOI: https://doi.org/10.1007/978-1-84996-104-2_2
Published: 30 January 2010
Publisher Name: Springer, London
Print ISBN: 978-1-84996-103-5
Online ISBN: 978-1-84996-104-2
eBook Packages: Computer Science, Computer Science (R0)
