Gene Prediction

  • Tyler Alioto
Part of the Methods in Molecular Biology book series (MIMB, volume 855)


Evolutionary genomics relies heavily on comparing genomes, that is, comparing the full complement of genes of one species with that of another. However, given a genome sequence and little else, as is now often the case, genes must first be found and annotated before downstream analyses can be done. Because manual annotation is extremely time-consuming and costly, computational gene prediction techniques are brought to bear on the problem of constructing a genome annotation. This chapter reviews the methods by which the individual components of a typical gene structure are detected in genomic sequence and then discusses several popular statistical frameworks for integrated gene prediction on eukaryotic genome sequences.
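As a rough illustration of what detecting one component of a gene structure in raw sequence can mean, the sketch below scans the three forward reading frames for open reading frames (a start codon followed in frame by a stop codon). It is a deliberately naive example and not a method taken from the chapter: it ignores introns, the reverse strand, and any coding statistics, all of which matter for eukaryotic genes. The function name `find_orfs` and the `min_codons` threshold are hypothetical choices made for this example, and the stop codons assume the standard genetic code.

```python
# Illustrative sketch only: a naive forward-strand ORF scan.
# Assumes the standard genetic code; names and thresholds are hypothetical.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return a list of (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the ATG, end is the index just past the
    stop codon, and frame is 0, 1, or 2. Only the first ATG after the
    previous stop in each frame is reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        # Step through complete codons in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None:
                if codon == START:
                    orf_start = i
            elif codon in STOPS:
                # Count codons from the ATG up to, not including, the stop.
                n_codons = (i - orf_start) // 3
                if n_codons >= min_codons:
                    orfs.append((orf_start, i + 3, frame))
                orf_start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAACCCGGGTAACC"))
```

A scan like this can only propose candidate coding regions. The integrated frameworks named in the abstract exist because such signals must be combined and scored together before they yield a usable gene model.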

Key words

Gene prediction, Dynamic programming, Hidden Markov model, Conditional random field, Coding statistics, Coding potential, Genome annotation, Markov chain



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Centro Nacional de Análisis Genómico, Barcelona, Spain
