Computational Methods for Ab Initio and Comparative Gene Finding

Data Mining Techniques for the Life Sciences

Part of the book series: Methods in Molecular Biology ((MIMB,volume 609))

Abstract

High-throughput DNA sequencing is increasing the number of publicly available complete genomes, even though a precise gene catalogue is not yet available for each organism. In this context, computational gene finders play a key role in producing a first, cost-effective annotation. Many gene prediction tools are now available to the scientific community and, despite their number, they fall into two main categories: (1) ab initio and (2) evidence based. In the following, we provide an overview of the main methodologies in these categories for predicting correct exon–intron structures of eukaryotic genes. We also consider newer strategies that refine ab initio predictions using comparative genomics or other evidence such as expression data. Finally, we briefly introduce metrics for the in-house evaluation of gene predictions in terms of sensitivity and specificity at the nucleotide, exon, and gene levels.
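As a rough illustration of the evaluation metrics mentioned above, the following minimal sketch computes sensitivity and specificity at the nucleotide and exon levels. It assumes the convention common in the gene-prediction literature, where sensitivity is TP/(TP+FN) and specificity is TP/(TP+FP), and it assumes exons are given as 1-based inclusive (start, end) pairs on a single strand. The function names and the toy coordinates are hypothetical and are not taken from the chapter.

```python
def covered_positions(exons):
    """Return the set of nucleotide positions covered by inclusive (start, end) exons."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def sn_sp(true_positives, n_annotated, n_predicted):
    """Sensitivity = TP / annotated items, specificity = TP / predicted items."""
    sn = true_positives / n_annotated if n_annotated else 0.0
    sp = true_positives / n_predicted if n_predicted else 0.0
    return sn, sp


def nucleotide_level(annotated, predicted):
    """Compare annotation and prediction base by base."""
    ann = covered_positions(annotated)
    pred = covered_positions(predicted)
    return sn_sp(len(ann & pred), len(ann), len(pred))


def exon_level(annotated, predicted):
    """Count a predicted exon as correct only if both boundaries match exactly."""
    ann, pred = set(annotated), set(predicted)
    return sn_sp(len(ann & pred), len(ann), len(pred))


# Hypothetical toy data: three annotated exons and three predicted exons.
annotated = [(100, 199), (300, 399), (500, 599)]
predicted = [(100, 199), (310, 399), (700, 749)]

nt_sn, nt_sp = nucleotide_level(annotated, predicted)
ex_sn, ex_sp = exon_level(annotated, predicted)
print(f"nucleotide level: Sn={nt_sn:.3f} Sp={nt_sp:.3f}")
print(f"exon level:       Sn={ex_sn:.3f} Sp={ex_sp:.3f}")
```

In this toy case the second predicted exon overlaps its annotated counterpart but misses the start boundary, so it contributes to the nucleotide-level scores and not to the exon-level ones. A gene-level measure could follow the same pattern by counting a gene as correct only when all of its exons match.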



Acknowledgments

This work was supported by the projects VIGNA (Ministero Politiche Agricole e Forestali), LIBI – Laboratorio Internazionale di Bioinformatica (Fondo Italiano Ricerca di Base, Ministero dell’Università e della Ricerca), Laboratorio per la Bioinformatica e la Biodiversità Molecolare (Ministero dell’Università e della Ricerca), Telethon, and AIRC.


Copyright information

© 2010 Humana Press, a part of Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Picardi, E., Pesole, G. (2010). Computational Methods for Ab Initio and Comparative Gene Finding. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 609. Humana Press. https://doi.org/10.1007/978-1-60327-241-4_16


  • DOI: https://doi.org/10.1007/978-1-60327-241-4_16

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-60327-240-7

  • Online ISBN: 978-1-60327-241-4

  • eBook Packages: Springer Protocols
