Computational Methods for Ab Initio and Comparative Gene Finding

Picardi, Ernesto; Pesole, Graziano

doi:10.1007/978-1-60327-241-4_16

Part of the book series: Methods in Molecular Biology (MIMB, volume 609)

Abstract

High-throughput DNA sequencing is increasing the number of publicly available complete genomes, even though a precise gene catalogue is not yet available for each organism. In this context, computational gene finders play a key role in producing a first, cost-effective annotation. Many gene prediction tools are now available to the scientific community and, despite their number, they can be divided into two main categories: (1) ab initio and (2) evidence based. In the following, we provide an overview of the main methodologies in these categories for predicting correct exon–intron structures of eukaryotic genes. We also consider newer strategies that refine ab initio predictions using comparative genomics or other evidence such as expression data. Finally, we briefly introduce metrics for the in-house evaluation of gene predictions in terms of sensitivity and specificity at the nucleotide, exon, and gene levels.
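
As a rough illustration of the evaluation metrics mentioned above, the following minimal sketch computes sensitivity and specificity at the nucleotide and exon levels. It is not taken from the chapter. It assumes the convention common in the gene-finding literature, where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP), and it assumes exons are given as 1-based inclusive (start, end) pairs on a single strand. The function names and the input format are hypothetical.

```python
def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the nucleotide level.

    annotated, predicted: lists of (start, end) exon intervals,
    1-based and inclusive (assumed input format).
    """
    def to_positions(exons):
        # Expand each exon interval into the set of coding positions.
        positions = set()
        for start, end in exons:
            positions.update(range(start, end + 1))
        return positions

    true_coding = to_positions(annotated)
    pred_coding = to_positions(predicted)

    tp = len(true_coding & pred_coding)   # coding, predicted coding
    fn = len(true_coding - pred_coding)   # coding, missed
    fp = len(pred_coding - true_coding)   # non-coding, predicted coding

    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


def exon_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the exon level.

    An exon counts as correct only if both boundaries match exactly.
    """
    true_exons = set(annotated)
    pred_exons = set(predicted)
    exact = len(true_exons & pred_exons)

    sn = exact / len(true_exons) if true_exons else 0.0
    sp = exact / len(pred_exons) if pred_exons else 0.0
    return sn, sp


if __name__ == "__main__":
    annotated = [(1, 10), (21, 30)]
    predicted = [(1, 10), (25, 34)]
    print(nucleotide_accuracy(annotated, predicted))
    print(exon_accuracy(annotated, predicted))
```

Expanding intervals into position sets keeps the sketch short but is only practical for small sequences. A gene-level measure would follow the same pattern, counting a gene as correct only when all of its exons match.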

Acknowledgments

This work was supported by the projects VIGNA (Ministero Politiche Agricole e Forestali), LIBI – Laboratorio Internazionale di Bioinformatica (Fondo Italiano Ricerca di Base, Ministero dell’Università e della Ricerca), Laboratorio per la Bioinformatica e la Biodiversità Molecolare (Ministero dell’Università e della Ricerca), Telethon, and AIRC.

Author information

Authors and Affiliations

Dipartimento di Biochimica e Biologia Molecolare “E. Quagliariello”, University of Bari, Bari, Italy
Ernesto Picardi & Graziano Pesole

Editor information

Editors and Affiliations

Max F. Perutz Laboratories GmbH, Universität Wien, Dr. Bohr-Gasse 9, Wien, 1030, Austria
Oliviero Carugo
Agency for Science, Technology and Research (A*STAR), Biopolis Street 30, Singapore, 138671, Singapore
Frank Eisenhaber

Cite this protocol

Picardi, E., Pesole, G. (2010). Computational Methods for Ab Initio and Comparative Gene Finding. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 609. Humana Press. https://doi.org/10.1007/978-1-60327-241-4_16

DOI: https://doi.org/10.1007/978-1-60327-241-4_16
Published: 30 October 2009
Publisher Name: Humana Press
Print ISBN: 978-1-60327-240-7
Online ISBN: 978-1-60327-241-4
eBook Packages: Springer Protocols
