Computational Gene Annotation in New Genome Assemblies Using GeneID

  • Enrique Blanco
  • Josep F. Abril
Part of the Methods in Molecular Biology book series (MIMB, volume 537)


The sequences of many eukaryotic genomes are now available from a personal computer to any researcher in the worldwide scientific community. However, these sequences are worthless without adequate annotation of their biologically meaningful elements. Gene annotation, in particular, is a challenging task that cannot be tackled without the aid of specific bioinformatics tools. In this chapter we present a simple protocol, based mainly on the program GeneID in combination with other computational tools, for locating in the recently assembled genome of D. yakuba a gene that was previously annotated in D. melanogaster.

Key words

Gene finding, gene, protein, genome, exon, intron, sequence alignment, annotation pipeline



Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Enrique Blanco (1)
  • Josep F. Abril (1)
  1. Departament de Genètica, Facultat de Biologia, Universitat de Barcelona, Barcelona, Spain
