Comparative Gene Prediction Based on Gene Structure Conservation

  • Shu Ju Hsieh
  • Chun Yuan Lin
  • Ning Han Liu
  • Chuan Yi Tang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4146)


Identifying protein-coding genes is one of the most important tasks in analyzing newly sequenced genomes. With an increasing number of gene annotations verified by experiments, it is feasible to identify genes in a newly sequenced genome by comparing it with genes annotated in phylogenetically close organisms. Here, we propose a program, GeneAlign, which predicts the genes on one sequence by measuring the similarity between the predicted sequence and related genes annotated on another genome. The program applies CORAL, a heuristic linear-time alignment tool, to determine whether the regions flanked by candidate signals are similar to the annotated exons. The approach, which exploits the conservation of gene structures and the sequence homology between protein-coding regions, increases the prediction accuracy. GeneAlign was tested on the Projector data set of 449 human-mouse homologous sequence pairs. At the gene level, the sensitivity and specificity of GeneAlign are both 80%; at the exon level, both are above 96%.
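The exon-level figures quoted above can be illustrated with a minimal sketch. It assumes the convention common in gene-prediction evaluation, where sensitivity is the fraction of annotated exons that are predicted exactly and specificity is the fraction of predicted exons that are correct. The function name `exon_level_accuracy`, the `(start, end)` exon representation, and the coordinates are hypothetical and are not taken from GeneAlign or from the Projector data set.

```python
from typing import Set, Tuple

# Hypothetical representation: an exon is a (start, end) coordinate pair.
Exon = Tuple[int, int]


def exon_level_accuracy(predicted: Set[Exon], annotated: Set[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, specificity), counting only exact boundary matches.

    Assumed definitions (not confirmed by the abstract):
      sensitivity = correct exons / annotated exons
      specificity = correct exons / predicted exons
    """
    true_positives = len(predicted & annotated)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Made-up coordinates for illustration only.
annotated = {(100, 200), (300, 420), (500, 610), (700, 800)}
predicted = {(100, 200), (300, 420), (505, 610)}

sn, sp = exon_level_accuracy(predicted, annotated)
print(f"Sn={sn:.2f} Sp={sp:.2f}")  # prints "Sn=0.50 Sp=0.67"
```

In this toy case two of the four annotated exons are recovered exactly, and two of the three predictions are correct; the exon whose start is off by five bases counts as a miss under exact matching.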


Keywords: Gene Prediction; Translation Initiation Site; Splice Signal; Exon Level; Annotate Exon
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Shu Ju Hsieh (1)
  • Chun Yuan Lin (2)
  • Ning Han Liu (1)
  • Chuan Yi Tang (1)
  1. Department of Computer Science
  2. Institute of Molecular and Cellular Biology, National Tsing-Hua University, Hsinchu, Taiwan, ROC
