Eugène: An Eukaryotic Gene Finder That Combines Several Sources of Evidence

  • Thomas Schiex
  • Annick Moisan
  • Pierre Rouzé
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2066)

Abstract

In this paper, we describe the basis of EuGène, a gene finder for eukaryotic organisms applied to Arabidopsis thaliana. Unlike existing gene-finding software, EuGène has been designed to combine the output of several information sources, including the output of other software and user-supplied information. To achieve this, a weighted directed acyclic graph (DAG) is built in such a way that a shortest feasible path in this graph represents the most likely gene structure of the underlying DNA sequence.
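
To make the shortest-path formulation concrete, here is a minimal Python sketch of the standard linear-time shortest-path computation on a weighted DAG. It is not EuGène's actual graph: the function name, the integer vertex numbering, and the toy edge weights are assumptions made for illustration. In a gene-finding setting one could think of each weight as a penalty (for instance a negative log-probability) attached to a step in a candidate gene structure.

```python
from math import inf


def dag_shortest_path(n, edges, source, target):
    """Shortest path in a weighted DAG whose vertices 0..n-1 are already
    numbered in topological order (every edge goes from u to v with u < v).

    Runs in O(n + len(edges)) time by relaxing edges in vertex order.
    Returns (cost, path); cost is inf if target is unreachable.
    """
    out = [[] for _ in range(n)]
    for u, v, w in edges:
        if not u < v:
            raise ValueError("edges must follow the topological order")
        out[u].append((v, w))

    dist = [inf] * n
    pred = [None] * n
    dist[source] = 0.0

    for u in range(n):
        if dist[u] == inf:
            continue  # u is not reachable from source
        for v, w in out[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u

    # Walk predecessor links back from the target to recover the path.
    path = []
    v = target
    while v is not None:
        path.append(v)
        v = pred[v]
    path.reverse()
    return dist[target], path


# Toy graph (hypothetical weights): the cheapest route from 0 to 3 is 0-1-2-3.
toy_edges = [(0, 1, 1.0), (0, 2, 4.0), (1, 2, 1.0), (1, 3, 5.0), (2, 3, 1.0)]
print(dag_shortest_path(4, toy_edges, 0, 3))  # (3.0, [0, 1, 2, 3])
```

Because each vertex and edge is visited once, the running time grows linearly with the size of the graph.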

The usual linear-time Bellman shortest-path algorithm for DAGs has been replaced by a shortest-path algorithm with constraints. The constraints express the minimum length of introns or intergenic regions. Because of the specific form of these constraints, the algorithm remains linear in both time and space.

EuGène's effectiveness has been assessed on Araset, a recent dataset of Arabidopsis thaliana sequences used to evaluate several existing gene-finding programs. Despite its simplicity, EuGène gives results that compare very favourably with those of existing software. We try to analyse the reasons for these results.
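
The effect of a minimum-length constraint on a linear-time dynamic program can be illustrated with a deliberately simplified model. The Python sketch below is not the algorithm from the paper: it assumes only two labels, a hypothetical unconstrained label A and a label B whose runs must span at least `min_len_b` positions, with made-up per-position costs and a fixed switching cost. Prefix sums let a whole minimum-length run be opened in constant time, so the computation stays linear in time and space.

```python
from math import inf


def min_cost_labelling(cost_a, cost_b, switch, min_len_b):
    """Label each position 'A' or 'B' at minimum total cost, where every
    run of consecutive 'B' must be at least min_len_b (>= 1) long.
    The sequence is treated as preceded by 'A'; each change of label
    costs `switch`. Returns (cost, labels) in O(n) time and space.
    """
    n, L = len(cost_a), min_len_b
    pre_b = [0.0]  # pre_b[i] = sum of cost_b over the first i positions
    for c in cost_b:
        pre_b.append(pre_b[-1] + c)

    # end_a[i] / end_b[i]: best cost for the first i positions when
    # position i is labelled A / B (the B run ending at i is long enough).
    end_a = [inf] * (n + 1)
    end_b = [inf] * (n + 1)
    end_a[0] = 0.0
    from_b = [False] * (n + 1)  # end_a[i] was reached by leaving a B run
    opened = [False] * (n + 1)  # end_b[i] is a run of exactly L positions

    for i in range(1, n + 1):
        stay, leave_b = end_a[i - 1], end_b[i - 1] + switch
        from_b[i] = leave_b < stay
        end_a[i] = min(stay, leave_b) + cost_a[i - 1]

        extend = end_b[i - 1] + cost_b[i - 1]
        open_run = inf
        if i >= L:  # open a run covering positions i-L+1 .. i in one step
            open_run = end_a[i - L] + switch + pre_b[i] - pre_b[i - L]
        opened[i] = open_run < extend
        end_b[i] = min(extend, open_run)

    # Trace back the decisions to recover the labels.
    labels, i = [], n
    state = 'A' if end_a[n] <= end_b[n] else 'B'
    while i > 0:
        if state == 'A':
            labels.append('A')
            state = 'B' if from_b[i] else 'A'
            i -= 1
        elif opened[i]:
            labels.extend('B' * L)
            i -= L
            state = 'A'
        else:
            labels.append('B')
            i -= 1
    return min(end_a[n], end_b[n]), ''.join(reversed(labels))


# Hypothetical costs: only position 3 strongly prefers label B.
cost_a = [0, 0, 5, 0, 0, 0]
cost_b = [2, 1, 0, 1, 2, 2]
print(min_cost_labelling(cost_a, cost_b, 0.5, 1))  # (1.0, 'AABAAA')
print(min_cost_labelling(cost_a, cost_b, 0.5, 3))  # (3.0, 'ABBBAA')
```

With no effective constraint (`min_len_b=1`) a single B is placed at the third position; requiring runs of at least three forces the cheapest three-position window around it instead.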

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Thomas Schiex (1)
  • Annick Moisan (1)
  • Pierre Rouzé (2)
  1. INRA, Toulouse, France
  2. INRA, Ghent, Belgium
