EuGène: An Eukaryotic Gene Finder That Combines Several Sources of Evidence
In this paper, we describe the basis of EuGène, a gene finder for eukaryotic organisms, applied here to Arabidopsis thaliana. What distinguishes EuGène from existing gene-finding software is that it has been designed to combine the output of several information sources, including the output of other software and user-supplied information. To achieve this, a weighted directed acyclic graph (DAG) is built in such a way that a shortest feasible path in this graph represents the most likely gene structure of the underlying DNA sequence.
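As a rough illustration of why a shortest path can stand for a most likely structure, the sketch below assumes that each edge weight is the negative logarithm of a probability, so that minimising the sum of weights maximises the product of probabilities. The graph layout, the function name `shortest_path_dag`, and the example probabilities are hypothetical and are not taken from EuGène; the relaxation in topological order is the standard linear-time scheme for DAGs.

```python
import math

def shortest_path_dag(n, edges, source, target):
    """Shortest path in a DAG whose vertices 0..n-1 are already in
    topological order (every edge goes from a lower to a higher index).
    edges is a list of (u, v, weight). Runs in O(n + len(edges))."""
    out = [[] for _ in range(n)]
    for u, v, w in edges:
        out[u].append((v, w))

    dist = [math.inf] * n
    pred = [None] * n
    dist[source] = 0.0
    # Relax outgoing edges of each vertex once, in topological order.
    for u in range(n):
        if dist[u] == math.inf:
            continue  # vertex not reachable from the source
        for v, w in out[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u

    if dist[target] == math.inf:
        return math.inf, []
    # Walk the predecessor links back from the target.
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = pred[node]
    path.reverse()
    return dist[target], path

# Hypothetical toy graph: weights are -log(probability) (an assumption).
toy_edges = [
    (0, 1, -math.log(0.9)),
    (1, 3, -math.log(0.5)),
    (0, 2, -math.log(0.6)),
    (2, 3, -math.log(0.9)),
]
cost, path = shortest_path_dag(4, toy_edges, 0, 3)
likelihood = math.exp(-cost)  # product of the probabilities along the path
```

In this toy graph the path through vertex 2 wins because 0.6 × 0.9 exceeds 0.9 × 0.5, even though its first edge is individually less likely.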
The usual simple linear-time Bellman shortest-path algorithm for DAGs has been replaced by a shortest-path algorithm with constraints. The constraints express the minimum length of introns or intergenic regions. Because of the specific form of these constraints, the algorithm remains linear in both time and space.
EuGène's effectiveness has been assessed on Araset, a recent dataset of Arabidopsis thaliana sequences used to evaluate several existing gene-finding programs. Despite its simplicity, EuGène gives results that compare very favourably with existing software. We try to analyse the reasons for these results.
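To give a feel for how a minimum-length constraint can be enforced without losing linearity, here is a minimal sketch on a deliberately simplified model. It is not the algorithm of the paper: it assumes only two states per sequence position, an unconstrained state "A" and a state "B" whose runs must span at least `min_b` positions (loosely analogous to introns or intergenic regions), with hypothetical per-position costs `cost_a` and `cost_b`. Prefix sums let a whole minimum-length run be charged in constant time, so the dynamic programme stays linear in time and space.

```python
import math

def segment_with_min_length(cost_a, cost_b, min_b):
    """Label each position 'A' or 'B' at minimum total cost, where every
    maximal run of 'B' has length >= min_b (min_b >= 1). O(n) time/space."""
    n = len(cost_a)
    prefix_b = [0.0] * (n + 1)
    for i in range(n):
        prefix_b[i + 1] = prefix_b[i] + cost_b[i]

    # best_a[i]: best cost of the first i positions ending in state A.
    # best_b[i]: same, ending in a B run that already has length >= min_b.
    best_a = [math.inf] * (n + 1)
    best_b = [math.inf] * (n + 1)
    back_a = [None] * (n + 1)
    back_b = [None] * (n + 1)
    best_a[0] = 0.0  # the empty prefix counts as "ending in A"

    for i in range(1, n + 1):
        if best_a[i - 1] <= best_b[i - 1]:
            best_a[i], back_a[i] = best_a[i - 1] + cost_a[i - 1], "A"
        else:
            best_a[i], back_a[i] = best_b[i - 1] + cost_a[i - 1], "B"
        # Either extend a valid B run, or open a run of exactly min_b
        # positions right after a prefix that ends in A.
        extend = best_b[i - 1] + cost_b[i - 1]
        start = math.inf
        if i >= min_b:
            start = best_a[i - min_b] + prefix_b[i] - prefix_b[i - min_b]
        if extend <= start:
            best_b[i], back_b[i] = extend, "extend"
        else:
            best_b[i], back_b[i] = start, "start"

    # Trace the chosen labels back from the end of the sequence.
    labels = [None] * n
    state = "A" if best_a[n] <= best_b[n] else "B"
    i = n
    while i > 0:
        if state == "A":
            labels[i - 1] = "A"
            state = back_a[i]
            i -= 1
        elif back_b[i] == "extend":
            labels[i - 1] = "B"
            i -= 1
        else:
            for k in range(i - min_b, i):
                labels[k] = "B"
            i -= min_b
            state = "A"
    return min(best_a[n], best_b[n]), "".join(labels)

# Hypothetical costs: only the third position strongly favours state B.
cost_a = [0, 0, 5, 0, 0, 0]
cost_b = [2, 1, 0, 1, 3, 2]
unconstrained = segment_with_min_length(cost_a, cost_b, 1)
constrained = segment_with_min_length(cost_a, cost_b, 3)
```

With `min_b = 1` the single cheap position is labelled "B" on its own; with `min_b = 3` the run has to be widened to three positions, and the sketch picks the cheapest window containing it.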