Spliced alignment: A new approach to gene recognition

  • Conference paper
Combinatorial Pattern Matching (CPM 1996)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1075)

Abstract

Gene structure prediction is one of the most important problems in computational molecular biology. Previous attempts to solve this problem were based on statistics and artificial intelligence and, surprisingly enough, applications of theoretical computer science methods to gene recognition were almost unexplored. Recent advances in large-scale cDNA sequencing open a way towards a new combinatorial approach to gene recognition. This paper describes a spliced alignment algorithm and a software tool which explores all possible exon assemblies in polynomial time and finds the multiexon structure with the best fit to a related protein. Unlike other existing methods, the algorithm successfully recognizes genes even in the case of short exons or exons with unusual codon usage; we also report correct assemblies for genes with more than 10 exons. On a test sample of human genes with known mammalian relatives, the average correlation between the predicted and the actual genes was 99%, which is very high accuracy compared with other existing methods. The algorithm correctly reconstructed 87% of genes, and the rare discrepancies between the predicted and real exon-intron structures were caused by (i) extremely short (less than 5 amino acids) initial or terminal exons, (ii) alternative splicing, or (iii) errors in database feature tables. Moreover, the algorithm predicts human genes reasonably well when the homologous protein is non-vertebrate or even prokaryotic. The surprisingly good performance of the method was confirmed by extensive simulations: in particular, with target proteins showing just 25% similarity, the correlation between the predicted and actual genes was still as high as 95%.
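
The idea of scoring every chain of candidate exons against a target in polynomial time can be illustrated with a small dynamic program. What follows is a minimal sketch under stated assumptions, not the authors' implementation: candidate exons are assumed to be given as half-open (start, end) intervals, nucleotides are compared directly instead of being translated and scored against a protein, and the scoring is a plain match/mismatch/gap scheme. The function name and all parameters are hypothetical.

```python
def spliced_alignment_score(genomic, target, blocks,
                            match=1, mismatch=-1, gap=-1):
    """Best global-alignment score between `target` and the concatenation
    of any chain of non-overlapping candidate blocks of `genomic`.

    `blocks` is a list of half-open (start, end) intervals (assumed input).
    """
    m = len(target)
    # Row for the empty chain: each target prefix aligned against gaps only.
    empty_row = [gap * j for j in range(m + 1)]
    finished = []            # (end, row) for every block already processed
    best = float("-inf")

    # Sorting by end guarantees that every possible predecessor of a block
    # has been processed before the block itself.
    for start, end in sorted(blocks, key=lambda b: b[1]):
        # Initial row: element-wise best over the empty chain and over all
        # chains ending in a block that finishes before this one starts.
        row = list(empty_row)
        for prev_end, prev_row in finished:
            if prev_end <= start:
                row = [max(a, b) for a, b in zip(row, prev_row)]

        # Standard global-alignment recurrence, one block character at a time.
        for ch in genomic[start:end]:
            new = [row[0] + gap]
            for j in range(1, m + 1):
                sub = match if ch == target[j - 1] else mismatch
                new.append(max(row[j - 1] + sub,   # align ch with target[j-1]
                               row[j] + gap,       # ch against a gap
                               new[j - 1] + gap))  # target[j-1] against a gap
            row = new

        finished.append((end, row))
        best = max(best, row[m])
    return best


genomic = "ACGTTTTTGCA"
blocks = [(0, 3), (3, 8), (8, 11)]   # "ACG", "TTTTT", "GCA"
print(spliced_alignment_score(genomic, "ACGGCA", blocks))  # 6
```

In this toy input the chain "ACG" + "GCA" reproduces the target exactly, so the intervening block is skipped. The sketch takes time proportional to (number of blocks squared + total block length) times the target length, which is polynomial even though the number of possible chains can be exponential.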

The research was supported by DOE grant DE-FG02-95ER61919, Russian Fund of Fundamental Research grant 94-04-12330, grant MTW300 from ISF, and the Russian State Program “Human Genome”.

The research was supported by DOE grant DE-FG02-95ER61919 and the Russian State Program “Human Genome”.

The research was supported by DOE grant DE-FG02-95ER61919 and by NSF Young Investigator award CCR-9457784.

Editor information

Dan Hirschberg, Gene Myers

Copyright information

© 1996 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gelfand, M.S., Mironov, A.A., Pevzner, P.A. (1996). Spliced alignment: A new approach to gene recognition. In: Hirschberg, D., Myers, G. (eds) Combinatorial Pattern Matching. CPM 1996. Lecture Notes in Computer Science, vol 1075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61258-0_12

  • DOI: https://doi.org/10.1007/3-540-61258-0_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-61258-2

  • Online ISBN: 978-3-540-68390-2

  • eBook Packages: Springer Book Archive
