New Methods for Splice Site Recognition

  • Sören Sonnenburg
  • Gunnar Rätsch
  • Arun Jagota
  • Klaus-Robert Müller
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2415)


Splice sites are locations in DNA that separate protein-coding regions (exons) from noncoding regions (introns). Accurate splice site detectors thus form important components of computational gene finders. We pose splice site recognition as a classification problem, with the classifier learnt from a labeled data set consisting of only local information around the potential splice site. Note that finding the correct position of splice sites without using global information is a rather hard task. We analyze the genomes of the nematode Caenorhabditis elegans and of humans using specially designed support vector kernels. One of the kernels is adapted from our previous work on detecting translation initiation sites in vertebrates, and another uses an extension of the well-known Fisher kernel. We find excellent performance on both data sets.
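
To make the "local information only" setup concrete, the following is a minimal sketch of how candidate sites might be turned into fixed-width windows and compared. It is an illustration under stated assumptions, not the kernels used in the paper: the function names `candidate_windows` and `match_kernel`, the `AG` consensus motif, and the window sizes are all hypothetical choices, and the position-wise match count is a deliberately simple stand-in for the specially designed kernels mentioned above.

```python
from typing import List, Tuple


def candidate_windows(seq: str, motif: str = "AG",
                      left: int = 5, right: int = 5) -> List[Tuple[int, str]]:
    """Return (position, window) pairs for every occurrence of `motif`
    that has `left` bases upstream and `right` bases downstream.

    The motif and window sizes are illustrative assumptions.
    """
    seq = seq.upper()
    windows = []
    start = seq.find(motif)
    while start != -1:
        lo, hi = start - left, start + len(motif) + right
        # Keep only candidates whose full window fits inside the sequence.
        if lo >= 0 and hi <= len(seq):
            windows.append((start, seq[lo:hi]))
        start = seq.find(motif, start + 1)
    return windows


def match_kernel(x: str, y: str) -> int:
    """Count positions at which two equal-length windows agree.

    This equals the dot product of the one-hot encodings of x and y,
    so it is a valid (if very simple) kernel. It is NOT the kernel
    from the paper.
    """
    if len(x) != len(y):
        raise ValueError("windows must have equal length")
    return sum(a == b for a, b in zip(x, y))
```

A Gram matrix built from `match_kernel` over labeled windows could then be passed to any support vector machine implementation that accepts precomputed kernels; each window carries only the bases around the candidate site, which is what makes the task hard without global context.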


Keywords: Splice Site · Acceptor Site · Nematode Caenorhabditis Elegans · Fisher Kernel · Biological Prior Knowledge
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Sören Sonnenburg (1)
  • Gunnar Rätsch (2)
  • Arun Jagota (3)
  • Klaus-Robert Müller (1, 4)

  1. Fraunhofer FIRST, Berlin, Germany
  2. Australian National University, Canberra, Australia
  3. University of California at Santa Cruz, USA
  4. University of Potsdam, Potsdam, Germany
