Learning Interpretable SVMs for Biological Sequence Classification

  • Conference paper
Research in Computational Molecular Biology (RECOMB 2005)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3500)

Abstract

We propose novel algorithms for solving the so-called Support Vector Multiple Kernel Learning problem and show how they can be used to understand the resulting support vector decision function. While classical kernel-based algorithms (such as SVMs) are based on a single kernel, in Multiple Kernel Learning a quadratically constrained quadratic program is solved in order to find a sparse convex combination of a set of support vector kernels. We show how this problem can be cast into a semi-infinite linear optimization problem, which can in turn be solved efficiently using a boosting-like iterative method in combination with standard SVM optimization algorithms. The proposed method is able to deal with thousands of examples while combining hundreds of kernels within reasonable time.
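As a rough illustration of what a "convex combination of a set of support vector kernels" means, the sketch below mixes precomputed kernel matrices with non-negative weights that sum to one. It does not implement the optimization described in the abstract; the function name `combine_kernels`, the nested-list matrix representation, and the tolerance are assumptions made only for this example.

```python
def combine_kernels(kernels, beta, tol=1e-9):
    """Return the matrix sum_k beta[k] * kernels[k].

    kernels: list of square kernel matrices (nested lists of equal size).
    beta:    list of mixing weights; must be non-negative and sum to 1,
             i.e. a convex combination. A sparse beta has many zeros.
    Hypothetical helper for illustration, not the paper's algorithm.
    """
    if len(kernels) != len(beta):
        raise ValueError("need exactly one weight per kernel")
    if any(b < 0 for b in beta) or abs(sum(beta) - 1.0) > tol:
        raise ValueError("beta must be non-negative and sum to 1")
    n = len(kernels[0])
    return [
        [sum(b * K[i][j] for b, K in zip(beta, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Two toy 2x2 kernel matrices (assumed values, not from the paper).
K_identity = [[1.0, 0.0], [0.0, 1.0]]
K_ones = [[1.0, 1.0], [1.0, 1.0]]

combined = combine_kernels([K_identity, K_ones], [0.25, 0.75])
```

A weight of zero removes a kernel from the combination entirely, which is why a sparse weight vector can be read as a statement about which kernels matter for the decision function.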

In the second part we show how this technique can be used to understand the obtained decision function in order to extract biologically relevant knowledge about the sequence analysis problem at hand. We consider the problem of splice site identification and combine string kernels at different sequence positions and with various substring (oligomer) lengths. The proposed algorithm computes a sparse weighting over the length and the substring, highlighting which substrings are important for discrimination. Finally, we propose a bootstrap scheme in order to reliably identify a few statistically significant positions, which can then be used for further analysis such as consensus finding.
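To make the idea of combining string kernels "at different sequence positions and with various substring (oligomer) lengths" concrete, here is a minimal sketch that defines one very simple sub-kernel per (position, length) pair and sums them with given weights. The exact kernels used in the paper are not stated in the abstract, so the match-based sub-kernel, the function names, and the example sequences and weights are all assumptions for illustration.

```python
def match_kernel(x, z, pos, d):
    """1.0 if x and z contain the same length-d substring at pos, else 0.0.

    Assumed toy sub-kernel; returns 0.0 if the window does not fit.
    """
    if pos < 0 or pos + d > min(len(x), len(z)):
        return 0.0
    return 1.0 if x[pos:pos + d] == z[pos:pos + d] else 0.0


def sub_kernel_values(x, z, max_d):
    """Map every (position, length) pair with length 1..max_d to its value."""
    n = min(len(x), len(z))
    return {
        (pos, d): match_kernel(x, z, pos, d)
        for d in range(1, max_d + 1)
        for pos in range(n - d + 1)
    }


def weighted_kernel(x, z, weights):
    """Weighted sum of sub-kernels; weights maps (position, length) -> weight."""
    return sum(w * match_kernel(x, z, pos, d) for (pos, d), w in weights.items())


# Toy DNA sequences that differ only at index 3 (assumed data).
x, z = "ACGTAG", "ACGAAG"
values = sub_kernel_values(x, z, max_d=2)

# Hypothetical sparse weighting: only three (position, length) pairs are active.
weights = {(0, 3): 0.5, (2, 2): 0.3, (4, 2): 0.2}
k_xz = weighted_kernel(x, z, weights)
```

In a setting like this, the (position, length) pairs that keep a non-zero weight point to the sequence windows that contribute to the decision function, which is the kind of interpretation the abstract describes.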

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sonnenburg, S., Rätsch, G., Schäfer, C. (2005). Learning Interpretable SVMs for Biological Sequence Classification. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds) Research in Computational Molecular Biology. RECOMB 2005. Lecture Notes in Computer Science, vol 3500. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11415770_30

  • DOI: https://doi.org/10.1007/11415770_30

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25866-7

  • Online ISBN: 978-3-540-31950-4

  • eBook Packages: Computer Science, Computer Science (R0)
