Learning Interpretable SVMs for Biological Sequence Classification

Sonnenburg, S.; Rätsch, G.; Schäfer, C.

doi:10.1007/11415770_30

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3500)

Included in the following conference series:

Annual International Conference on Research in Computational Molecular Biology

Abstract

We propose novel algorithms for solving the so-called Support Vector Multiple Kernel Learning problem and show how they can be used to understand the resulting support vector decision function. While classical kernel-based algorithms (such as SVMs) are based on a single kernel, in Multiple Kernel Learning a quadratically constrained quadratic program is solved in order to find a sparse convex combination of a set of support vector kernels. We show how this problem can be cast as a semi-infinite linear optimization problem, which can in turn be solved efficiently using a boosting-like iterative method in combination with standard SVM optimization algorithms. The proposed method is able to deal with thousands of examples while combining hundreds of kernels within reasonable time.
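To make the phrase "sparse convex combination of kernels" concrete, the following minimal sketch combines precomputed kernel matrices with non-negative weights that sum to one. It is an illustration only, not the authors' algorithm: the function name `combine_kernels`, the toy matrices, and the weights are all assumptions, and the weights are fixed by hand here rather than learned by the semi-infinite linear program.

```python
def combine_kernels(kernel_matrices, betas, tol=1e-9):
    """Return sum_k betas[k] * K_k for a convex weight vector betas.

    kernel_matrices: list of square matrices (lists of lists) of equal size.
    betas: one non-negative weight per kernel, summing to 1.
    """
    if len(kernel_matrices) != len(betas):
        raise ValueError("need exactly one weight per kernel")
    if any(b < 0 for b in betas) or abs(sum(betas) - 1.0) > tol:
        raise ValueError("betas must be non-negative and sum to 1")
    n = len(kernel_matrices[0])
    combined = [[0.0] * n for _ in range(n)]
    for K, b in zip(kernel_matrices, betas):
        if b == 0.0:
            continue  # sparse weighting: a kernel with zero weight is skipped
        for i in range(n):
            for j in range(n):
                combined[i][j] += b * K[i][j]
    return combined


# Hypothetical toy kernels on two examples; the third one receives zero weight.
K_a = [[1.0, 0.0], [0.0, 1.0]]
K_b = [[1.0, 1.0], [1.0, 1.0]]
K_c = [[5.0, 2.0], [2.0, 5.0]]
K = combine_kernels([K_a, K_b, K_c], [0.75, 0.25, 0.0])
```

The combined matrix can be handed to any standard SVM solver that accepts a precomputed kernel, which is how a single-kernel SVM fits into a multiple-kernel setting; the kernels left with non-zero weight are the ones available for interpretation.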

In the second part we show how this technique can be used to understand the obtained decision function in order to extract biologically relevant knowledge about the sequence analysis problem at hand. We consider the problem of splice site identification and combine string kernels at different sequence positions and with various substring (oligomer) lengths. The proposed algorithm computes a sparse weighting over the length and the substring, highlighting which substrings are important for discrimination. Finally, we propose a bootstrap scheme in order to reliably identify a few statistically significant positions, which can then be used for further analysis such as consensus finding.
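As a rough illustration of combining string kernels over sequence positions and oligomer lengths, the sketch below defines one simple sub-kernel per (position, length) pair and sums them with given weights. This is a simplified stand-in, not the kernel used in the paper: the names `oligo_match_kernel`, `weighted_kernel`, and `important_positions`, the example sequences, and the hand-picked weights are all assumptions made for the example.

```python
def oligo_match_kernel(x, y, position, length):
    """1.0 if x and y carry the same oligomer of `length` at `position`, else 0.0."""
    a = x[position:position + length]
    b = y[position:position + length]
    # Require a full-length oligomer so truncated tails never count as a match.
    return 1.0 if len(a) == length and a == b else 0.0


def weighted_kernel(x, y, weights):
    """Sum of sub-kernels; `weights` maps (position, length) -> weight."""
    return sum(
        beta * oligo_match_kernel(x, y, p, d)
        for (p, d), beta in weights.items()
        if beta > 0
    )


def important_positions(weights):
    """Total weight per position, largest first, ignoring zero-weight kernels."""
    totals = {}
    for (p, _d), beta in weights.items():
        if beta > 0:
            totals[p] = totals.get(p, 0.0) + beta
    return sorted(totals.items(), key=lambda item: -item[1])


# Hypothetical sparse weighting: only two of the three sub-kernels are active.
weights = {(2, 2): 0.75, (0, 1): 0.25, (4, 3): 0.0}
value = weighted_kernel("ACGTAC", "TCGTGG", weights)
ranking = important_positions(weights)
```

In this toy case the two sequences share the dimer "GT" at position 2 but differ at position 0, so only the first sub-kernel contributes. Reading off which (position, length) pairs keep non-zero weight is the kind of inspection the abstract describes; repeating the weight learning on resampled data, as in the proposed bootstrap scheme, would indicate which of those positions are stable.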

Author information

Authors and Affiliations

Fraunhofer Institute FIRST, Kekuléstr. 7, 12489, Berlin, Germany
S. Sonnenburg & C. Schäfer
Friedrich Miescher Lab, Max Planck Society, Spemannstr. 39, Tübingen, Germany
G. Rätsch

Editor information

Editors and Affiliations

Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, 108-8639, Minato-ku, Tokyo, Japan
Satoru Miyano
Broad Institute of MIT and Harvard, 320 Charles Street, 02141-2023, Cambridge, MA, USA
Jill Mesirov
Computational Genomics Laboratory, Department of Bioengineering, Boston University, 44 Cummington St., 02215, Boston, MA, USA
Simon Kasif
Center for Molecular Biology and Computer Science Department, Brown University, 115 Waterman St., 02912, Providence, RI, USA
Sorin Istrail
University of California, San Diego, USA
Pavel A. Pevzner
Department of Molecular and Computational Biology, University of Southern California, 1050 Childs Way, 90089-2910, Los Angeles, CA, USA
Michael Waterman

About this paper

Cite this paper

Sonnenburg, S., Rätsch, G., Schäfer, C. (2005). Learning Interpretable SVMs for Biological Sequence Classification. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds) Research in Computational Molecular Biology. RECOMB 2005. Lecture Notes in Computer Science(), vol 3500. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11415770_30

DOI: https://doi.org/10.1007/11415770_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25866-7
Online ISBN: 978-3-540-31950-4
eBook Packages: Computer Science, Computer Science (R0)