Abstract
We propose novel algorithms for solving the so-called Support Vector Multiple Kernel Learning problem and show how they can be used to understand the resulting support vector decision function. While classical kernel-based algorithms (such as SVMs) are based on a single kernel, in Multiple Kernel Learning a quadratically constrained quadratic program is solved in order to find a sparse convex combination of a set of support vector kernels. We show how this problem can be cast as a semi-infinite linear optimization problem, which can in turn be solved efficiently using a boosting-like iterative method in combination with standard SVM optimization algorithms. The proposed method is able to deal with thousands of examples while combining hundreds of kernels within a reasonable time.
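To make the semi-infinite formulation concrete, one common way to write it is sketched below. The notation is an assumption made for illustration and is not taken from the abstract: K kernels k_1, ..., k_K with weights β_k, N training examples (x_i, y_i), SVM dual variables α_i and regularization constant C.

```latex
\begin{aligned}
\max_{\theta \in \mathbb{R},\ \beta \in \mathbb{R}^{K}} \quad & \theta \\
\text{s.t.} \quad & \beta_k \ge 0, \qquad \sum_{k=1}^{K} \beta_k = 1, \\
& \sum_{k=1}^{K} \beta_k\, S_k(\alpha) \ge \theta
  \quad \text{for all } \alpha \text{ with } 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{N} \alpha_i y_i = 0, \\
\text{where} \quad & S_k(\alpha) = \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j\, k_k(x_i, x_j) - \sum_{i=1}^{N} \alpha_i .
\end{aligned}
```

In this form there are finitely many variables (θ and β) but one linear constraint for every feasible α, hence "semi-infinite". An iterative scheme can alternate between solving the linear program over the constraints collected so far and training a standard SVM with the current combined kernel; the resulting α identifies the most violated constraint, which is then added to the program.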
In the second part, we show how this technique can be used to understand the obtained decision function in order to extract biologically relevant knowledge about the sequence analysis problem at hand. We consider the problem of splice site identification and combine string kernels at different sequence positions and with various substring (oligomer) lengths. The proposed algorithm computes a sparse weighting over the length and the substring, highlighting which substrings are important for discrimination. Finally, we propose a bootstrap scheme in order to reliably identify a few statistically significant positions, which can then be used for further analysis such as consensus finding.
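The idea of combining string kernels over positions and oligomer lengths can be illustrated with a minimal sketch. It assumes that each sub-kernel is a simple exact-match indicator for one (position, length) pair; this is a simplification chosen for illustration, not necessarily the kernel used in the paper. The function names and the weights are hypothetical, and the weights are set by hand rather than learned.

```python
def oligo_match_kernel(x, y, pos, length):
    """Return 1.0 if x and y carry the same oligomer of `length` at `pos`, else 0.0."""
    if pos + length > min(len(x), len(y)):
        return 0.0  # oligomer would run past the end of a sequence
    return 1.0 if x[pos:pos + length] == y[pos:pos + length] else 0.0


def normalise(raw):
    """Scale non-negative raw weights to sum to one and drop zero entries (sparsity)."""
    if any(v < 0 for v in raw.values()):
        raise ValueError("weights must be non-negative")
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {key: v / total for key, v in raw.items() if v > 0}


def combined_kernel(x, y, weights):
    """Convex combination of sub-kernels; `weights` maps (pos, length) -> beta."""
    return sum(beta * oligo_match_kernel(x, y, pos, length)
               for (pos, length), beta in weights.items())


# Hypothetical hand-picked weighting: only two (position, length) pairs stay active.
weights = normalise({(2, 2): 3.0, (0, 1): 1.0, (4, 3): 0.0})

a = "ACAGGT"
b = "TCAGGA"
print(combined_kernel(a, b, weights))  # only the dimer at position 2 matches
print(sorted(weights, key=weights.get, reverse=True))  # pairs ranked by weight
```

Because the weights are non-negative and sum to one, reading them off directly shows which positions and oligomer lengths contribute to the combined kernel, which is the sense in which a sparse weighting is interpretable.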
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sonnenburg, S., Rätsch, G., Schäfer, C. (2005). Learning Interpretable SVMs for Biological Sequence Classification. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds) Research in Computational Molecular Biology. RECOMB 2005. Lecture Notes in Computer Science, vol 3500. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11415770_30
DOI: https://doi.org/10.1007/11415770_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25866-7
Online ISBN: 978-3-540-31950-4
eBook Packages: Computer Science (R0)