Machine Learning, Volume 61, Issue 1–3, pp. 129–150

Combined SVM-Based Feature Selection and Classification

  • Julia Neumann
  • Christoph Schnörr
  • Gabriele Steidl

Abstract

Feature selection is an important combinatorial optimisation problem in the context of supervised pattern classification. This paper presents four novel continuous feature selection approaches directly minimising the classifier performance. In particular, we include linear and nonlinear Support Vector Machine classifiers. The key ideas of our approaches are additional regularisation and embedded nonlinear feature selection. To solve our optimisation problems, we apply difference of convex functions programming, which is a general framework for non-convex continuous optimisation. Experiments with artificial data and with various real-world problems, including organ classification in computed tomography scans, demonstrate that our methods accomplish the desired feature selection and classification performance simultaneously.
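As a minimal numerical sketch of the difference-of-convex (DC) programming idea the abstract refers to (illustrative only; the paper applies the framework to SVM objectives, not to this toy function), one can minimise a non-convex function f = g - h with g, h convex by repeatedly linearising h at the current iterate and solving the resulting convex subproblem:

```python
# Toy DCA (difference of convex functions algorithm) illustration.
# Minimise the non-convex f(x) = x**4 - 4*x**2 via the decomposition
# f = g - h with g(x) = x**4 and h(x) = 4*x**2, both convex.
# Each DCA step linearises h at the current iterate x_k and solves
#   x_{k+1} = argmin_x  g(x) - h'(x_k) * x,
# which here has the closed form x_{k+1} = (2 * x_k) ** (1/3)
# (set the subproblem derivative 4*x**3 - 8*x_k to zero).

def f(x):
    return x**4 - 4 * x**2

def dca(x0, iters=60):
    # Assumes x0 > 0 so the cube root stays real-valued.
    x = x0
    for _ in range(iters):
        x = (2 * x) ** (1.0 / 3.0)  # closed-form convex subproblem
    return x

x_star = dca(1.0)
print(x_star, f(x_star))  # converges to sqrt(2), where f attains -4
```

Starting from x0 = 1, the iterates converge to the local minimiser sqrt(2) ≈ 1.414 with f = -4; the same linearise-and-solve scheme underlies the paper's treatment of its non-convex feature selection objectives.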

Keywords

feature selection, SVMs, embedded methods, mathematical programming, difference of convex functions programming, non-convex optimisation

References

  1. Bach, F., Lanckriet, G., & Jordan, M. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st international conference on machine learning. New York, NY: ACM Press.
  2. Ben-Tal, A., & Zibulevsky, M. (1997). Penalty/barrier multiplier methods for convex programming problems. SIAM Journal on Optimization, 7:2, 347–366.
  3. Bennett, K. P., & Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23–34.
  4. Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases.
  5. Bradley, P. S. (1998). Mathematical programming approaches to machine learning and data mining. Ph.D. thesis, University of Wisconsin, Computer Sciences Dept., Madison, WI, TR-98-11.
  6. Bradley, P. S., & Mangasarian, O. L. (1998). Feature selection via concave minimization and support vector machines. In J. Shavlik (Ed.), Proceedings of the 15th international conference on machine learning (pp. 82–90). San Francisco, CA: Morgan Kaufmann.
  7. Chapelle, O., Haffner, P., & Vapnik, V. N. (1999). SVMs for histogram-based image classification. IEEE Transactions on Neural Networks, 10:5, 1055–1064.
  8. Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. (2002). On kernel-target alignment. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems 14 (pp. 367–373). Cambridge, MA: MIT Press.
  9. Duda, R., Hart, P., & Stork, D. (2000). Pattern classification (2nd ed.). New York, NY: John Wiley & Sons.
  10. Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
  11. Haasdonk, B., & Bahlmann, C. (2004). Learning with distance substitution kernels. In C. E. Rasmussen, H. H. Bülthoff, M. A. Giese, & B. Schölkopf (Eds.), Pattern recognition, proc. of 26th DAGM symposium, Vol. 3175 of LNCS (pp. 220–227). Berlin: Springer.
  12. Heiler, M., Cremers, D., & Schnörr, C. (2001). Efficient feature subset selection for support vector machines. Technical Report TR-01-021, Comp. science series, Dept. of Mathematics and Computer Science, University of Mannheim.
  13. Hermes, L., & Buhmann, J. M. (2000). Feature selection for support vector machines. In Proc. of the international conference on pattern recognition (ICPR'00), Vol. 2 (pp. 716–719).
  14. ILOG, Inc. (2001). ILOG CPLEX 7.5.
  15. Jakubik, O. J. (2003). Feature selection with concave minimization. Master's thesis, Dept. of Mathematics and Computer Science, University of Mannheim.
  16. Jebara, T., & Jaakkola, T. (2000). Feature selection and dualities in maximum entropy discrimination. In I. Bratko & S. Dzeroski (Eds.), Proceedings of the 16th international conference on machine learning (pp. 291–300). San Francisco, CA: Morgan Kaufmann.
  17. John, G. H., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In R. S. Michalski & G. Tecuci (Eds.), Proc. of the 11th international conference on machine learning (pp. 121–129). San Francisco, CA: Morgan Kaufmann.
  18. Mangasarian, O. L. (1997). Minimum-support solutions of polyhedral concave programs. Technical Report TR-1997-05, Mathematical Programming, University of Wisconsin.
  19. MathWorks. (2002). Optimization toolbox user's guide. The MathWorks, Inc.
  20. Neumann, J., Schnörr, C., & Steidl, G. (2004). SVM-based feature selection by direct objective minimisation. In C. E. Rasmussen, H. H. Bülthoff, M. A. Giese, & B. Schölkopf (Eds.), Pattern recognition, proc. of 26th DAGM symposium, Vol. 3175 of LNCS (pp. 212–219). Berlin: Springer.
  21. Pham Dinh, T., & Elbernoussi, S. (1988). Duality in d.c. (difference of convex functions) optimization. Subgradient methods. In Trends in mathematical optimization, Vol. 84 of Int. Series of Numer. Math. (pp. 277–293). Basel: Birkhäuser Verlag.
  22. Pham Dinh, T., & Hoai An, L. T. (1998). A D.C. optimization algorithm for solving the trust-region subproblem. SIAM Journal on Optimization, 8:2, 476–505.
  23. Rockafellar, R. T. (1970). Convex analysis. Princeton, NJ: Princeton University Press.
  24. Schmidt, S. (2004). Context-sensitive image labeling based on logistic regression. Master's thesis, Dept. of Mathematics and Computer Science, University of Mannheim.
  25. Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
  26. Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:1, 267–288.
  27. Weston, J., Elisseeff, A., Schölkopf, B., & Tipping, M. (2003). Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3, 1439–1461.
  28. Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2001). Feature selection for SVMs. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems 13 (pp. 668–674). Cambridge, MA: MIT Press.
  29. Yuille, A., & Rangarajan, A. (2003). The concave-convex procedure. Neural Computation, 15, 915–936.
  30. Zhu, J., Rosset, S., Hastie, T., & Tibshirani, R. (2004). 1-norm support vector machines. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems 16. Cambridge, MA: MIT Press.

Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  • Julia Neumann (1)
  • Christoph Schnörr (1)
  • Gabriele Steidl (1)

  1. Department of Mathematics and Computer Science, University of Mannheim, Mannheim, Germany
