Machine Learning, Volume 54, Issue 1, pp 5–32

Benchmarking Least Squares Support Vector Machine Classifiers

  • Tony van Gestel
  • Johan A.K. Suykens
  • Bart Baesens
  • Stijn Viaene
  • Jan Vanthienen
  • Guido Dedene
  • Bart de Moor
  • Joos Vandewalle


In Support Vector Machines (SVMs), the solution of the classification problem is characterized by a (convex) quadratic programming (QP) problem. In a modified version of SVMs, called Least Squares SVM classifiers (LS-SVMs), a least squares cost function is proposed so as to obtain a linear set of equations in the dual space. While the SVM classifier has a large margin interpretation, the LS-SVM formulation is related in this paper to a ridge regression approach for classification with binary targets and to Fisher's linear discriminant analysis in the feature space. Multiclass categorization problems are represented by a set of binary classifiers using different output coding schemes. While regularization is used to control the effective number of parameters of the LS-SVM classifier, the sparseness property of SVMs is lost due to the choice of the 2-norm. Sparseness can be imposed in a second stage by gradually pruning the support value spectrum and optimizing the hyperparameters during the sparse approximation procedure. In this paper, twenty public domain benchmark datasets are used to evaluate the test set performance of LS-SVM classifiers with linear, polynomial and radial basis function (RBF) kernels. Both the SVM and LS-SVM classifier with RBF kernel in combination with standard cross-validation procedures for hyperparameter selection achieve comparable test set performances. These SVM and LS-SVM performances are consistently very good when compared to a variety of methods described in the literature including decision tree based algorithms, statistical algorithms and instance based learning methods. We show on ten UCI datasets that the LS-SVM sparse approximation procedure can be successfully applied.
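The "linear set of equations in the dual space" mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard LS-SVM classifier formulation with an RBF kernel, a regularization constant `gamma`, and a kernel width `sigma`. The function names, the toy data, and the hyperparameter values are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def lssvm_train(X, y, gamma, sigma):
    """Solve the (n+1) x (n+1) linear system
         [ 0   y^T             ] [ b     ]   [ 0 ]
         [ y   Omega + I/gamma ] [ alpha ] = [ 1 ]
       with Omega[i, j] = y_i * y_j * K(x_i, x_j)."""
    n = X.shape[0]
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]  # support values alpha, bias b

def lssvm_predict(X_train, y_train, alpha, b, X_new, sigma):
    """Decision rule sign(sum_i alpha_i * y_i * K(x, x_i) + b)."""
    K = rbf_kernel(X_new, X_train, sigma)
    return np.sign(K @ (alpha * y_train) + b)

# Hypothetical toy data: two small, well-separated clusters with labels -1 / +1.
X = np.array([[-2.0, -2.0], [-2.0, -1.0], [-1.0, -2.0],
              [2.0, 2.0], [2.0, 1.0], [1.0, 2.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

alpha, b = lssvm_train(X, y, gamma=10.0, sigma=1.0)
pred = lssvm_predict(X, y, alpha, b, X, sigma=1.0)
print("support values:", alpha)
print("training predictions:", pred)
```

In this sketch every training point receives a nonzero support value, which reflects the loss of sparseness noted in the abstract. A pruning stage of the kind the abstract describes would repeatedly discard the points with the smallest absolute support values and retrain; that stage is not shown here.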

least squares support vector machines, multiclass support vector machines, sparse approximation



Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Tony van Gestel (1)
  • Johan A.K. Suykens (1)
  • Bart Baesens (2)
  • Stijn Viaene (2)
  • Jan Vanthienen (2)
  • Guido Dedene (2)
  • Bart de Moor (3)
  • Joos Vandewalle (3)

  1. Department of Electrical Engineering, ESAT/SISTA, Katholieke Universiteit Leuven, Belgium
  2. Leuven Institute for Research on Information Systems, Katholieke Universiteit Leuven, Belgium
  3. Department of Electrical Engineering, ESAT/SISTA, Katholieke Universiteit Leuven, Belgium
