Pegasos: primal estimated sub-gradient solver for SVM

Abstract

We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy \({\epsilon}\) is \({\tilde{O}(1 / \epsilon)}\), where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require \({\Omega(1 / \epsilon^2)}\) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is \({\tilde{O}(d/(\lambda \epsilon))}\), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.

This is a preview of subscription content, log in to check access.

References

  1. 1

    Amari S.: Natural gradient works efficiently in learning. Neural Comput. 10, 251–276 (1998)

    Article  Google Scholar 

  2. 2

    Bordes A., Ertekin S., Weston J., Bottou L.: Fast kernel classifiers with online and active learning. J. Mach. Learn. Res. 6, 1579–1619 (2005)

    MathSciNet  Google Scholar 

  3. 3

    Bottou L.: Online Algorithms and Stochastic Approximations. In: Saad, D. (eds) Online learning and neural networks, Cambridge University Press, Cambridge (1998)

    Google Scholar 

  4. 4

    Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Advances in Neural Information Processing Systems 20, pp. 161–168 (2008)

  5. 5

    Bottou L., LeCun Y.: Large scale online learning. In: Thrun, S., Saul, L., Schölkopf, B. (eds) Advances in Neural Information Processing Systems 16, MIT Press, Cambridge (2004)

    Google Scholar 

  6. 6

    Bottou L., Murata N.: Stochastic approximations and efficient learning. In: Arbib, M.A. (eds) The Handbook of Brain Theory and Neural Networks, The MIT Press, Cambridge (2002)

    Google Scholar 

  7. 7

    Boyd S., Vandenberghe L.: Convex Optimization, 2nd edn. Cambridge University Press, Cambridge (2004)

    Google Scholar 

  8. 8

    Censor Y., Zenios S.: Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, New York (1997)

    Google Scholar 

  9. 9

    Cesa-Bianchi N., Conconi A., Gentile C.: On the generalization ability of on-line learning algorithms. IEEE Trans. Inf, Theory 50(9), 2050–2057 (2004)

    Article  MathSciNet  Google Scholar 

  10. 10

    Chapelle, O.: Training a support vector machine in the primal. Neural Comput. 19(5), 1155–1178 (2007). doi:10.1162/neco.2007.19.5.1155. http://www.mitpressjournals.org/doi/abs/10.1162/neco.2007.19.5.1155

  11. 11

    Crammer K., Dekel O., Keshet J., Shalev-Shwartz S., Singer Y.: Online passive aggressive algorithms. J. Mach. Learn. Res. 7, 551–585 (2006)

    MathSciNet  Google Scholar 

  12. 12

    Cristianini N., Shawe-Taylor J.: An Introduction to Support Vector Machines. Cambridge University Press, Cambridge (2000)

    Google Scholar 

  13. 13

    Do, C., Le, Q., Foo, C.: Proximal regularization for online and batch learning. In: Proceedings of the 26th International Conference on Machine Learning (2009)

  14. 14

    Duda R.O., Hart P.E.: Pattern Classification and Scene Analysis. Wiley, New York (1973)

    Google Scholar 

  15. 15

    Fine S., Scheinberg K.: Efficient SVM training using low-rank kernel representations. J. Mach. Lear. Res. 2, 242–264 (2001)

    Google Scholar 

  16. 16

    Freund Y., Schapire R.E.: Large margin classification using the perceptron algorithm. Mach. Learn. 37(3), 277–296 (1999)

    MATH  Article  Google Scholar 

  17. 17

    Hazan, E., Kalai, A., Kale, S., Agarwal, A.: Logarithmic regret algorithms for online convex optimization. In: Proceedings of the Nineteenth Annual Conference on Computational Learning Theory (2006)

  18. 18

    Hsieh, C., Chang, K., Lin, C., Keerthi, S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: ICML, pp. 408–415 (2008)

  19. 19

    Hush, D., Kelly, P., Scovel, C., Steinwart, I.: Qp algorithms with guaranteed accuracy and run time for support vector machines. J. Mach. Learn. Res. (2006)

  20. 20

    Joachims T.: Making large-scale support vector machine learning practical. In: Schölkopf, B., Burges, C., Smola, A. (eds) Advances in Kernel Methods—Support Vector Learning., MIT Press, Cambridge (1998)

    Google Scholar 

  21. 21

    Joachims, T.: Training linear SVMs in linear time. In: Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), pp. 216–226 (2006)

  22. 22

    Kakade, S., Tewari, A.: On the generalization ability of online strongly convex programming algorithms. In: Advances in Neural Information Processing Systems 22 (2009)

  23. 23

    Kimeldorf G., Wahba G.: Some results on tchebycheffian spline functions. J. Math. Anal. Appl. 33, 82–95 (1971)

    MATH  Article  MathSciNet  Google Scholar 

  24. 24

    Kivinen J., Smola A.J., Williamson R.C.: Online learning with kernels. IEEE Trans. Signal Process. 52(8), 2165–2176 (2002)

    Article  MathSciNet  Google Scholar 

  25. 25

    Kushner H., Yin G.: Stochastic Approximation Algorithms and Applications. Springer, New York (1997)

    Google Scholar 

  26. 26

    Murata N.: A statistical study of on-line learning. In: Saad, D. (eds) Online Learning and Neural Networks, Cambridge University Press, Cambridge (1998)

    Google Scholar 

  27. 27

    Murata N., Amari S.: Statistical analysis of learning dynamics. Signal Process. 74(1), 3–28 (1999)

    MATH  Article  Google Scholar 

  28. 28

    Nesterov, Y.: Primal-dual subgradient methods for convex problems. Tech. rep., Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL) (2005)

  29. 29

    Platt J.C.: Fast training of Support Vector Machines using sequential minimal optimization. In: Schölkopf, B., Burges, C., Smola, A. (eds) Advances in Kernel Methods—Support Vector Learning, MIT Press, Cambridge (1998)

    Google Scholar 

  30. 30

    Rockafellar R.: Convex Analysis. Princeton University Press, Princeton (1970)

    Google Scholar 

  31. 31

    Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In: Proceedings of the 24th International Conference on Machine Learning, pp. 807–814 (2007)

  32. 32

    Shalev-Shwartz, S., Srebro, N.: SVM optimization: inverse dependence on training set size. In: Proceedings of the 25th International Conference on Machine Learning, pp. 928–935 (2008)

  33. 33

    Smola, A., Vishwanathan, S., Le, Q.: Bundle methods for machine learning. In: Advances in Neural Information Processing Systems 21 (2007)

  34. 34

    Spall J.C.: Introduction to Stochastic Search and Optimization. Wiley, New York (2003)

    Google Scholar 

  35. 35

    Sridharan, K., Srebro, N., Shalev-Shwartz, S.: Fast rates for regularized objectives. In: Advances in Neural Information Processing Systems 22 (2009)

  36. 36

    Vapnik V.N.: Statistical Learning Theory. Wiley, New York (1998)

    Google Scholar 

  37. 37

    Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning (2004)

Download references

Author information

Affiliations

Authors

Corresponding author

Correspondence to Nathan Srebro.

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Shalev-Shwartz, S., Singer, Y., Srebro, N. et al. Pegasos: primal estimated sub-gradient solver for SVM. Math. Program. 127, 3–30 (2011). https://doi.org/10.1007/s10107-010-0420-4

Download citation

Keywords

  • SVM
  • Stochastic gradient descent

Mathematics Subject Classification (2000)

  • First
  • Second
  • More