Mathematical Programming, Volume 127, Issue 1, pp 3–30

Pegasos: primal estimated sub-gradient solver for SVM

  • Shai Shalev-Shwartz
  • Yoram Singer
  • Nathan Srebro (corresponding author)
  • Andrew Cotter
Full Length Paper (Series B)


We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy \({\epsilon}\) is \({\tilde{O}(1 / \epsilon)}\), where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require \({\Omega(1 / \epsilon^2)}\) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is \({\tilde{O}(d/(\lambda \epsilon))}\), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.
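The abstract describes a stochastic sub-gradient method: at each iteration a single random training example is drawn, and the weight vector is updated with a step size that decays as 1/(λt). As a rough illustration only (the paper's own pseudocode and analysis should be consulted for the authoritative version, including the mini-batch and projection variants), a single-example sketch of this kind of update for the regularized hinge-loss objective might look like the following; the function name and parameters are illustrative, not taken from the paper.

```python
import numpy as np

def pegasos_sketch(X, y, lam=0.01, T=1000, seed=0):
    """Sketch of a Pegasos-style stochastic sub-gradient solver for the
    SVM primal objective  (lam/2)||w||^2 + (1/m) sum_i hinge(y_i <w, x_i>).

    X: (m, d) feature matrix; y: (m,) labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)          # each iteration touches one example
        eta = 1.0 / (lam * t)        # decaying step size 1/(lambda * t)
        if y[i] * X[i].dot(w) < 1:   # hinge loss active: sub-gradient has a data term
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                        # only the regularization term contributes
            w = (1 - eta * lam) * w
        # optional projection onto the ball of radius 1/sqrt(lam),
        # which the analysis in the paper permits
        norm = np.linalg.norm(w)
        radius = 1.0 / np.sqrt(lam)
        if norm > radius:
            w *= radius / norm
    return w
```

Because each step costs O(d) for an example with at most d non-zero features, T iterations cost O(dT) regardless of the training-set size m, which is the property the abstract highlights.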


Keywords: SVM · Stochastic gradient descent






Copyright information

© Springer and Mathematical Optimization Society 2010

Authors and Affiliations

  • Shai Shalev-Shwartz (1)
  • Yoram Singer (2)
  • Nathan Srebro (3), corresponding author
  • Andrew Cotter (3)

  1. School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
  2. Google, Mountain View, USA
  3. Toyota Technological Institute at Chicago, Chicago, USA
