Sample size selection in optimization methods for machine learning

  • Full Length Paper
  • Series B
  • Published in: Mathematical Programming

Abstract

This paper presents a methodology for using varying sample sizes in batch-type optimization methods for large-scale machine learning problems. The first part of the paper deals with the delicate issue of dynamic sample selection in the evaluation of the function and gradient. We propose a criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient. We establish an \({O(1/\epsilon)}\) complexity bound on the total cost of a gradient method. The second part of the paper describes a practical Newton method that uses a smaller sample to compute Hessian-vector products than to evaluate the function and the gradient, and that also employs a dynamic sampling technique. The third part of the paper focuses on \({\ell_1}\)-regularized problems designed to produce sparse solutions. We propose a Newton-like method that consists of two phases: a (minimalistic) gradient projection phase that identifies zero variables, and a subspace phase that applies a subsampled Hessian Newton iteration in the free variables. Numerical tests on speech recognition problems illustrate the performance of the algorithms.
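
As a concrete illustration of the first two ideas in the abstract, the sketch below shows (i) a variance-based test for deciding whether the current sample is large enough during a batch gradient computation and (ii) a Hessian-vector product evaluated on a smaller subsample, of the kind a Newton-CG iteration could use. This is a minimal sketch under stated assumptions, not the paper's algorithm: the logistic-loss objective, the constant `theta`, the suggested-sample-size rule, and all function names are illustrative.

```python
# Minimal sketch (not the paper's exact algorithm) of a variance-based
# dynamic sampling test for a batch gradient method, plus a Hessian-vector
# product computed on a smaller subsample. All names and constants here are
# illustrative assumptions.

import numpy as np

def sample_gradients(w, X, y, idx):
    """Per-example gradients of the logistic loss on the subsample `idx`.
    Labels y are assumed to be in {-1, +1}. Returns shape (len(idx), d)."""
    Xs, ys = X[idx], y[idx]
    p = 1.0 / (1.0 + np.exp(-ys * (Xs @ w)))      # sigmoid(y_i * x_i.w)
    return (-(1.0 - p) * ys)[:, None] * Xs        # gradient of log-loss per example

def batch_gradient_with_test(w, X, y, idx, theta=0.5):
    """Average gradient over the current sample plus a variance-based test.

    The test (an assumption modeled on the abstract's description) asks
    whether the variance of the sample gradient is small relative to the
    squared norm of that gradient; if not, a larger sample size is suggested.
    """
    G = sample_gradients(w, X, y, idx)            # (|S|, d)
    g = G.mean(axis=0)                            # sample gradient
    var = G.var(axis=0, ddof=1)                   # componentwise sample variance
    lhs = var.sum() / len(idx)                    # variance of the mean, summed
    rhs = theta**2 * np.dot(g, g)
    ok = lhs <= rhs
    # When the test fails, a rough sample size that would satisfy it
    # (guarding against a zero gradient at a stationary point):
    suggested = len(idx)
    if not ok and rhs > 0:
        suggested = int(np.ceil(var.sum() / (theta**2 * np.dot(g, g))))
    return g, ok, suggested

def hessian_vector_product(w, v, X, y, idx_H):
    """Hessian-vector product of the logistic loss on a (smaller) subsample
    idx_H, usable inside a Newton-CG iteration without forming the Hessian."""
    Xs, ys = X[idx_H], y[idx_H]
    p = 1.0 / (1.0 + np.exp(-ys * (Xs @ w)))
    d = p * (1.0 - p)                             # diagonal curvature weights
    return Xs.T @ (d * (Xs @ v)) / len(idx_H)
```

In a complete method one would rerun the test at every iteration, enlarge the gradient sample whenever the test fails, and keep the Hessian subsample at a fixed, smaller fraction of the gradient sample; those policy choices are assumptions here, not details taken from the paper.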

Author information

Corresponding author

Correspondence to Jorge Nocedal.

Additional information

This work was supported by National Science Foundation grant CMMI 0728190 and grant DMS-0810213, by Department of Energy grant DE-FG02-87ER25047-A004 and grant DE-SC0001774, and by an NSERC fellowship and a Google grant.

About this article

Cite this article

Byrd, R.H., Chin, G.M., Nocedal, J. et al. Sample size selection in optimization methods for machine learning. Math. Program. 134, 127–155 (2012). https://doi.org/10.1007/s10107-012-0572-5
