Mathematical Programming

Volume 134, Issue 1, pp 127–155

Sample size selection in optimization methods for machine learning

  • Richard H. Byrd
  • Gillian M. Chin
  • Jorge Nocedal
  • Yuchen Wu
Full Length Paper, Series B

Abstract

This paper presents a methodology for using varying sample sizes in batch-type optimization methods for large-scale machine learning problems. The first part of the paper deals with the delicate issue of dynamic sample selection in the evaluation of the function and gradient. We propose a criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient, and we establish an \({O(1/\epsilon)}\) complexity bound on the total cost of a gradient method. The second part of the paper describes a practical Newton method that uses a smaller sample to compute Hessian-vector products than to evaluate the function and the gradient, and that also employs a dynamic sampling technique. In the third part, the focus shifts to L1-regularized problems designed to produce sparse solutions. We propose a Newton-like method that consists of two phases: a (minimalistic) gradient projection phase that identifies zero variables, and a subspace phase that applies a subsampled Hessian Newton iteration in the free variables. Numerical tests on speech recognition problems illustrate the performance of the algorithms.
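
The sampling criterion described above compares variance estimates gathered during a batch-gradient computation against the size of that gradient. The sketch below is a minimal Python illustration of such a variance test, not the paper's exact rule: the function names, the parameter `theta`, and the formula for the enlarged sample size are assumptions introduced for this example.

```python
import numpy as np

def sample_needs_growing(per_example_grads, theta=0.5):
    """Illustrative variance test for dynamic sample-size selection.

    per_example_grads: array of shape (|S|, d) holding the gradient of each
    sampled loss term at the current iterate; the batch gradient is their mean.
    The sample is judged too small when the estimated variance of that mean
    is large relative to the squared norm of the mean itself.
    """
    S = per_example_grads.shape[0]
    g = per_example_grads.mean(axis=0)                  # sample-average gradient
    var_sum = per_example_grads.var(axis=0, ddof=1).sum()
    # The variance of the sample mean is roughly var_sum / |S|; require it to
    # be at most theta^2 * ||g||^2, otherwise signal that |S| should increase.
    return var_sum / S > theta**2 * np.dot(g, g)

def enlarged_sample_size(per_example_grads, theta=0.5):
    """Return a sample size for which the current variance estimate would
    satisfy the test; keep the present size if the test already holds."""
    S = per_example_grads.shape[0]
    g = per_example_grads.mean(axis=0)
    var_sum = per_example_grads.var(axis=0, ddof=1).sum()
    if var_sum / S <= theta**2 * np.dot(g, g):
        return S
    return int(np.ceil(var_sum / (theta**2 * np.dot(g, g))))
```

In a gradient loop, one would evaluate the batch gradient on the current sample, call `sample_needs_growing`, and, when it returns True, draw additional examples up to `enlarged_sample_size` before the next iteration.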

Mathematics Subject Classification

49M15 · 49M37 · 65K05

Copyright information

© Springer and Mathematical Optimization Society 2012

Authors and Affiliations

  • Richard H. Byrd (1)
  • Gillian M. Chin (2)
  • Jorge Nocedal (2)
  • Yuchen Wu (3)
  1. Department of Computer Science, University of Colorado, Boulder, USA
  2. Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, USA
  3. Google Inc., Mountain View, USA
