Abstract
This paper presents a methodology for using varying sample sizes in batch-type optimization methods for large-scale machine learning problems. The first part of the paper deals with the delicate issue of dynamic sample selection in the evaluation of the function and gradient. We propose a criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient. We establish an \({O(1/\epsilon)}\) complexity bound on the total cost of a gradient method. The second part of the paper describes a practical Newton method that uses a smaller sample to compute Hessian vector-products than to evaluate the function and the gradient, and that also employs a dynamic sampling technique. The focus of the paper shifts in the third part of the paper to L 1-regularized problems designed to produce sparse solutions. We propose a Newton-like method that consists of two phases: a (minimalistic) gradient projection phase that identifies zero variables, and subspace phase that applies a subsampled Hessian Newton iteration in the free variables. Numerical tests on speech recognition problems illustrate the performance of the algorithms.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Agarwal, A., Duchi, J.: Distributed delayed stochastic optimization. Arxiv preprint arXiv:1104.5525 (2011)
Andrew, G., Gao, J.: Scalable training of l 1-regularized log-linear models. In: Proceedings of the 24th International Conference on Machine Learning, pp. 33–40. ACM (2007)
Bastin F., Cirillo C., Toint P.L.: An adaptive monte carlo algorithm for computing mixed logit estimators. Comput. Manag. Sci. 3(1), 55–79 (2006)
Beck A., Teboulle M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Bertsekas D.P.: On the Goldstein-Levitin-Poljak gradient projection method. IEEE Trans. Autom. Control AC-21, 174–184 (1976)
Bottou L., Bousquet O.: The tradeoffs of large scale learning. In: Platt, J., Koller, D., Singer, Y., Roweis, S. (eds) Advances in Neural Information Processing Systems, vol. 20, pp. 161–168. MIT Press, Cambridge, MA (2008)
Byrd, R., Chin, G.M., Neveitt, W., Nocedal, J.: On the use of stochastic Hessian information in unconstrained optimization. SIAM J. Optim. 21(3), 977–995 (2011)
Conn A.R., Gould N.I.M., Toint P.L.: A globally convergent augmented Lagrangian algorithm for optimization with general constraints and simple bounds. SIAM J. Numer. Anal. 28(2), 545–572 (1991)
Dai Y., Fletcher R.: Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numerische Mathematik 100(1), 21–47 (2005)
Dekel, O., Gilad-Bachrach, R., Shamir, O., Xiao, L.: Optimal distributed online prediction using mini-batches. Arxiv preprint arXiv:1012.1367 (2010)
Deng G., Ferris M.C.: Variable-number sample-path optimization. Math. Program. 117(1–2), 81–109 (2009)
Donoho D.: De-noising by soft-thresholding. Inf. Theory IEEE Trans. 41(3), 613–627 (1995)
Duchi, J., Shalev-Shwartz, S., Singer, Y., Tewari, A.: Composite objective mirror descent. In: Proceedings of the Twenty Third Annual Conference on Computational Learning Theory. Citeseer (2010)
Duchi J., Singer Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2899–2934 (2009)
Figueiredo M., Nowak R., Wright S.: Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE J. Sel. Top. Signal Process. 1(4), 586–597 (2007)
Freund J.E.: Mathematical Statistics. Prentice Hall, Englewood Cliffs, NJ (1962)
Friedlander, M., Schmidt, M.: Hybrid deterministic-stochastic methods for data fitting. Arxiv preprint arXiv:1104.2373 (2011)
Hager W.W., Zhang H.: A new active set algorithm for box constrained optimization. SIOPT 17(2), 526–557 (2007)
Homem-de-Mello T.: Variable-sample methods for stochastic optimization. ACM Trans. Model. Comput. Simul. 13(2), 108–133 (2003)
Kleywegt A.J., Shapiro A., Homem-de-Mello T.: The sample average approximation method for stochastic discrete optimization. SIAM J. Optim. 12(2), 479–502 (2001)
Lin C., Moré J. et al.: Newton’s method for large bound-constrained optimization problems. SIAM J. Optim. 9(4), 1100–1127 (1999)
Martens, J.: Deep learning via Hessian-free optimization. In: Proceedings of the 27th International Conference on Machine Learning (ICML) (2010)
Nesterov Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009). doi:10.1007/s10107-007-0149-x
Niu, F., Recht, B., Ré, C., Wright, S.: Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. Arxiv preprint arXiv:1106.5730 (2011)
Polyak B., Juditsky A.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30, 838 (1992)
Polyak B.T.: The conjugate gradient method in extremal problems. USSR Comput. Math. Math. Phys. 9, 94–112 (1969)
Robbins H., Monro S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Shapiro A., Homem-de-Mello T.: A simulation-based approach to two-stage stochastic programming with recourse. Math. Program. 81, 301–325 (1998)
Shapiro A., Homem-de-Mello T.: On the rate of convergence of optimal solutions of monte carlo approximations of stochastic programs. SIAM J. Optim. 11(1), 70–86 (2000)
Shapiro A., Wardi Y.: Convergence of stochastic algorithms. Math. Oper. Res. 21(3), 615–628 (1996)
Vishwanathan, S., Schraudolph, N., Schmidt, M., Murphy, K.: Accelerated training of conditional random fields with stochastic gradient methods. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 969–976. ACM (2006)
Wright, S.: Accelerated block-coordinate relaxation for regularized optimization. Tech. rep., Computer Science Department, University of Wisconsin (2010)
Wright S., Nowak R., Figueiredo M.: Sparse reconstruction by separable approximation. Signal Process. IEEE Trans. 57(7), 2479–2493 (2009)
Xiao L.: Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 9999, 2543–2596 (2010)
Author information
Authors and Affiliations
Corresponding author
Additional information
This work was supported by National Science Foundation grant CMMI 0728190 and grant DMS-0810213, by Department of Energy grant DE-FG02-87ER25047-A004 and grant DE-SC0001774, and by an NSERC fellowship and a Google grant.
Rights and permissions
About this article
Cite this article
Byrd, R.H., Chin, G.M., Nocedal, J. et al. Sample size selection in optimization methods for machine learning. Math. Program. 134, 127–155 (2012). https://doi.org/10.1007/s10107-012-0572-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10107-012-0572-5