
Mathematical Programming, Volume 174, Issue 1–2, pp 253–292

On variance reduction for stochastic smooth convex optimization with multiplicative noise

  • Alejandro Jofré
  • Philip Thompson
Full Length Paper Series B

Abstract

We propose stochastic approximation (SA) methods with dynamic sampling for stochastic optimization under a heavy-tailed distribution (with finite second moment). The objective is the sum of a smooth convex function and a convex regularizer. Typically, an oracle with a uniform upper bound \(\sigma^2\) on its variance (OUBV) is assumed. In contrast, we assume an oracle with multiplicative noise. This rarely addressed setting is more challenging but also more realistic, since the variance may not be uniformly bounded. Our methods achieve optimal iteration complexity and (near) optimal oracle complexity. For the smooth convex class, we use an accelerated SA method à la FISTA which, given a tolerance \(\varepsilon>0\), achieves the optimal iteration complexity of \(\mathscr{O}(\varepsilon^{-\frac{1}{2}})\) with a near-optimal oracle complexity of \(\mathscr{O}(\varepsilon^{-2}[\ln(\varepsilon^{-\frac{1}{2}})]^2)\). This improves upon Ghadimi and Lan (Math Program 156:59–99, 2016), where an OUBV is assumed. For the strongly convex class, our method achieves the optimal iteration complexity of \(\mathscr{O}(\ln(\varepsilon^{-1}))\) and the optimal oracle complexity of \(\mathscr{O}(\varepsilon^{-1})\). This improves upon Byrd et al. (Math Program 134:127–155, 2012), where an OUBV is assumed. In terms of variance, our bounds are local: they depend on the variance \(\sigma(x^*)^2\) at the solutions \(x^*\) and on the multiplicative variance per unit distance \(\sigma_L^2\). For the smooth convex class, there exist sampling policies such that our bounds resemble, up to absolute constants, those obtained in the papers above if an OUBV with \(\sigma^2:=\sigma(x^*)^2\) were assumed. For the strongly convex class, this property holds exactly if the condition number is estimated, and in the limit for better conditioned problems or for larger initial batch sizes. In either case, if an OUBV is assumed, our bounds are sharper since typically \(\max\{\sigma(x^*)^2,\sigma_L^2\}\ll\sigma^2\).
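
To make the algorithmic idea concrete, below is a minimal sketch, in Python, of a FISTA-type accelerated proximal stochastic approximation iteration combined with dynamic sampling: at iteration \(k\) the stochastic gradient is averaged over a mini-batch whose size grows with \(k\), which is the variance-reduction mechanism described above. This is not the paper's exact method; the names dynamic_sampled_fista, grad_oracle and prox, and the cubic batch-size policy, are illustrative assumptions, and the paper's precise sampling policies, step sizes and constants differ.

```python
import numpy as np


def dynamic_sampled_fista(grad_oracle, prox, x0, L, n_iters, batch_size):
    """FISTA-type accelerated proximal SA with growing mini-batches.

    Illustrative sketch only (not the paper's exact method).
    grad_oracle(y, m): average of m i.i.d. stochastic gradients at y.
    prox(y, step):     proximal operator of the regularizer at y.
    L:                 Lipschitz constant of the smooth part's gradient.
    batch_size(k):     sampling policy; larger batches damp the variance
                       of the gradient estimator along the iterations.
    """
    x_prev = np.array(x0, dtype=float)
    y = x_prev.copy()
    t_prev = 1.0
    for k in range(1, n_iters + 1):
        g = grad_oracle(y, batch_size(k))            # mini-batch gradient
        x = prox(y - g / L, 1.0 / L)                 # proximal gradient step
        t = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2))
        y = x + ((t_prev - 1.0) / t) * (x - x_prev)  # Nesterov extrapolation
        x_prev, t_prev = x, t
    return x_prev


# Toy usage: l1-regularized stochastic least squares (soft-thresholding prox).
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))
b = A @ rng.normal(size=20) + rng.normal(size=1000)
lam = 0.1

def grad_oracle(y, m):
    idx = rng.integers(0, A.shape[0], size=m)        # sample m data points
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ y - bi) / m                  # unbiased gradient estimate

def prox(y, step):
    return np.sign(y) * np.maximum(np.abs(y) - lam * step, 0.0)

L = np.linalg.norm(A, 2) ** 2 / A.shape[0]           # smoothness constant
x_hat = dynamic_sampled_fista(grad_oracle, prox, np.zeros(20), L,
                              n_iters=50, batch_size=lambda k: k ** 3)
```

As a rough consistency check (not the paper's analysis): with a cubically growing batch size, the total number of oracle calls over \(n\) iterations is of order \(n^4\), so the \(\mathscr{O}(\varepsilon^{-\frac{1}{2}})\) iteration count quoted above yields an oracle complexity of the order \(\mathscr{O}(\varepsilon^{-2})\), matching the bound in the abstract up to logarithmic factors.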

Keywords

Stochastic approximation · Smooth convex optimization · Composite optimization · Multiplicative noise · Acceleration · Dynamic sampling · Variance reduction · Complexity

Mathematics Subject Classification

65K05 · 62L20 · 90C25 · 90C15 · 68Q25

Notes

Acknowledgements

The authors thank the referees for their comments, which helped improve the presentation of the paper.

References

  1. Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods (2016). Preprint at arXiv:1603.05953
  2. Agarwal, A., Bartlett, P., Ravikumar, P., Wainwright, M.J.: Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory 58(5), 3235–3249 (2012)
  3. Atchadé, Y.F., Fort, G., Moulines, E.: On perturbed proximal gradient algorithms. J. Mach. Learn. Res. 18, 1–33 (2017)
  4. Bach, F.: Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res. 15, 595–627 (2014)
  5. Bach, F., Moulines, E.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems (NIPS) (2011)
  6. Balamurugan, P., Bach, F.: Stochastic variance reduction methods for saddle-point problems. In: Advances in Neural Information Processing Systems (NIPS) (2016)
  7. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
  8. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning (2016). Preprint at arXiv:1606.04838
  9. Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization methods for machine learning. Math. Program. Ser. B 134(1), 127–155 (2012)
  10. Chung, K.L.: On a stochastic approximation method. Ann. Math. Stat. 25, 463–483 (1954)
  11. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 1646–1654. Curran Associates, Inc., Red Hook (2014)
  12. Dieuleveut, A., Flammarion, N., Bach, F.: Harder, better, faster, stronger convergence rates for least-squares regression (2016). Preprint at arXiv:1602.05419
  13. Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2899–2934 (2009)
  14. Dvoretzky, A.: On stochastic approximation. In: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 39–55. University of California Press (1956)
  15. Flammarion, N., Bach, F.: Stochastic composite least-squares regression with convergence rate \(O(1/n)\) (2017). Preprint at arXiv:1702.06429
  16. Friedlander, M., Schmidt, M.: Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput. 34(3), 1380–1405 (2012)
  17. Frostig, R., Ge, R., Kakade, S.M., Sidford, A.: Competing with the empirical risk minimizer in a single pass. In: COLT 2015 Proceedings (2015)
  18. Fu, M.C. (ed.): Handbook of Simulation Optimization. Springer, New York (2015)
  19. Gadat, S., Panloup, F.: Optimal non-asymptotic bound of the Ruppert–Polyak averaging without strong convexity (2017). Preprint at arXiv:1709.03342
  20. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework. SIAM J. Optim. 22(4), 1469–1492 (2012)
  21. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM J. Optim. 23(4), 2061–2089 (2013)
  22. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. Ser. A 156(1), 59–99 (2016)
  23. Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.: Convergence rate of incremental aggregated gradient algorithms. SIAM J. Optim. Preprint at arXiv:1506.02081
  24. Hazan, E., Kale, S.: Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. J. Mach. Learn. Res. 15, 2489–2512 (2014)
  25. Hu, C., Kwok, J.T., Pan, W.: Accelerated gradient methods for stochastic optimization and online learning. In: Advances in Neural Information Processing Systems (NIPS) (2009)
  26. Iusem, A., Jofré, A., Thompson, P.: Incremental constraint projection methods for monotone stochastic variational inequalities. Math. Oper. Res. Preprint at arXiv:1703.00272
  27. Iusem, A., Jofré, A., Oliveira, R.I., Thompson, P.: Extragradient methods with variance reduction for stochastic variational inequalities. SIAM J. Optim. 27(2), 686–724 (2017)
  28. Iusem, A., Jofré, A., Oliveira, R.I., Thompson, P.: Variance-based stochastic extragradient methods with line search for stochastic variational inequalities (2016). Preprint at arXiv:1703.00262
  29. Jain, P., Kakade, S.M., Kidambi, R., Netrapalli, P., Sidford, A.: Parallelizing stochastic approximation through mini-batching and tail-averaging (2016). Preprint at arXiv:1610.03774
  30. Jain, P., Kakade, S.M., Kidambi, R., Netrapalli, P., Sidford, A.: Accelerating stochastic gradient descent (2017). Preprint at arXiv:1704.08227
  31. Jiang, H., Xu, H.: Stochastic approximation approaches to the stochastic variational inequality problem. IEEE Trans. Autom. Control 53(6), 1462–1475 (2008)
  32. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems (NIPS) (2013)
  33. Juditsky, A.B., Nazin, A.V., Tsybakov, A.B., Vayatis, N.: Recursive aggregation of estimators via the mirror descent algorithm with averaging. Probl. Inf. Transm. 41(4), 368–384 (2005)
  34. Juditsky, A., Nemirovski, A., Tauvel, C.: Solving variational inequalities with stochastic mirror-prox algorithm. Stoch. Syst. 1(1), 17–58 (2011)
  35. Juditsky, A., Rigollet, P., Tsybakov, A.B.: Learning by mirror averaging. Ann. Stat. 36(5), 2183–2206 (2008)
  36. Lan, G.: An optimal method for stochastic composite optimization. Math. Program. Ser. A 133(1), 365–397 (2012)
  37. Le Roux, N., Schmidt, M., Bach, F.R.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in Neural Information Processing Systems 25 (NIPS) (2012)
  38. Lee, S., Wright, S.: Manifold identification in dual averaging for regularized stochastic online learning. J. Mach. Learn. Res. 13, 1705–1744 (2012)
  39. Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. In: Advances in Neural Information Processing Systems (NIPS) (2015)
  40. Lin, Q., Chen, X., Peña, J.: A sparsity preserving stochastic gradient methods for sparse regression. Comput. Optim. Appl. 58(2), 455–482 (2014)
  41. Needell, D., Srebro, N., Ward, R.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Math. Program. Ser. A 155(1), 549–573 (2016)
  42. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
  43. Nemirovskii, A., Yudin, D.: On Cezari's convergence of the steepest descent method for approximating saddle point of convex-concave functions (in Russian). Doklady Akademii Nauk SSSR 239(5) (1978). English translation: Soviet Math. Dokl. 19(2) (1978)
  44. Nemirovski, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, New York (1983)
  45. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Soviet Math. Doklady 27, 372–376 (1983)
  46. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Cambridge (2004)
  47. Nesterov, Y.: Gradient methods for minimizing composite objective function. Math. Program. Ser. B 140(1), 125–161 (2013)
  48. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. Ser. B 120(1), 221–259 (2009)
  49. Nesterov, Y.V.: Confidence level solutions for stochastic programming. Automatica 44(6), 1559–1568 (2008)
  50. Polyak, B.T.: New method of stochastic approximation type. Autom. Remote Control 51, 937–946 (1991)
  51. Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30, 838–855 (1992)
  52. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
  53. Rosasco, L., Villa, S., Vũ, B.C.: Convergence of a stochastic proximal gradient algorithm (2014). Preprint at arXiv:1403.5074
  54. Ruppert, D.: Efficient estimations from a slowly convergent Robbins–Monro process. Technical report, Cornell University Operations Research and Industrial Engineering (1988). Preprint at https://ecommons.cornell.edu/handle/1813/8664
  55. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. Ser. A 162(1), 83–112 (2017)
  56. Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory. MOS-SIAM Series on Optimization. SIAM, Philadelphia (2009)
  57. Sra, S., Nowozin, S., Wright, S.J. (eds.): Optimization for Machine Learning. The MIT Press, Cambridge, MA (2012)
  58. Tseng, P.: On Accelerated Proximal Gradient Methods for Convex–Concave Optimization. University of Washington, Seattle (2008)
  59. Xiao, L.: Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res. 9, 2543–2596 (2010)
  60. Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)
  61. Woodworth, B.E., Srebro, N.: Tight complexity bounds for optimizing composite objectives. In: Advances in Neural Information Processing Systems 29 (NIPS) (2016)
  62. Zhang, L., Yang, T., Jin, R., He, X.: \(O(\log T)\) projections for stochastic optimization of smooth and strongly convex functions. In: Proceedings of the International Conference on Machine Learning (ICML), vol. 28 (2013)

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature and Mathematical Optimization Society 2018

Authors and Affiliations

  1. Center for Mathematical Modeling (CMM) and DIM, Universidad de Chile, Santiago, Chile
  2. Center for Mathematical Modeling (CMM), Universidad de Chile, Santiago, Chile
