
Compositional Stochastic Average Gradient for Machine Learning and Related Applications

  • Tsung-Yu Hsieh (corresponding author)
  • Yasser EL-Manzalawy
  • Yiwei Sun
  • Vasant Honavar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11314)

Abstract

Many machine learning and statistical inference problems require minimization of a composition of expected value functions (CEVF). Of particular interest are the finite-sum versions of such compositional optimization problems (FS-CEVF). Compositional stochastic variance reduced gradient (C-SVRG) methods, which combine stochastic compositional gradient descent (SCGD) and stochastic variance reduced gradient descent (SVRG), are the state-of-the-art methods for FS-CEVF problems. We introduce compositional stochastic average gradient descent (C-SAG), a novel extension of the stochastic average gradient method (SAG) to minimize compositions of finite-sum functions. C-SAG, like SAG, estimates the gradient by incorporating memory of previous gradient information. We present theoretical analyses showing that C-SAG, like C-SVRG, achieves a linear convergence rate for strongly convex objective functions; however, C-SAG achieves lower oracle query complexity per iteration than C-SVRG. Finally, we present results of experiments showing that C-SAG converges substantially faster than full gradient descent (FG) as well as C-SVRG.
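
For concreteness, the finite-sum compositional objective referred to above can be sketched as follows, following the standard FS-CEVF formulation used in the variance-reduction literature (e.g., Lian et al., Artificial Intelligence and Statistics 2017; Wang et al., Math. Program. 2017); the symbols n, m, f_i, and g_j are not defined in the abstract and are used here only for illustration:

\[
\min_{x \in \mathbb{R}^d} \; F(x) \;=\; \frac{1}{n} \sum_{i=1}^{n} f_i\!\left( \frac{1}{m} \sum_{j=1}^{m} g_j(x) \right).
\]

In the spirit of SAG, a memory-based estimator for such an objective keeps the most recently computed gradient of each component in a table and averages the stored values, refreshing at each iteration only the entries of the sampled inner and outer components; the lower per-iteration oracle query complexity claimed above relative to C-SVRG is consistent with avoiding the periodic full-gradient anchor computations that SVRG-style methods require.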

Keywords

Machine learning · Stochastic gradient descent · Compositional finite-sum optimization · Convex optimization

Notes

Acknowledgement

This project was supported in part by the National Center for Advancing Translational Sciences, National Institutes of Health, through the grants UL1 TR000127 and TR002014; by the National Science Foundation through the grants 1518732, 1640834, and 1636795; by the Pennsylvania State University's Institute for Cyberscience and the Center for Big Data Analytics and Discovery Informatics; by the Edward Frymoyer Endowed Professorship in Information Sciences and Technology at Pennsylvania State University; and by the Sudha Murty Distinguished Visiting Chair in Neurocomputing and Data Science funded by the Pratiksha Trust at the Indian Institute of Science [both held by Vasant Honavar]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the sponsors.

References

  1. Amari, S.I.: Backpropagation and stochastic gradient descent method. Neurocomputing 5(4–5), 185–196 (1993)
  2. Bishop, C.: Pattern Recognition and Machine Learning. Springer, New York (2006). https://doi.org/10.1007/978-1-4615-7566-5
  3. Bottou, L.: Stochastic gradient learning in neural networks. Proc. Neuro-Nîmes 91(8), 12 (1991)
  4. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Lechevallier, Y., Saporta, G. (eds.) Proceedings of COMPSTAT, pp. 177–186. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-7908-2604-3_16
  5. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
  6. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
  7. Cauwenberghs, G.: A fast stochastic error-descent algorithm for supervised learning and optimization. In: Advances in Neural Information Processing Systems, pp. 244–251 (1993)
  8. Dai, B., He, N., Pan, Y., Boots, B., Song, L.: Learning from conditional distributions via dual embeddings. arXiv preprint arXiv:1607.04579 (2016)
  9. Darken, C., Moody, J.: Fast adaptive K-means clustering: some empirical results. In: International Joint Conference on Neural Networks, pp. 233–238. IEEE (1990)
  10. Dentcheva, D., Penev, S., Ruszczyński, A.: Statistical estimation of composite risk functionals and risk optimization problems. Ann. Inst. Stat. Math. 69(4), 737–760 (2017)
  11. Ermoliev, Y.: Stochastic quasigradient methods. In: Ermoliev, Y., Wets, R.J.-B. (eds.) Numerical Techniques for Stochastic Optimization, no. 10. Springer, Heidelberg (1988)
  12. Fagan, F., Iyengar, G.: Unbiased scalable softmax optimization. arXiv preprint arXiv:1803.08577 (2018)
  13. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, vol. 1. Springer, New York (2001). https://doi.org/10.1007/978-0-387-21606-5
  14. Hu, J., Zhou, E., Fan, Q.: Model-based annealing random search with stochastic averaging. ACM Trans. Model. Comput. Simul. 24(4), 21 (2014)
  15. Huo, Z., Gu, B., Huang, H.: Accelerated method for stochastic composition optimization with nonsmooth regularization. arXiv preprint arXiv:1711.03937 (2017)
  16. Jain, P., Kakade, S.M., Kidambi, R., Netrapalli, P., Sidford, A.: Accelerating stochastic gradient descent for least squares regression. In: Conference on Learning Theory, pp. 545–604 (2018)
  17. Jin, C., Kakade, S.M., Netrapalli, P.: Provable efficient online matrix completion via non-convex stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 4520–4528 (2016)
  18. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
  19. Kiefer, J., Wolfowitz, J.: Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23(3), 462–466 (1952)
  20. Le, Q.V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., Ng, A.Y.: On optimization methods for deep learning. In: Proceedings of the 28th International Conference on Machine Learning, pp. 265–272. Omnipress (2011)
  21. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
  22. LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient BackProp. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 9–48. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_3
  23. Lian, X., Wang, M., Liu, J.: Finite-sum composition optimization via variance reduced gradient descent. In: Artificial Intelligence and Statistics, pp. 1159–1167 (2017)
  24. Lin, T., Fan, C., Wang, M., Jordan, M.I.: Improved oracle complexity for stochastic compositional variance reduced gradient. arXiv preprint arXiv:1806.00458 (2018)
  25. Liu, L., Liu, J., Tao, D.: Variance reduced methods for non-convex composition optimization. arXiv preprint arXiv:1711.04416 (2017)
  26. Mandt, S., Hoffman, M.D., Blei, D.M.: Stochastic gradient descent as approximate Bayesian inference. J. Mach. Learn. Res. 18(1), 4873–4907 (2017)
  27. Needell, D., Ward, R., Srebro, N.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In: Advances in Neural Information Processing Systems, pp. 1017–1025 (2014)
  28. Ravikumar, P., Lafferty, J., Liu, H., Wasserman, L.: Sparse additive models. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 71(5), 1009–1030 (2009)
  29. Robbins, H., Monro, S.: A stochastic approximation method. In: Lai, T.L., Siegmund, D. (eds.) Herbert Robbins Selected Papers, pp. 102–109. Springer, New York (1985)
  30. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)
  31. Shamir, O.: Convergence of stochastic gradient descent for PCA. In: International Conference on Machine Learning, pp. 257–265 (2016)
  32. Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia (2009)
  33. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
  34. Tan, C., Ma, S., Dai, Y.H., Qian, Y.: Barzilai-Borwein step size for stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 685–693 (2016)
  35. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.), 267–288 (1996)
  36. Wang, L., Yang, Y., Min, R., Chakradhar, S.: Accelerating deep neural network training with inconsistent stochastic gradient descent. Neural Netw. 93, 219–229 (2017)
  37. Wang, M., Fang, E.X., Liu, H.: Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Math. Program. 161(1–2), 419–449 (2017)
  38. Wang, M., Liu, J., Fang, E.: Accelerating stochastic composition optimization. In: Advances in Neural Information Processing Systems, pp. 1714–1722 (2016)
  39. Yu, Y., Huang, L.: Fast stochastic variance reduced ADMM for stochastic composition optimization. arXiv preprint arXiv:1705.04138 (2017)
  40. Yuan, M., Lin, Y.: Model selection and estimation in the Gaussian graphical model. Biometrika 94(1), 19–35 (2007)
  41. Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the 21st International Conference on Machine Learning, p. 116. ACM (2004)
  42. Zhao, S.Y., Li, W.J.: Fast asynchronous parallel stochastic gradient descent: a lock-free approach with convergence guarantee. In: AAAI, pp. 2379–2385 (2016)
  43. Zinkevich, M., Weimer, M., Li, L., Smola, A.J.: Parallelized stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 2595–2603 (2010)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Tsung-Yu Hsieh (1, 2) (corresponding author)
  • Yasser EL-Manzalawy (1, 3)
  • Yiwei Sun (1, 2)
  • Vasant Honavar (1, 2, 3)
  1. Artificial Intelligence Research Laboratory, The Pennsylvania State University, University Park, USA
  2. Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA
  3. College of Information Science and Technology, The Pennsylvania State University, University Park, USA
