
Stochastic Generalized Gradient Methods for Training Nonconvex Nonsmooth Neural Networks

Published in: Cybernetics and Systems Analysis

Abstract

The paper observes a similarity between the stochastic optimal control of discrete dynamical systems and the training of multilayer neural networks. It focuses on contemporary deep networks with nonconvex nonsmooth loss and activation functions. Machine learning problems are treated as nonconvex nonsmooth stochastic optimization problems. As a model of nonsmooth nonconvex dependencies, the so-called generalized-differentiable functions are used. The backpropagation method for calculating stochastic generalized gradients of the learning quality functional for such systems is substantiated on the basis of the Hamilton–Pontryagin formalism. Stochastic generalized gradient learning algorithms are extended to the training of nonconvex nonsmooth neural networks. The performance of a stochastic generalized gradient algorithm is illustrated on a linear multiclass classification problem.
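For intuition, the sketch below shows what one iteration of a stochastic generalized (sub)gradient method looks like on the paper's illustrative task, linear multiclass classification with a nonsmooth loss. This is a minimal assumed example in Python/NumPy, using the multiclass hinge loss as a stand-in nonsmooth objective; the function names, the loss, and the step-size schedule are illustrative choices, not the paper's exact algorithm or experiment.

    # A minimal sketch (assumed, not the paper's exact method): stochastic
    # generalized (sub)gradient descent for a linear multiclass classifier
    # with the nonsmooth multiclass hinge loss.
    import numpy as np

    def subgradient_step(W, x, y, lr):
        """One stochastic generalized gradient step on example (x, y).

        W: (n_classes, n_features) weight matrix; x: feature vector; y: true class.
        The loss max_j (1[j != y] + w_j.x - w_y.x) is convex but nonsmooth,
        so a subgradient is used in place of the gradient.
        """
        scores = W @ x
        margins = scores - scores[y] + 1.0
        margins[y] = 0.0                  # the true class contributes zero margin
        j = np.argmax(margins)            # most violated (active) class
        if margins[j] > 0.0:              # subgradient is nonzero only if loss > 0
            W[j] -= lr * x                # push the violating class away
            W[y] += lr * x                # pull the true class closer
        return W

    def train(X, Y, n_classes, epochs=20, lr0=0.5):
        rng = np.random.default_rng(0)
        W = np.zeros((n_classes, X.shape[1]))
        t = 0
        for _ in range(epochs):
            for i in rng.permutation(len(X)):
                t += 1
                # diminishing steps of Robbins-Monro type
                W = subgradient_step(W, X[i], Y[i], lr0 / np.sqrt(t))
        return W

With a diminishing step size of the Robbins–Monro type, iterations of this kind converge for convex losses; the paper's contribution is to justify analogous stochastic generalized gradient iterations for nonconvex generalized-differentiable losses and activations.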



Author information


Correspondence to V. I. Norkin.

Additional information

Translated from Kibernetyka ta Systemnyi Analiz, No. 5, September–October, 2021, pp. 54–71.

The work was partially supported by grant CPEA-LT-2016/10003 funded by the Norwegian Agency for International Cooperation and Quality Enhancement in Higher Education (Diku) and grant 2020.02/0121 of the National Research Foundation of Ukraine.


Cite this article

Norkin, V.I. Stochastic Generalized Gradient Methods for Training Nonconvex Nonsmooth Neural Networks. Cybern Syst Anal 57, 714–729 (2021). https://doi.org/10.1007/s10559-021-00397-z

