Abstract
The paper observes a similarity between the stochastic optimal control of discrete dynamical systems and the training of multilayer neural networks. It focuses on contemporary deep networks with nonconvex nonsmooth loss and activation functions. Machine learning problems are treated as nonconvex nonsmooth stochastic optimization problems, and the so-called generalized-differentiable functions are used as a model of nonsmooth nonconvex dependencies. The backpropagation method for calculating stochastic generalized gradients of the learning quality functional for such systems is substantiated on the basis of the Hamilton–Pontryagin formalism. Stochastic generalized gradient learning algorithms are extended to the training of nonconvex nonsmooth neural networks. The performance of a stochastic generalized gradient algorithm is illustrated on a linear multiclass classification problem.
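For illustration only (a minimal sketch under stated assumptions, not the author's implementation), the following Python fragment runs a stochastic generalized gradient iteration on the linear multiclass classification problem mentioned in the abstract. With the nonsmooth multiclass hinge loss used here, the generalized gradient reduces to an ordinary subgradient, and the diminishing step size rho_k = 1/(k+1) is a classical Robbins–Monro choice; all names and parameter values below are illustrative assumptions.

```python
# Sketch: stochastic generalized gradient training of a linear multiclass
# classifier. For the multiclass hinge loss below, a generalized gradient
# is an ordinary subgradient; all names and settings are illustrative.
import numpy as np

def hinge_subgradient(W, x, y):
    """A subgradient in W of max_{c != y} (1 + <w_c, x> - <w_y, x>)_+ ."""
    scores = W @ x
    margins = 1.0 + scores - scores[y]
    margins[y] = 0.0                      # exclude the true class
    c = int(np.argmax(margins))           # most violating competitor
    G = np.zeros_like(W)
    if margins[c] > 0.0:                  # loss active: nonzero subgradient
        G[c] += x
        G[y] -= x
    return G

def sgg_train(X, y, n_classes, steps=10_000, seed=0):
    """Iterate W_{k+1} = W_k - rho_k * g_k with rho_k = 1/(k+1),
    where g_k is a stochastic subgradient at one random sample."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_classes, X.shape[1]))
    for k in range(steps):
        i = rng.integers(len(X))          # draw one training example
        W -= hinge_subgradient(W, X[i], y[i]) / (k + 1)
    return W
```

Diminishing, non-summable step sizes of this type are what convergence theory for stochastic generalized gradient methods typically requires; the nonconvex nonsmooth case treated in the paper replaces the subgradient with a stochastic generalized gradient computed by backpropagation.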
Additional information
Translated from Kibernetyka ta Systemnyi Analiz, No. 5, September–October, 2021, pp. 54–71.
The work was partially supported by grant CPEA-LT-2016/10003 funded by the Norwegian Agency for International Cooperation and Quality Enhancement in Higher Education (Diku) and grant 2020.02/0121 of the National Research Foundation of Ukraine.
Cite this article
Norkin, V.I. Stochastic Generalized Gradient Methods for Training Nonconvex Nonsmooth Neural Networks. Cybern Syst Anal 57, 714–729 (2021). https://doi.org/10.1007/s10559-021-00397-z