Abstract
The paper observes a similarity between the stochastic optimal control of discrete dynamical systems and the training of multilayer neural networks. It focuses on contemporary deep networks with nonconvex nonsmooth loss and activation functions. Machine learning problems are treated as nonconvex nonsmooth stochastic optimization problems, and the so-called generalized-differentiable functions are used as a model of nonsmooth nonconvex dependencies. The backpropagation method for calculating stochastic generalized gradients of the learning quality functional for such systems is substantiated on the basis of the Hamilton–Pontryagin formalism. Stochastic generalized gradient learning algorithms are extended to the training of nonconvex nonsmooth neural networks. The performance of a stochastic generalized gradient algorithm is illustrated on a linear multiclass classification problem.
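For illustration only (a minimal sketch under stated assumptions, not the author's implementation), the following Python fragment runs a stochastic generalized gradient iteration on the linear multiclass classification problem mentioned in the abstract. With the nonsmooth multiclass hinge loss used here, the generalized gradient reduces to an ordinary subgradient, and the diminishing step size rho_k = 1/(k+1) is a classical Robbins–Monro choice; all names and parameter values below are illustrative assumptions.

```python
# Sketch: stochastic generalized gradient training of a linear multiclass
# classifier. For the multiclass hinge loss below, a generalized gradient
# is an ordinary subgradient; all names and settings are illustrative.
import numpy as np

def hinge_subgradient(W, x, y):
    """A subgradient in W of max_{c != y} (1 + <w_c, x> - <w_y, x>)_+ ."""
    scores = W @ x
    margins = 1.0 + scores - scores[y]
    margins[y] = 0.0                      # exclude the true class
    c = int(np.argmax(margins))           # most violating competitor
    G = np.zeros_like(W)
    if margins[c] > 0.0:                  # loss active: nonzero subgradient
        G[c] += x
        G[y] -= x
    return G

def sgg_train(X, y, n_classes, steps=10_000, seed=0):
    """Iterate W_{k+1} = W_k - rho_k * g_k with rho_k = 1/(k+1),
    where g_k is a stochastic subgradient at one random sample."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_classes, X.shape[1]))
    for k in range(steps):
        i = rng.integers(len(X))          # draw one training example
        W -= hinge_subgradient(W, X[i], y[i]) / (k + 1)
    return W
```

Diminishing, non-summable step sizes of this type are what convergence theory for stochastic generalized gradient methods typically requires; the nonconvex nonsmooth case treated in the paper replaces the subgradient with a stochastic generalized gradient computed by backpropagation.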
Additional information
Translated from Kibernetyka ta Systemnyi Analiz, No. 5, September–October, 2021, pp. 54–71.
The work was partially supported by grant CPEA-LT-2016/10003 funded by the Norwegian Agency for International Cooperation and Quality Enhancement in Higher Education (Diku) and grant 2020.02/0121 of the National Research Foundation of Ukraine.
Cite this article
Norkin, V.I. Stochastic Generalized Gradient Methods for Training Nonconvex Nonsmooth Neural Networks. Cybern Syst Anal 57, 714–729 (2021). https://doi.org/10.1007/s10559-021-00397-z