Abstract
Entropy-SGD is a first-order optimization method which has been used successfully to train deep neural networks. This algorithm, which was motivated by statistical physics, is now interpreted as gradient descent on a modified loss function. The modified, or relaxed, loss function is the solution of a viscous Hamilton–Jacobi partial differential equation (PDE). Experimental results on modern, high-dimensional neural networks demonstrate that the algorithm converges faster than the benchmark stochastic gradient descent (SGD). Well-established PDE regularity results allow us to analyze the geometry of the relaxed energy landscape, confirming empirical evidence. Stochastic homogenization theory allows us to better understand the convergence of the algorithm. A stochastic control interpretation is used to prove that a modified algorithm converges faster than SGD in expectation.
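To make the algorithmic idea summarized in the abstract concrete, the following is a minimal sketch of an Entropy-SGD-style update: an inner Langevin loop samples around the current weights to estimate the gradient of the relaxed (local-entropy) loss, and an outer step moves the weights along that estimate. The toy objective, the hyper-parameters (gamma, step sizes, noise level, loop lengths) and the use of full gradients in place of minibatch gradients are illustrative assumptions, not the setup or implementation used in the paper.

```python
# Sketch of an Entropy-SGD-style update on a toy nonconvex objective.
# All hyper-parameters below are illustrative choices, not values from the paper.
import numpy as np

def f(x):
    # Toy nonconvex loss: a quadratic bowl plus an oscillatory perturbation.
    return 0.5 * np.dot(x, x) + np.sum(np.cos(3.0 * x))

def grad_f(x):
    return x - 3.0 * np.sin(3.0 * x)

def entropy_sgd(x0, outer_steps=200, inner_steps=20,
                eta=0.1, eta_inner=0.05, gamma=1.0, eps=1e-3, alpha=0.25):
    rng = np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(outer_steps):
        # Inner Langevin loop: sample x' approximately from the local Gibbs
        # measure rho(x') ~ exp(-f(x') - |x' - x|^2 / (2*gamma)) and keep a
        # running average mu of the samples.
        xp, mu = x.copy(), x.copy()
        for _ in range(inner_steps):
            noise = rng.standard_normal(x.shape)
            xp = xp - eta_inner * (grad_f(xp) + (xp - x) / gamma) \
                 + np.sqrt(eta_inner) * eps * noise
            mu = (1.0 - alpha) * mu + alpha * xp
        # Outer step: (x - mu) / gamma approximates the gradient of the
        # relaxed (local-entropy) loss at x.
        x = x - eta * (x - mu) / gamma
    return x

x_final = entropy_sgd(np.array([2.5, -1.5]))
print(x_final, f(x_final))
```

Replacing grad_f with a minibatch gradient recovers the stochastic setting; as gamma shrinks the relaxed loss approaches f itself, while larger gamma yields a smoother landscape, mirroring the role of time in the Hamilton–Jacobi PDE mentioned in the abstract.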
Notes
The empirical loss is a sample approximation of the expected loss, \(\mathbb {E}_{x \sim P} f(x)\), which cannot be computed since the data distribution P is unknown. The extent to which the empirical loss (or a regularized version thereof) approximates the expected loss relates to generalization, i.e., the value of the loss function on (“test” or “validation”) data drawn from P but not part of the training set D.
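Written out in the notation of this note (the labels \(x_i\) for the training samples are introduced here only for concreteness), the sample approximation reads
\[
\frac{1}{N}\sum_{i=1}^{N} f(x_i) \;\approx\; \mathbb{E}_{x \sim P}\, f(x),
\qquad x_1,\dots,x_N \stackrel{\text{iid}}{\sim} P,
\]
where the left-hand side is the empirical loss over the training set \(D=\{x_1,\dots,x_N\}\) and the right-hand side is the expected loss; the difference between the two, evaluated at the trained model, is the generalization gap.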
For example, the ImageNet dataset [38] has \(N = 1.25\) million RGB images of size \(224\times 224\) (i.e., \(d \approx 10^5\)) and \(K=1000\) distinct classes. A typical model, e.g., the Inception network [65] used for classification on this dataset, has about 10 million parameters and is trained by running (7) for \(k \approx 10^5\) updates; this takes roughly 100 hours with 8 graphics processing units (GPUs).
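As a rough back-of-the-envelope reading of these figures (the minibatch size \(b = 256\) is an assumption made here for illustration; it is not stated in the note):
\[
\frac{100\ \text{h}}{10^{5}\ \text{updates}} \approx 3.6\ \text{s per update},
\qquad
\frac{N}{b} = \frac{1.25\times 10^{6}}{256} \approx 4.9\times 10^{3}\ \text{updates per epoch},
\]
so \(k \approx 10^{5}\) updates corresponds to roughly 20 passes over the training set.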
References
Achille, A., Soatto, S.: Information dropout: learning optimal representations through noise (2016). arXiv:1611.01353
Bakry, D., Émery, M.: Diffusions hypercontractives. In: Séminaire de Probabilités XIX 1983/84, pp. 177–206. Springer (1985)
Baldassi, C., Borgs, C., Chayes, J., Ingrosso, A., Lucibello, C., Saglietti, L., Zecchina, R.: Unreasonable effectiveness of learning neural networks: from accessible states and robust ensembles to basic algorithmic schemes. PNAS 113(48), E7655–E7662 (2016a)
Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L., Zecchina, R.: Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Phys. Rev. Lett. 115(12), 128101 (2015)
Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L., Zecchina, R.: Local entropy as a measure for sampling solutions in constraint satisfaction problems. J. Stat. Mech. Theory Exp. 2016(2), 023301 (2016)
Baldi, P., Hornik, K.: Neural networks and principal component analysis: learning from examples without local minima. Neural Netw. 2, 53–58 (1989)
Bertsekas, D.P., Nedić, A., Ozdaglar, A.E.: Convex Analysis and Optimization. Athena Scientific, Belmont (2003)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning (2016). arXiv:1606.04838
Bray, A.J., Dean, D.S.: Statistics of critical points of Gaussian fields on large-dimensional spaces. Phys. Rev. Lett. 98(15), 150201 (2007)
Cannarsa, P., Sinestrari, C.: Semiconcave Functions, Hamilton–Jacobi Equations, and Optimal Control, vol. 58. Springer, Berlin (2004)
Carrillo, J.A., McCann, R.J., Villani, C.: Contractions in the 2-Wasserstein length space and thermalization of granular media. Arch. Ration. Mech. Anal. 179(2), 217–263 (2006)
Chaudhari, P., Baldassi, C., Zecchina, R., Soatto, S., Talwalkar, A., Oberman, A.: Parle: parallelizing stochastic gradient descent (2017). arXiv:1707.00424
Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., Zecchina, R.: Entropy-SGD: biasing gradient descent into wide valleys (2016). arXiv:1611.01838
Chaudhari, P., Soatto, S.: On the energy landscape of deep networks (2015). arXiv:1511.06485
Chaudhari, P., Soatto, S.: Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks (2017). arXiv:1710.11029
Chen, X.: Smoothing methods for nonsmooth, nonconvex minimization. Math. Program. 134(1), 71–99 (2012)
Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., LeCun, Y.: The loss surfaces of multilayer networks. In: AISTATS (2015)
Coates, A., Lee, H., Ng, A.Y.: An analysis of single-layer networks in unsupervised feature learning. In: AISTATS (2011)
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: NIPS (2014)
Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: NIPS (2014)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. JMLR 12, 2121–2159 (2011)
E, W.: Principles of Multiscale Modeling. Cambridge University Press, Cambridge (2011)
Evans, L.C.: Partial Differential Equations. Graduate Studies in Mathematics, vol. 19. American Mathematical Society, Providence (1998)
Fleming, W.H., Rishel, R.W.: Deterministic and Stochastic Optimal Control, vol. 1. Springer, Berlin (2012)
Fleming, W.H., Soner, H.M.: Controlled Markov Processes and Viscosity Solutions, vol. 25. Springer, Berlin (2006)
Fyodorov, Y., Williams, I.: Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity. J. Stat. Phys. 129(5–6), 1081–1116 (2007)
Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: training ImageNet in 1 hour (2017). arXiv:1706.02677
Gulcehre, C., Moczulski, M., Denil, M., Bengio, Y.: Noisy activation functions. In: ICML (2016)
Haeffele, B., Vidal, R.: Global optimality in tensor factorization, deep learning, and beyond (2015). arXiv:1506.07540
He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks (2016). arXiv:1603.05027
Huang, M., Malhamé, R.P., Caines, P.E., et al.: Large population stochastic dynamic games: closed-loop McKean–Vlasov systems and the Nash certainty equivalence principle. Commun. Inf. Syst. 6(3), 221–252 (2006)
Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015). arXiv:1502.03167
Jordan, R., Kinderlehrer, D., Otto, F.: The variational formulation of the Fokker–Planck equation. SIAM J. Math. Anal. 29(1), 1–17 (1998)
Kingma, D., Ba, J.: Adam: A method for stochastic optimization (2014). arXiv:1412.6980
Kingma, D.P., Salimans, T., Welling, M.: Variational dropout and the local reparameterization trick. In: NIPS (2015)
Krizhevsky, A.: Learning multiple layers of features from tiny images. Master’s Thesis, Computer Science, University of Toronto (2009)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
Lasry, J.-M., Lions, P.-L.: Mean field games. Jpn. J. Math. 2(1), 229–260 (2007)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Li, H., Xu, Z., Taylor, G., Goldstein, T.: Visualizing the loss landscape of neural nets (2017a). arXiv:1712.09913
Li, Q., Tai, C., et al.: Stochastic modified equations and adaptive stochastic gradient algorithms (2017b). arXiv:1511.06251
Marshall, A.W., Olkin, I., Arnold, B.C.: Inequalities: Theory of Majorization and Its Applications, vol. 143. Springer, Berlin (1979)
Mézard, M., Parisi, G., Virasoro, M.: Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, vol. 9. World Scientific, Singapore (1987)
Mobahi, H.: Training recurrent neural networks by diffusion (2016). arXiv:1601.04114
Moreau, J.-J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. Fr. 93, 273–299 (1965)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/k^2). Sov. Math. Dokl. 27, 372–376 (1983)
Oberman, A.M.: Convergent difference schemes for degenerate elliptic and parabolic equations: Hamilton–Jacobi equations and free boundary problems. SIAM J. Numer. Anal. 44(2), 879–895 (2006). (electronic)
Pavliotis, G.A.: Stochastic Processes and Applications. Springer, Berlin (2014)
Pavliotis, G.A., Stuart, A.: Multiscale Methods: Averaging and Homogenization. Springer, Berlin (2008)
Risken, H.: Fokker–Planck equation. In: The Fokker–Planck Equation, pp. 63–95. Springer (1984)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J Control Optim. 14(5), 877–898 (1976)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Cognit. Model. 5(3), 1 (1988)
Sagun, L., Bottou, L., LeCun, Y.: Singularity of the Hessian in deep learning (2016). arXiv:1611.07476
Santambrogio, F.: Optimal Transport for Applied Mathematicians. Birkhäuser, New York (2015)
Saxe, A., McClelland, J., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In: ICLR (2014)
Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)
Schur, I.: Über eine Klasse von Mittelbildungen mit Anwendungen auf die Determinantentheorie. Sitzungsberichte der Berliner Mathematischen Gesellschaft 22, 9–20 (1923)
Soudry, D., Carmon, Y.: No bad local minima: Data independent training error guarantees for multilayer neural networks (2016). arXiv:1605.08361
Springenberg, J., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net (2014). arXiv:1412.6806
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. JMLR 15(1), 1929–1958 (2014)
Stoltz, G., Rousset, M., et al.: Free Energy Computations: A Mathematical Perspective. World Scientific, Singapore (2010)
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR (2015)
Zhang, S., Choromanska, A., LeCun, Y.: Deep learning with elastic averaging SGD. In: NIPS (2015)
Acknowledgements
AO is supported by a grant from the Simons Foundation (395980); PC and SS by ONR N000141712072, AFOSR FA95501510229 and ARO W911NF151056466731CS; SO by ONR N000141410683, N000141210838, N000141712162 and DOE DE-SC00183838. AO would like to thank the UCLA mathematics department, where this work was completed, for its hospitality.
Additional information
Pratik Chaudhari and Adam Oberman are joint first authors.
Cite this article
Chaudhari, P., Oberman, A., Osher, S. et al. Deep relaxation: partial differential equations for optimizing deep neural networks. Res Math Sci 5, 30 (2018). https://doi.org/10.1007/s40687-018-0148-y