Deep relaxation: partial differential equations for optimizing deep neural networks

  • Pratik Chaudhari
  • Adam Oberman
  • Stanley Osher
  • Stefano Soatto
  • Guillaume Carlier


Entropy-SGD is a first-order optimization method which has been used successfully to train deep neural networks. This algorithm, which was motivated by statistical physics, is now interpreted as gradient descent on a modified loss function. The modified, or relaxed, loss function is the solution of a viscous Hamilton–Jacobi partial differential equation (PDE). Experimental results on modern, high-dimensional neural networks demonstrate that the algorithm converges faster than the benchmark stochastic gradient descent (SGD). Well-established PDE regularity results allow us to analyze the geometry of the relaxed energy landscape, confirming empirical evidence. Stochastic homogenization theory allows us to better understand the convergence of the algorithm. A stochastic control interpretation is used to prove that a modified algorithm converges faster than SGD in expectation.
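The algorithm the abstract describes can be sketched in code. Entropy-SGD estimates the gradient of the relaxed (local-entropy) loss with an inner Langevin (SGLD) loop, then takes an outer gradient step. The sketch below is a minimal illustration under assumed hyperparameter names (`gamma`, `eta`, `eta_prime`, `L`, `eps`, `alpha`), not the authors' reference implementation; the specific values are placeholders chosen so the toy example is stable.

```python
import numpy as np

def entropy_sgd_step(x, grad_f, gamma=0.1, eta=0.1, eta_prime=0.05,
                     L=20, eps=1e-4, alpha=0.75, rng=None):
    """One (sketched) Entropy-SGD update.

    The gradient of the relaxed loss at x is estimated as (x - <x'>)/gamma,
    where <x'> is the mean of the Gibbs measure
        rho(x') ∝ exp(-f(x') - |x - x'|^2 / (2*gamma)),
    approximated by L steps of stochastic gradient Langevin dynamics.
    """
    rng = np.random.default_rng() if rng is None else rng
    xp = x.copy()   # Langevin iterate x'
    mu = x.copy()   # exponential running average estimating <x'>
    for _ in range(L):
        # SGLD step: drift toward low loss and toward the anchor x
        dxp = grad_f(xp) + (xp - x) / gamma
        noise = np.sqrt(eta_prime) * eps * rng.standard_normal(x.shape)
        xp = xp - eta_prime * dxp + noise
        mu = alpha * mu + (1 - alpha) * xp
    # Outer step on the relaxed loss: grad f_gamma(x) ≈ (x - mu)/gamma
    return x - eta * (x - mu) / gamma
```

For instance, on the convex toy loss `f(x) = |x|^2 / 2` (so `grad_f = lambda z: z`), repeated calls to `entropy_sgd_step` drive `x` toward the minimizer at the origin; in practice `grad_f` would be a minibatch gradient of the network loss.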


Keywords: Deep learning · Partial differential equations · Stochastic gradient descent · Neural networks · Optimal control · Proximal



AO is supported by a grant from the Simons Foundation (395980); PC and SS by ONR N000141712072, AFOSR FA95501510229 and ARO W911NF151056466731CS; SO by ONR N000141410683, N000141210838, N000141712162 and DOE DE-SC00183838. AO would like to thank the hospitality of the UCLA mathematics department where this work was completed.



Copyright information

© SpringerNature 2018

Authors and Affiliations

  1. Computer Science Department, University of California, Los Angeles, USA
  2. Department of Mathematics and Statistics, McGill University, Montreal, Canada
  3. Department of Mathematics and Institute for Pure and Applied Mathematics, University of California, Los Angeles, USA
  4. CEREMADE, Université Paris IX Dauphine, Paris, France
