Abstract
Neural networks achieve high generalization performance on many tasks despite being highly over-parameterized. Since classical statistical learning theory struggles to explain this behaviour, much recent effort has focused on uncovering the mechanisms behind it, in the hope of developing a more adequate theoretical framework and gaining better control over the trained models. In this work, we adopt an alternative perspective, viewing the neural network as a dynamical system displacing input particles over time. We conduct a series of experiments and, by analyzing the network's behaviour through its displacements, we show the presence of a low kinetic energy bias in the transport map of the network, and link this bias with generalization performance. From this observation, we reformulate the learning problem as follows: find neural networks that solve the task while transporting the data as efficiently as possible. This offers a novel formulation of the learning problem which allows us to provide regularity results for the solution network, based on Optimal Transport theory. From a practical viewpoint, this allows us to propose a new learning algorithm, which automatically adapts to the complexity of the task and leads to networks with high generalization ability even in low-data regimes.
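To make the dynamical-system view concrete, the sketch below treats a toy residual network as a discrete-time transport map: each block displaces the input particles by \(v_k = f_k(x_k)\), and the discrete kinetic energy \(\sum_k \Vert v_k\Vert^2\) of the trajectory is accumulated alongside the forward pass. This is a minimal illustration of the regularized objective described in the abstract, not the paper's exact algorithm; all names (`forward`, `lam`, the `tanh` blocks, the placeholder task loss) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual network: x_{k+1} = x_k + f_k(x_k), with f_k(x) = tanh(x @ W_k).
depth, dim, n = 4, 3, 32
weights = [0.1 * rng.standard_normal((dim, dim)) for _ in range(depth)]

def forward(x, weights):
    """Return the network output and the discrete kinetic energy of the trajectory."""
    energy = 0.0
    for W in weights:
        v = np.tanh(x @ W)                          # displacement ("velocity") at this layer
        energy += np.mean(np.sum(v ** 2, axis=1))   # accumulate sum_k ||v_k||^2
        x = x + v                                   # residual update moves the particles
    return x, energy

x0 = rng.standard_normal((n, dim))
out, kinetic = forward(x0, weights)

# Regularized objective: fit the task while transporting the data cheaply.
lam = 0.1                                # hypothetical trade-off weight
task_loss = np.mean((out - x0) ** 2)     # placeholder task loss for illustration
total_loss = task_loss + lam * kinetic
```

Minimizing `total_loss` over the weights would then prefer, among all networks that solve the task, those whose transport map has low kinetic energy.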
Notes
1. \(T_\sharp \alpha \) is the push-forward measure: \(T_\sharp \alpha (B) = \alpha (T^{-1}(B))\) for any measurable set B.
2. By this, we mean that \(\Vert T^{\theta ^\star }-T^\star \Vert _\infty \le \epsilon \), where \(T^\star \) is the OT map.
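The push-forward identity \(T_\sharp \alpha (B) = \alpha (T^{-1}(B))\) can be checked empirically on samples: the mass that the transported samples place in a set B equals the mass the original samples place in its preimage. A small NumPy illustration, with the map \(T(x) = 2x\) and \(B = [0, 1]\) chosen purely for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10_000)   # samples of alpha = Uniform[0, 1]

T = lambda x: 2.0 * x                    # transport map T
B = (0.0, 1.0)                           # measurable set B = [0, 1]

# T_#alpha(B): fraction of pushed-forward samples T(x) landing in B
pushforward_mass = np.mean((T(x) >= B[0]) & (T(x) <= B[1]))
# alpha(T^{-1}(B)): fraction of original samples in T^{-1}(B) = [0, 0.5]
preimage_mass = np.mean((x >= 0.0) & (x <= 0.5))
```

The two quantities agree exactly here, since the same samples are counted against equivalent events (\(2x \le 1 \Leftrightarrow x \le 0.5\)).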
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Karkar, S., Ayed, I., Bézenac, E.d., Gallinari, P. (2021). A Principle of Least Action for the Training of Neural Networks. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science(), vol 12458. Springer, Cham. https://doi.org/10.1007/978-3-030-67661-2_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67660-5
Online ISBN: 978-3-030-67661-2