Abstract
In this paper, we introduce a theoretical framework for semi-discrete optimization using ideas from optimal transport. Our primary motivation comes from deep learning, and specifically from the task of neural architecture search. With this aim in mind, we discuss the geometric and theoretical motivation for new techniques for neural architecture search; in the companion work (García Trillos et al. in Traditional and accelerated gradient descent for neural architecture search, 2021), we show that algorithms inspired by our framework are competitive with contemporaneous methods. We introduce a Riemannian-like metric on the space of probability measures over a semi-discrete space \({\mathbb {R}}^d \times \mathcal {G}\), where \(\mathcal {G}\) is a finite weighted graph. With this Riemannian structure in hand, we derive formal expressions for the gradient flow of a relative entropy functional, as well as second-order dynamics for the optimization of this energy. Then, with the aim of providing a rigorous motivation for the formally derived gradient flow equations, we consider an iterative procedure known as the minimizing movement scheme (i.e., the implicit Euler scheme, or JKO scheme) and apply it to the relative entropy with respect to a suitable cost function. For some specific choices of metric and cost, we rigorously show that the minimizing movement scheme of the relative entropy functional converges to the gradient flow process provided by the formal Riemannian structure. This flow coincides with a system of reaction–diffusion equations on \({\mathbb {R}}^d\).
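For readers unfamiliar with the minimizing movement scheme mentioned in the abstract, its generic one-step form can be sketched as follows. The notation here is illustrative rather than the paper's own: \(W\) stands for a transport-type cost between measures, \(\tau > 0\) for the time step, and \(E\) for the relative entropy functional being minimized.

```latex
% One step of the minimizing movement (implicit Euler / JKO) scheme:
% given the current measure \rho_k and step size \tau > 0, set
\rho_{k+1} \in \operatorname*{arg\,min}_{\rho}
  \left\{ \frac{1}{2\tau}\, W(\rho_k, \rho)^2 + E(\rho) \right\}.
% As \tau \to 0, a suitable interpolation of the iterates (\rho_k)
% converges, under appropriate conditions, to the gradient flow of E
% in the geometry induced by the cost W.
```

In the classical Euclidean–Wasserstein setting of Jordan, Kinderlehrer and Otto (1998), this limit recovers the Fokker–Planck equation; the paper's contribution concerns the analogous construction on the semi-discrete space \({\mathbb {R}}^d \times \mathcal {G}\).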
References
Ambrosio, L., Gigli, N.: A User’s Guide to Optimal Transport, pp. 1–155. Springer, Berlin (2013)
Ambrosio, L., Gigli, N., Savaré, G.: Gradient Flows in Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics ETH Zürich. Birkhäuser, Basel (2005)
Benamou, J.-D., Brenier, Y.: A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numer. Math. 84(3), 375–393 (2000)
Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24, pp. 2546–2554. Curran Associates Inc., Red Hook (2011)
Chow, S.-N., Huang, W., Li, Y., Zhou, H.: Fokker–Planck equations for a free energy functional or Markov process on a graph. Arch. Ration. Mech. Anal. 203(3), 969–1008 (2012)
Chung, F.: Spectral Graph Theory. American Mathematical Society, Providence (1996)
do Carmo, M.P.: Riemannian Geometry. Mathematics: Theory & Applications. Birkhäuser Boston, Inc., Boston (1992) (Translated from the second Portuguese edition by Francis Flaherty)
Elsken, T., Metzen, J.-H., Hutter, F.: Simple and efficient architecture search for convolutional neural networks (2017). arXiv:1711.04528
Erbar, M., Fathi, M., Laschos, V., Schlichting, A.: Gradient flow structure for Mckean–Vlasov equations on discrete spaces (2016)
Erbar, M., Maas, J.: Ricci curvature of finite Markov chains via convexity of the entropy. Arch. Ration. Mech. Anal. 206(3), 997–1038 (2012)
Esposito, A., Patacchini, F.S., Schlichting, A., Slepcev, D.: Nonlocal-interaction equation on graphs: gradient flow structure and continuum limit (2019). arXiv:1912.09834
Figalli, A., Gigli, N.: A new transportation distance between non-negative measures, with applications to gradient flows with Dirichlet boundary conditions. J. Math. Pures Appl. 94(2), 107–130 (2010)
Garbuno-Inigo, A., Hoffmann, F., Li, W., Stuart, A.M.: Interacting Langevin diffusions: gradient structure and ensemble Kalman sampler (2019). arXiv:1903.08866
García Trillos, N.: Gromov–Hausdorff limit of Wasserstein spaces on point clouds. Calc. Var. 59, 73 (2020). https://doi.org/10.1007/s00526-020-1729-3
Gigli, N., Maas, J.: Gromov–Hausdorff convergence of discrete transportation metrics. SIAM J. Math. Anal. 45(2), 879–899 (2013)
Gladbach, P., Kopfer, E., Maas, J.: Scaling limits of discrete optimal transport. SIAM J. Math. Anal. 52(3), 2759–2802 (2020)
Gladbach, P., Kopfer, E., Maas, J., Portinale, L.: Homogenisation of one-dimensional discrete optimal transport. J. Math. Pures Appl. 139, 204–234 (2020)
Jordan, R., Kinderlehrer, D., Otto, F.: The variational formulation of the Fokker–Planck equation. SIAM J. Math. Anal. 29(1), 1–17 (1998)
Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.-J., Fei-Fei, L., Yuille, A., Huang, J., Murphy, K.: Progressive neural architecture search. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision—ECCV 2018, pp. 19–35. Springer, Cham (2018)
Maas, J.: Gradient flows of the entropy for finite Markov chains. J. Funct. Anal. 261(8), 2250–2292 (2011)
Mielke, A.: A gradient structure for reaction–diffusion systems and for energy-drift-diffusion systems. Nonlinearity 24(4), 1329–1346 (2011)
Mielke, A.: Geodesic convexity of the relative entropy in reversible Markov chains. Calc. Var. Partial Differ. Equ. 48(1), 1–31 (2013)
Peyré, G., Cuturi, M.: Computational Optimal Transport: With Applications to Data Science, Foundations and Trends in Machine Learning, vol. 11, pp. 355–607 (2019)
Pham, H., Guan, M., Zoph, B., Le, Q., Dean, J.: Efficient neural architecture search via parameters sharing. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80, pp. 4095–4104. PMLR, Stockholmsmässan, Stockholm, 10–15 Jul (2018)
Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: AAAI (2018)
Simon, J.: Compact sets in the space \(L^p(0, T; B)\). Annali di Matematica Pura ed Applicata 146, 65–96 (1986)
Stanley, K.O., Miikkulainen, R.: Evolving neural networks through augmenting topologies. Evol. Comput. 10(2), 99–127 (2002)
Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17(153), 1–43 (2016)
García Trillos, N., Morales, F., Morales, J.: Traditional and accelerated gradient descent for neural architecture search. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information. GSI 2021. Lecture Notes in Computer Science, vol. 12829. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-80209-7_55
Villani, C.: Optimal Transport. Springer, Berlin (2009)
Williams, R.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3–4), 229–256 (1992)
Yu, T., Zhu, H.: Hyper-parameter optimization: a review of algorithms and applications (2020). arXiv:2003.05689
Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning (2016). arXiv:1611.01578
Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition (2017). arXiv:1707.07012
Acknowledgements
N. García Trillos was supported by NSF-DMS 2005797. The work of J. Morales was supported by NSF grants DMS16-13911, RNMS11-07444 (KI-Net) and ONR grant N00014-1812465. Support for this research was provided by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison with funding from the Wisconsin Alumni Research Foundation.
Communicated by Mary Pugh.
Cite this article
García Trillos, N., Morales, J. Semi-discrete Optimization Through Semi-discrete Optimal Transport: A Framework for Neural Architecture Search. J Nonlinear Sci 32, 27 (2022). https://doi.org/10.1007/s00332-022-09780-2