Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11100)

Abstract

Learning algorithms for implicit generative models can optimize a variety of criteria that measure how the data distribution differs from the implicit model distribution, including the Wasserstein distance, the Energy distance, and the Maximum Mean Discrepancy criterion. A careful look at the geometries induced by these distances on the space of probability measures reveals interesting differences. In particular, we can establish surprising approximate global convergence guarantees for the 1-Wasserstein distance, even when the parametric generator has a nonconvex parametrization.


Notes

  1. Although failing to satisfy the separation property (2.i) can have serious practical consequences, recall that a pseudodistance always becomes a full-fledged distance on the quotient space \(\mathcal {X}/\mathcal {R}\) where \(\mathcal {R}\) denotes the equivalence relation \(x\mathcal {R}y\Leftrightarrow {d(x,y)} = 0\). All the theory applies as long as one never distinguishes two points separated by a zero distance. (A short worked sketch of this quotient construction follows these notes.)

  2. We use the notation \(f_{\#}\mu \) or \(f(x)_{\#}\mu (x)\) to denote the probability distribution obtained by applying function f or expression f(x) to samples of the distribution \(\mu \).

  3. Stochastic gradient descent often relies on unbiased gradient estimates (for a more general condition, see [10, Assumption 4.3]). This is not a given: estimating the Wasserstein distance (14) and its gradients on small minibatches gives severely biased estimates [7]. This is in fact very obvious for minibatches of size one (see the numerical sketch after these notes). Theorem 2.1 therefore provides an imperfect but useful alternative.

  4. The statement holds when there is an \(M > 0\) such that \(\mu \{x:|\textit{f}\,(q(x)/p(x))| > M\} = 0\). Restricting \(\mu \) to exclude such subsets and taking the limit \(M\rightarrow \infty \) may not work because \(\lim \sup \ne \sup \lim \) in general. Yet, in practice, the result can be verified by elementary calculus for the usual choices of \(\textit{f}\), such as those shown in Table 1.

  5. We take the square root because this is the quantity that behaves like a distance.

  6. The curious reader can pick an expression of \(F_d(t)=P\{\Vert x-x_i\Vert <t\}\) in [23], then derive an asymptotic bound for \(P\{\min _i\Vert x-x_i\Vert <t\}=1-(1-F_d(t))^n\) (one such derivation is sketched after these notes).

  7. Note that it is then important to use the \(\log (D)\) trick succinctly discussed in the original GAN paper [20] (a short sketch of this trick follows these notes).

  8. See [54] for the relation between the Energy distance and the Cramér distance.

  9. For instance, the set of probability measures on \(\mathbb {R}\) equipped with the total variation distance (6) is not separable because any dense subset needs one element in each of the disjoint balls \(B_x=\{\,P{\in }\mathcal {P}_{\!{\mathbb {R}}}:D_{TV}(P,\delta _x) < 1/2\,\}\).

  10. For the Wasserstein distance, see [56, Theorem 6.18]. For the Energy distance, both properties can be derived from Theorem 2.17 after recalling that \(\varPhi _\mathcal {X}\subset \mathcal {H}\) is both complete and separable because it is isometric to \(\mathcal {X}\), which is Polish.
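
A worked sketch of the quotient construction promised in note 1 (our notation, not the chapter's): the only point to check is that \(d\) is well defined on equivalence classes, after which separation holds by construction.

```latex
% Define \bar{d}([x],[y]) := d(x,y) on \mathcal{X}/\mathcal{R}.
% Well-definedness: if d(x,x') = 0 and d(y,y') = 0, the triangle
% inequality gives
\[
  d(x,y) \;\le\; d(x,x') + d(x',y') + d(y',y) \;=\; d(x',y'),
\]
% and the symmetric argument gives d(x',y') \le d(x,y), hence equality.
% Separation: \bar{d}([x],[y]) = 0 means d(x,y) = 0, i.e. x\mathcal{R}y,
% so [x] = [y].
```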
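
The numerical sketch referenced in note 3 (a minimal illustration in numpy; all names are ours): take the generator distribution already equal to the data distribution, so that the true 1-Wasserstein distance is zero, and observe that the expected distance between minibatches of size one stays bounded away from zero. An estimator whose expectation is wrong at the optimum yields biased gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# P and Q are the same distribution, so the true W1(P, Q) is 0.
# With minibatches of size one, the empirical estimate reduces to
# W1({x}, {y}) = |x - y|, whose expectation is strictly positive.
n_trials = 100_000
x = rng.normal(size=n_trials)  # one-sample minibatches from P
y = rng.normal(size=n_trials)  # one-sample minibatches from Q = P
print(np.abs(x - y).mean())    # ~ 2/sqrt(pi), about 1.13, far from 0
```

Since \(x-y\sim \mathcal{N}(0,2)\), the estimate concentrates around \(2/\sqrt{\pi }\) rather than 0, so minimizing the minibatch estimate does not minimize the true distance; this is the bias analyzed in [7].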
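
One way to carry out the exercise of note 6, assuming the small-\(t\) behavior \(F_d(t)\sim c_d\,t^d\) (for some constant \(c_d>0\)) that can be read off the expressions in [23]:

```latex
\begin{align*}
P\{\min_i \Vert x - x_i\Vert < t\}
  &= 1 - \bigl(1 - F_d(t)\bigr)^n
   = 1 - e^{\,n\log(1 - F_d(t))} \\
  &\approx 1 - e^{-n F_d(t)}
   \approx n\,c_d\,t^d
  \qquad\text{when } nF_d(t)\ll 1 .
\end{align*}
% Keeping the nearest of n samples within distance t of a typical x
% therefore requires n of order t^{-d}: the sample size must grow
% exponentially with the dimension d.
```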
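
The \(\log (D)\) trick mentioned in note 7 contrasts two generator losses from [20]; the sketch below (PyTorch-style, our function names) is only illustrative. Both losses share the same fixed points, but the non-saturating variant keeps useful gradients precisely when the discriminator confidently rejects the generated samples.

```python
import torch

# d_fake = D(G(z)): discriminator outputs in (0, 1) on generated samples.

def generator_loss_minimax(d_fake: torch.Tensor) -> torch.Tensor:
    # Original minimax objective: minimize E[log(1 - D(G(z)))].
    # Its gradient vanishes when d_fake -> 0, i.e. early in training.
    return torch.log1p(-d_fake).mean()

def generator_loss_log_d(d_fake: torch.Tensor) -> torch.Tensor:
    # log(D) trick: minimize E[-log D(G(z))] instead.
    # Strongest gradients exactly where the minimax loss saturates.
    return -torch.log(d_fake).mean()
```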

References

  1. Aizerman, M.A., Braverman, É.M., Rozonoér, L.I.: Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control 25, 821–837 (1964)

  2. Amari, S.I., Nagaoka, H.: Methods of Information Geometry, vol. 191. American Mathematical Society (2007)

  3. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, Australia, 7–9 August 2017

  4. Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68, 337–404 (1950)

  5. Arora, S., Ge, R., Liang, Y., Ma, T., Zhang, Y.: Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573 (2017)

  6. Auffinger, A., Ben Arous, G.: Complexity of random smooth functions of many variables. Ann. Probab. 41(6), 4214–4247 (2013)

  7. Bellemare, M.G., et al.: The cramer distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743 (2017)

  8. Berti, P., Pratelli, L., Rigo, P.: Gluing lemmas and Skorohod representations. Electron. Commun. Probab. 20 (2015)

  9. Borkar, V.S.: Stochastic approximation with two time scales. Syst. Control Lett. 29(5), 291–294 (1997)

  10. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. CoRR abs/1606.04838 (2016)

  11. Bouchacourt, D., Mudigonda, P.K., Nowozin, S.: DISCO nets: DISsimilarity cOefficients networks. In: Advances in Neural Information Processing Systems, vol. 29, pp. 352–360 (2016)

  12. Burago, D., Burago, Y., Ivanov, S.: A Course in Metric Geometry. Graduate Studies in Mathematics, vol. 33. American Mathematical Society (2001)

  13. Challis, E., Barber, D.: Affine independent variational inference. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 2186–2194. Curran Associates, Inc. (2012)

  14. Cramér, H.: Mathematical Methods of Statistics. Princeton University Press, Princeton (1946)

  15. Denton, E., Chintala, S., Szlam, A., Fergus, R.: Deep generative image models using a Laplacian pyramid of adversarial networks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 1486–1494. Curran Associates, Inc. (2015)

  16. Dereich, S., Scheutzow, M., Schottstedt, R.: Constructive quantization: approximation by empirical measures. Annales de l’I.H.P. Probabilités et statistiques 49(4), 1183–1203 (2013)

  17. Dziugaite, G.K., Roy, D.M., Ghahramani, Z.: Training generative neural networks via maximum mean discrepancy optimization. In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI, pp. 258–267 (2015)

  18. Fournier, N., Guillin, A.: On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theor. Relat. Fields 162(3), 707–738 (2015)

  19. Freeman, C.D., Bruna, J.: Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540 (2016)

  20. Goodfellow, I.J., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680. Curran Associates, Inc. (2014)

  21. Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012)

  22. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028 (2017)

  23. Hammersley, J.M.: The distribution of distance in a hypersphere. Ann. Math. Stat. 21(3), 447–452 (1950)

  24. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics, 2nd edn. Springer, New York (2009)

  25. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)

  26. Khinchin, A.Y.: Sur la loi des grands nombres. Comptes Rendus de l’Académie des Sciences (1929)

  27. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. CoRR abs/1312.6114 (2013)

  28. Kocaoglu, M., Snyder, C., Dimakis, A.G., Vishwanath, S.: CausalGAN: learning causal implicit generative models with adversarial training. arXiv preprint arXiv:1709.02023 (2017)

  29. Konda, V.R., Tsitsiklis, J.N.: Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 796–819 (2004)

  30. Kulkarni, T.D., Kohli, P., Tenenbaum, J.B., Mansinghka, V.: Picture: a probabilistic programming language for scene perception. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pp. 4390–4399 (2015)

  31. Lee, M.W., Nevatia, R.: Dynamic human pose estimation using Markov Chain Monte Carlo approach. In: 7th IEEE Workshop on Applications of Computer Vision/IEEE Workshop on Motion and Video Computing (WACV/MOTION 2005), pp. 168–175 (2005)

  32. Li, C.L., Chang, W.C., Cheng, Y., Yang, Y., Póczos, B.: MMD GAN: towards deeper understanding of moment matching network. arXiv preprint arXiv:1705.08584 (2017)

  33. Li, Y., Swersky, K., Zemel, R.: Generative moment matching networks. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning, ICML 2015, vol. 37, pp. 1718–1727 (2015)

  34. Liu, S., Bousquet, O., Chaudhuri, K.: Approximation and convergence properties of generative adversarial learning. arXiv preprint arXiv:1705.08991 (2017). To appear in NIPS 2017

  35. Milgrom, P., Segal, I.: Envelope theorems for arbitrary choice sets. Econometrica 70(2), 583–601 (2002)

  36. von Mises, R.: On the asymptotic distribution of differentiable statistical functions. Ann. Math. Stat. 18(3), 309–348 (1947)

  37. Müller, A.: Integral probability metrics and their generating classes of functions. Adv. Appl. Probab. 29(2), 429–443 (1997)

  38. Neal, R.M.: Annealed importance sampling. Stat. Comput. 11(2), 125–139 (2001)

  39. Nguyen, X., Wainwright, M.J., Jordan, M.I.: Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theor. 56(11), 5847–5861 (2010)

  40. Nowozin, S., Cseke, B., Tomioka, R.: f-GAN: training generative neural samplers using variational divergence minimization. In: Advances in Neural Information Processing Systems, vol. 29, pp. 271–279 (2016)

  41. Rachev, S.T., Klebanov, L., Stoyanov, S.V., Fabozzi, F.: The Methods of Distances in the Theory of Probability and Statistics. Springer, New York (2013)

  42. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)

  43. Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. In: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, pp. 1278–1286 (2014)

  44. Romaszko, L., Williams, C.K., Moreno, P., Kohli, P.: Vision-as-inverse-graphics: obtaining a rich 3D explanation of a scene from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pp. 851–859 (2017)

  45. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, vol. 29, pp. 2234–2242 (2016)

  46. Schoenberg, I.J.: Metric spaces and positive definite functions. Trans. Am. Math. Soc. 44, 522–536 (1938)

  47. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge, MA (2002)

  48. Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K.: Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 41(5), 2263–2291 (2013)

  49. Serfling, R.J.: Approximation Theorems of Mathematical Statistics. Wiley, New York; Chichester (1980)

  50. Sriperumbudur, B.: On the optimal estimation of probability measures in weak and strong topologies. Bernoulli 22(3), 1839–1893 (2016)

  51. Sriperumbudur, B.K., Fukumizu, K., Gretton, A., Schölkopf, B., Lanckriet, G.R.: On the empirical estimation of integral probability metrics. Electron. J. Stat. 6, 1550–1599 (2012)

  52. Sriperumbudur, B.K., Fukumizu, K., Lanckriet, G.R.: Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res. 12, 2389–2410 (2011)

  53. Székely, G.J., Rizzo, M.L.: Energy statistics: a class of statistics based on distances. J. Stat. Plan. Infer. 143(8), 1249–1272 (2013)

  54. Székely, G.J.: E-statistics: the energy of statistical samples. Technical report 02–16, Bowling Green State University, Department of Mathematics and Statistics (2002)

  55. Theis, L., van den Oord, A., Bethge, M.: A note on the evaluation of generative models. In: International Conference on Learning Representations (2016)

  56. Villani, C.: Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin (2009)

  57. Zinger, A.A., Kakosyan, A.V., Klebanov, L.B.: A characterization of distributions by mean values of statistics and certain probabilistic metrics. J. Sov. Math. 4(59), 914–920 (1992). Translated from Problemy Ustoichivosti Stokhasticheskikh Modelei-Trudi seminara, pp. 47–55 (1989)

Acknowledgements

We would like to thank Joan Bruna, Marco Cuturi, Arthur Gretton, Yann Ollivier, and Arthur Szlam for stimulating discussions and also for pointing out numerous related works.

Corresponding author

Correspondence to Léon Bottou.


Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Bottou, L., Arjovsky, M., Lopez-Paz, D., Oquab, M. (2018). Geometrical Insights for Implicit Generative Modeling. In: Rozonoer, L., Mirkin, B., Muchnik, I. (eds) Braverman Readings in Machine Learning. Key Ideas from Inception to Current State. Lecture Notes in Computer Science, vol. 11100. Springer, Cham. https://doi.org/10.1007/978-3-319-99492-5_11

  • DOI: https://doi.org/10.1007/978-3-319-99492-5_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99491-8

  • Online ISBN: 978-3-319-99492-5

  • eBook Packages: Computer Science (R0)
