Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11100)

Abstract

Learning algorithms for implicit generative models can optimize a variety of criteria that measure how the data distribution differs from the implicit model distribution, including the Wasserstein distance, the Energy distance, and the Maximum Mean Discrepancy criterion. A careful look at the geometries induced by these distances on the space of probability measures reveals interesting differences. In particular, we can establish surprising approximate global convergence guarantees for the 1-Wasserstein distance, even when the parametric generator has a nonconvex parametrization.


Notes

  1. Although failing to satisfy the separation property (2.i) can have serious practical consequences, recall that a pseudodistance always becomes a full-fledged distance on the quotient space \(\mathcal {X}/\mathcal {R}\) where \(\mathcal {R}\) denotes the equivalence relation \(x\mathcal {R}y\Leftrightarrow {d(x,y)} = 0\). All the theory applies as long as one never distinguishes two points separated by a zero distance. (A short worked sketch of this quotient construction follows these notes.)

  2. We use the notation \(f_{\#}\mu \) or \(f(x)_{\#}\mu (x)\) to denote the probability distribution obtained by applying function f or expression f(x) to samples of the distribution \(\mu \).

  3. Stochastic gradient descent often relies on unbiased gradient estimates (for a more general condition, see [10, Assumption 4.3]). This is not a given: estimating the Wasserstein distance (14) and its gradients on small minibatches gives severely biased estimates [7]. This is in fact very obvious for minibatches of size one (see the numerical sketch after these notes). Theorem 2.1 therefore provides an imperfect but useful alternative.

  4. The statement holds when there is an \(M > 0\) such that \(\mu \{x:|\textit{f}\,(q(x)/p(x))| > M\} = 0\). Restricting \(\mu \) to exclude such subsets and taking the limit \(M\rightarrow \infty \) may not work because \(\lim \sup \ne \sup \lim \) in general. Yet, in practice, the result can be verified by elementary calculus for the usual choices of \(\textit{f}\), such as those shown in Table 1.

  5. We take the square root because this is the quantity that behaves like a distance.

  6. The curious reader can pick an expression of \(F_d(t)=P\{\Vert x-x_i\Vert <t\}\) in [23], then derive an asymptotic bound for \(P\{\min _i\Vert x-x_i\Vert <t\}=1-(1-F_d(t))^n\) (one such derivation is sketched after these notes).

  7. Note that it is then important to use the \(\log (D)\) trick succinctly discussed in the original GAN paper [20] (a short sketch of this trick follows these notes).

  8. See [54] for the relation between the Energy distance and the Cramér distance.

  9. For instance, the set of probability measures on \(\mathbb {R}\) equipped with the total variation distance (6) is not separable because any dense subset needs one element in each of the disjoint balls \(B_x=\{\,P{\in }\mathcal {P}_{\!{\mathbb {R}}}:D_{TV}(P,\delta _x) < 1/2\,\}\).

  10. For the Wasserstein distance, see [56, Theorem 6.18]. For the Energy distance, both properties can be derived from Theorem 2.17 after recalling that \(\varPhi _\mathcal {X}\subset \mathcal {H}\) is both complete and separable because it is isometric to \(\mathcal {X}\), which is Polish.
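
A worked sketch of the quotient construction promised in note 1 (our notation, not the chapter's): the only point to check is that \(d\) is well defined on equivalence classes, after which separation holds by construction.

```latex
% Define \bar{d}([x],[y]) := d(x,y) on \mathcal{X}/\mathcal{R}.
% Well-definedness: if d(x,x') = 0 and d(y,y') = 0, the triangle
% inequality gives
\[
  d(x,y) \;\le\; d(x,x') + d(x',y') + d(y',y) \;=\; d(x',y'),
\]
% and the symmetric argument gives d(x',y') \le d(x,y), hence equality.
% Separation: \bar{d}([x],[y]) = 0 means d(x,y) = 0, i.e. x\mathcal{R}y,
% so [x] = [y].
```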
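
The numerical sketch referenced in note 3 (a minimal illustration in numpy; all names are ours): take the generator distribution already equal to the data distribution, so that the true 1-Wasserstein distance is zero, and observe that the expected distance between minibatches of size one stays bounded away from zero. An estimator whose expectation is wrong at the optimum yields biased gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# P and Q are the same distribution, so the true W1(P, Q) is 0.
# With minibatches of size one, the empirical estimate reduces to
# W1({x}, {y}) = |x - y|, whose expectation is strictly positive.
n_trials = 100_000
x = rng.normal(size=n_trials)  # one-sample minibatches from P
y = rng.normal(size=n_trials)  # one-sample minibatches from Q = P
print(np.abs(x - y).mean())    # ~ 2/sqrt(pi), about 1.13, far from 0
```

Since \(x-y\sim \mathcal{N}(0,2)\), the estimate concentrates around \(2/\sqrt{\pi }\) rather than 0, so minimizing the minibatch estimate does not minimize the true distance; this is the bias analyzed in [7].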
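
One way to carry out the exercise of note 6, assuming the small-\(t\) behavior \(F_d(t)\sim c_d\,t^d\) (for some constant \(c_d>0\)) that can be read off the expressions in [23]:

```latex
\begin{align*}
P\{\min_i \Vert x - x_i\Vert < t\}
  &= 1 - \bigl(1 - F_d(t)\bigr)^n
   = 1 - e^{\,n\log(1 - F_d(t))} \\
  &\approx 1 - e^{-n F_d(t)}
   \approx n\,c_d\,t^d
  \qquad\text{when } nF_d(t)\ll 1 .
\end{align*}
% Keeping the nearest of n samples within distance t of a typical x
% therefore requires n of order t^{-d}: the sample size must grow
% exponentially with the dimension d.
```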
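
The \(\log (D)\) trick mentioned in note 7 contrasts two generator losses from [20]; the sketch below (PyTorch-style, our function names) is only illustrative. Both losses share the same fixed points, but the non-saturating variant keeps useful gradients precisely when the discriminator confidently rejects the generated samples.

```python
import torch

# d_fake = D(G(z)): discriminator outputs in (0, 1) on generated samples.

def generator_loss_minimax(d_fake: torch.Tensor) -> torch.Tensor:
    # Original minimax objective: minimize E[log(1 - D(G(z)))].
    # Its gradient vanishes when d_fake -> 0, i.e. early in training.
    return torch.log1p(-d_fake).mean()

def generator_loss_log_d(d_fake: torch.Tensor) -> torch.Tensor:
    # log(D) trick: minimize E[-log D(G(z))] instead.
    # Strongest gradients exactly where the minimax loss saturates.
    return -torch.log(d_fake).mean()
```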

References

  1. Aizerman, M.A., Braverman, É.M., Rozonoér, L.I.: Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control 25, 821–837 (1964)

  2. Amari, S.I., Nagaoka, H.: Methods of Information Geometry, vol. 191. American Mathematical Society (2007)

  3. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, Australia, 7–9 August 2017

  4. Aronszajn, N.: Theory of reproducing kernels. Trans. Am. Math. Soc. 68, 337–404 (1950)

  5. Arora, S., Ge, R., Liang, Y., Ma, T., Zhang, Y.: Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573 (2017)

  6. Auffinger, A., Ben Arous, G.: Complexity of random smooth functions of many variables. Ann. Probab. 41(6), 4214–4247 (2013)

  7. Bellemare, M.G., et al.: The cramer distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743 (2017)

  8. Berti, P., Pratelli, L., Rigo, P.: Gluing lemmas and Skorohod representations. Electron. Commun. Probab. 20 (2015)

  9. Borkar, V.S.: Stochastic approximation with two time scales. Syst. Control Lett. 29(5), 291–294 (1997)

  10. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. CoRR abs/1606.04838 (2016)

  11. Bouchacourt, D., Mudigonda, P.K., Nowozin, S.: DISCO nets: DISsimilarity cOefficients networks. In: Advances in Neural Information Processing Systems, vol. 29, pp. 352–360 (2016)

  12. Burago, D., Burago, Y., Ivanov, S.: A Course in Metric Geometry. Graduate Studies in Mathematics, vol. 33. American Mathematical Society (2001)

  13. Challis, E., Barber, D.: Affine independent variational inference. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 2186–2194. Curran Associates, Inc. (2012)

  14. Cramér, H.: Mathematical Methods of Statistics. Princeton University Press, Princeton (1946)

  15. Denton, E., Chintala, S., Szlam, A., Fergus, R.: Deep generative image models using a Laplacian pyramid of adversarial networks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 1486–1494. Curran Associates, Inc. (2015)

  16. Dereich, S., Scheutzow, M., Schottstedt, R.: Constructive quantization: approximation by empirical measures. Annales de l’I.H.P. Probabilités et statistiques 49(4), 1183–1203 (2013)

  17. Dziugaite, G.K., Roy, D.M., Ghahramani, Z.: Training generative neural networks via maximum mean discrepancy optimization. In: Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI, pp. 258–267 (2015)

  18. Fournier, N., Guillin, A.: On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theor. Relat. Fields 162(3), 707–738 (2015)

  19. Freeman, C.D., Bruna, J.: Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540 (2016)

  20. Goodfellow, I.J., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680. Curran Associates, Inc. (2014)

  21. Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012)

  22. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.: Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028 (2017)

  23. Hammersley, J.M.: The distribution of distance in a hypersphere. Ann. Math. Stat. 21(3), 447–452 (1950)

  24. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics, 2nd edn. Springer, New York (2009)

  25. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)

  26. Khinchin, A.Y.: Sur la loi des grands nombres. Comptes Rendus de l’Académie des Sciences (1929)

  27. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. CoRR abs/1312.6114 (2013)

  28. Kocaoglu, M., Snyder, C., Dimakis, A.G., Vishwanath, S.: CausalGAN: learning causal implicit generative models with adversarial training. arXiv preprint arXiv:1709.02023 (2017)

  29. Konda, V.R., Tsitsiklis, J.N.: Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 796–819 (2004)

  30. Kulkarni, T.D., Kohli, P., Tenenbaum, J.B., Mansinghka, V.: Picture: a probabilistic programming language for scene perception. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pp. 4390–4399 (2015)

  31. Lee, M.W., Nevatia, R.: Dynamic human pose estimation using Markov Chain Monte Carlo approach. In: 7th IEEE Workshop on Applications of Computer Vision/IEEE Workshop on Motion and Video Computing (WACV/MOTION 2005), pp. 168–175 (2005)

  32. Li, C.L., Chang, W.C., Cheng, Y., Yang, Y., Póczos, B.: MMD GAN: towards deeper understanding of moment matching network. arXiv preprint arXiv:1705.08584 (2017)

  33. Li, Y., Swersky, K., Zemel, R.: Generative moment matching networks. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning, ICML 2015, vol. 37, pp. 1718–1727 (2015)

  34. Liu, S., Bousquet, O., Chaudhuri, K.: Approximation and convergence properties of generative adversarial learning. arXiv preprint arXiv:1705.08991 (2017). To appear in NIPS 2017

  35. Milgrom, P., Segal, I.: Envelope theorems for arbitrary choice sets. Econometrica 70(2), 583–601 (2002)

  36. von Mises, R.: On the asymptotic distribution of differentiable statistical functions. Ann. Math. Stat. 18(3), 309–348 (1947)

  37. Müller, A.: Integral probability metrics and their generating classes of functions. Adv. Appl. Probab. 29(2), 429–443 (1997)

  38. Neal, R.M.: Annealed importance sampling. Stat. Comput. 11(2), 125–139 (2001)

  39. Nguyen, X., Wainwright, M.J., Jordan, M.I.: Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theor. 56(11), 5847–5861 (2010)

  40. Nowozin, S., Cseke, B., Tomioka, R.: f-GAN: training generative neural samplers using variational divergence minimization. In: Advances in Neural Information Processing Systems, vol. 29, pp. 271–279 (2016)

  41. Rachev, S.T., Klebanov, L., Stoyanov, S.V., Fabozzi, F.: The Methods of Distances in the Theory of Probability and Statistics. Springer, New York (2013)

  42. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)

  43. Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. In: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, pp. 1278–1286 (2014)

  44. Romaszko, L., Williams, C.K., Moreno, P., Kohli, P.: Vision-as-inverse-graphics: obtaining a rich 3D explanation of a scene from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pp. 851–859 (2017)

  45. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, vol. 29, pp. 2234–2242 (2016)

  46. Schoenberg, I.J.: Metric spaces and positive definite functions. Trans. Am. Math. Soc. 44, 522–536 (1938)

  47. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge, MA (2002)

  48. Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K.: Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 41(5), 2263–2291 (2013)

  49. Serfling, R.J.: Approximation Theorems of Mathematical Statistics. Wiley, New York; Chichester (1980)

  50. Sriperumbudur, B.: On the optimal estimation of probability measures in weak and strong topologies. Bernoulli 22(3), 1839–1893 (2016)

  51. Sriperumbudur, B.K., Fukumizu, K., Gretton, A., Schölkopf, B., Lanckriet, G.R.: On the empirical estimation of integral probability metrics. Electron. J. Stat. 6, 1550–1599 (2012)

  52. Sriperumbudur, B.K., Fukumizu, K., Lanckriet, G.R.: Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res. 12, 2389–2410 (2011)

  53. Székely, G.J., Rizzo, M.L.: Energy statistics: a class of statistics based on distances. J. Stat. Plan. Infer. 143(8), 1249–1272 (2013)

  54. Székely, G.J.: E-statistics: the energy of statistical samples. Technical report 02–16, Bowling Green State University, Department of Mathematics and Statistics (2002)

  55. Theis, L., van den Oord, A., Bethge, M.: A note on the evaluation of generative models. In: International Conference on Learning Representations (2016)

  56. Villani, C.: Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin (2009)

  57. Zinger, A.A., Kakosyan, A.V., Klebanov, L.B.: A characterization of distributions by mean values of statistics and certain probabilistic metrics. J. Sov. Math. 4(59), 914–920 (1992). Translated from Problemy Ustoichivosti Stokhasticheskikh Modelei-Trudi seminara, pp. 47–55 (1989)

Acknowledgements

We would like to thank Joan Bruna, Marco Cuturi, Arthur Gretton, Yann Ollivier, and Arthur Szlam for stimulating discussions and also for pointing out numerous related works.

Corresponding author

Correspondence to Léon Bottou.


Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Bottou, L., Arjovsky, M., Lopez-Paz, D., Oquab, M. (2018). Geometrical Insights for Implicit Generative Modeling. In: Rozonoer, L., Mirkin, B., Muchnik, I. (eds) Braverman Readings in Machine Learning. Key Ideas from Inception to Current State. Lecture Notes in Computer Science, vol. 11100. Springer, Cham. https://doi.org/10.1007/978-3-319-99492-5_11

  • DOI: https://doi.org/10.1007/978-3-319-99492-5_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99491-8

  • Online ISBN: 978-3-319-99492-5

  • eBook Packages: Computer Science (R0)
