
Variational representations of annealing paths: Bregman information under monotonic embedding

  • Research Paper
  • Published in: Information Geometry

A Publisher Correction to this article was published on 26 February 2024

This article has been updated

Abstract

Markov chain Monte Carlo methods for sampling from complex distributions and estimating normalization constants often simulate samples from a sequence of intermediate distributions along an annealing path, which bridges between a tractable initial distribution and a target density of interest. Prior works have constructed annealing paths using quasi-arithmetic means, and interpreted the resulting intermediate densities as minimizing an expected divergence to the endpoints. To analyze these variational representations of annealing paths, we extend known results showing that the arithmetic mean over arguments minimizes the expected Bregman divergence to a single representative point. In particular, we obtain an analogous result for quasi-arithmetic means, when the inputs to the Bregman divergence are transformed under a monotonic embedding function. Our analysis highlights the interplay between quasi-arithmetic means, parametric families, and divergence functionals using the rho-tau representational Bregman divergence framework, and associates common divergence functionals with intermediate densities along an annealing path.


Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Change history

Notes

  1. The identity dual representation leads to favorable properties for divergence minimization under linear constraints. Csiszár [20] characterizes Beta divergences as providing scale-invariant projection onto the set of positive measures satisfying expectation constraints, and Naudts [48, Chap. 8] discusses related thermodynamic interpretations. Here, the Beta divergence is preferred in place of the \(\alpha \)-divergence, which induces escort expectations due to the deformed dual representation \(\tau ({\tilde{\pi }}) = \log _{1-q}{\tilde{\pi }}= \frac{1}{q}{\tilde{\pi }}({x})^{q} - \frac{1}{q}\) from Example 5 [48, 80].

  2. Note, we order the arguments to match the Amari \(\alpha \)-divergence [65, 77], and use constant factors to induce limiting behavior of \(D_{\text {KL}}[{\pi _a}: {\pi _b}]\) as \(\alpha \rightarrow 0\) and \(D_{\text {KL}}[{\pi _b}:{\pi _a}]\) as \(\alpha \rightarrow 1\).

  3. While rigorous constructions of such statistical manifolds are considered in [6, 33, 43, 44, 61], we assume the manifold and tangent spaces are well-defined (as is done in [78]).

References

  1. Adlam, B., Gupta, N., Mariet, Z., Smith, J.: Understanding the bias-variance tradeoff of Bregman divergences. arXiv preprint arXiv:2202.04167 (2022)

  2. Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R.A., Murphy, K.: Fixing a broken ELBO. In: International Conference on Machine Learning, pp. 159–168 (2018)

  3. Amari, S.: Differential geometry of curved exponential families-curvatures and information loss. Ann. Stat. 13, 357–385 (1982)


  4. Amari, S., Nagaoka, H.: Methods of Information Geometry, vol. 191. American Mathematical Society, New York (2000)

  5. Amid, E., Anil, R., Fifty, C., Warmuth, M.K.: Layerwise Bregman representation learning of neural networks with applications to knowledge distillation. In: Transactions on Machine Learning Research (2022)

  6. Ay, N., Jost, J., Vân Lê, H., Schwachhöfer, L.: Information Geometry, vol. 64. Springer, New York (2017)


  7. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)


  8. Banerjee, A., Guo, X., Wang, H.: On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. Inf. Theory 51(7), 2664–2669 (2005)


  9. Basu, A., Harris, I.R., Hjort, N.L., Jones, M.: Robust and efficient estimation by minimising a density power divergence. Biometrika 85(3), 549–559 (1998)


  10. Bercher, J.F.: A simple probabilistic construction yielding generalized entropies and divergences, escort distributions and \(q\)-Gaussians. Phys. A Stat. Mech. Appl. 391(19), 4460–4469 (2012)


  11. Betancourt, M., Byrne, S., Livingstone, S., Girolami, M., et al.: Geometric foundations of Hamiltonian Monte Carlo. Bernoulli 23(4A), 2257–2298 (2017)


  12. Blondel, M., Martins, A.F., Niculae, V.: Learning with Fenchel–Young losses. J. Mach. Learn. Res. 21(35), 1–69 (2020)


  13. Brekelmans, R., Huang, S., Ghassemi, M., Steeg, G.V., Grosse, R.B., Makhzani, A.: Improving mutual information estimation with annealed and energy-based bounds. In: International Conference on Learning Representations (2022)

  14. Brekelmans, R., Masrani, V., Wood, F., Ver Steeg, G., Galstyan, A.: All in the exponential family: Bregman duality in thermodynamic variational inference. In: Proceedings of the 37th International Conference on Machine Learning, JMLR.org, ICML’20 (2020)

  15. Brekelmans, R., Nielsen, F., Galstyan, A., Steeg, G.V.: Likelihood ratio exponential families. In: NeurIPS Workshop on Information Geometry in Deep Learning. https://openreview.net/forum?id=RoTADibt26_ (2020)

  16. Burbea, J., Rao, C.: Entropy differential metric, distance and divergence measures in probability spaces: a unified approach. J. Multivariate Anal. 12(4), 575–596 (1982). https://doi.org/10.1016/0047-259X(82)90065-3


  17. Chatterjee, S., Diaconis, P.: The sample size required in importance sampling. Ann. Appl. Probab. 28(2), 1099–1135 (2018)


  18. Cichocki, A., Amari, S.: Families of alpha-, beta- and gamma-divergences: flexible and robust measures of similarities. Entropy 12(6), 1532–1568 (2010)

  19. Cichocki, A., Cruces, S., Amari, S.: Generalized alpha–beta divergences and their application to robust nonnegative matrix factorization. Entropy 13(1), 134–170 (2011)

  20. Csiszár, I.: Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Stat. 19(4), 2032–2066 (1991)


  21. Del Moral, P., Doucet, A., Jasra, A.: Sequential Monte Carlo samplers. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68(3), 411–436 (2006)


  22. Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D.: Hybrid Monte Carlo. Phys. Lett. B 195(2), 216–222 (1987)


  23. Earl, D.J., Deem, M.W.: Parallel tempering: theory, applications, and new perspectives. Phys. Chem. Chem. Phys. 7(23), 3910–3916 (2005)


  24. Eguchi, S., Komori, O., Ohara, A.: Information geometry associated with generalized means. In: Information Geometry and Its Applications IV, pp. 279–295. Springer, New York (2016)

  25. Eguchi, S., Komori, O.: Path connectedness on a space of probability density functions. In: International Conference on Geometric Science of Information, pp. 615–624 (2015)

  26. Eguchi, S.: Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat. 4, 793–803 (1983)


  27. Eguchi, S.: A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima Math. J. 15(2), 341–391 (1985)


  28. Eguchi, S.: Information geometry and statistical pattern recognition. Sugaku Expos. 19(2), 197–216 (2006)


  29. Frigyik, B.A., Srivastava, S., Gupta, M.R.: Functional Bregman divergence and Bayesian estimation of distributions. IEEE Trans. Inf. Theory 54(11), 5130–5139 (2008)


  30. Geist, M., Scherrer, B., Pietquin, O.: A theory of regularized Markov decision processes. In: International Conference on Machine Learning, PMLR, pp. 2160–2169 (2019)

  31. Gelman, A., Meng, X.L.: Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Stat. Sci. 5, 163–185 (1998)


  32. Goshtasbpour, S., Cohen, V., Perez-Cruz, F.: Adaptive annealed importance sampling with constant rate progress. In: International Conference on Machine Learning (2023)

  33. Grasselli, M.R.: Dual connections in nonparametric classical information geometry. Ann. Inst. Stat. Math. 62(5), 873–896 (2010)


  34. Grosse, R.B., Maddison, C.J., Salakhutdinov, R.R.: Annealing between distributions by averaging moments. In: Advances in neural information processing systems, pp. 2769–2777 (2013)

  35. Hardy, G., Littlewood, J., Pólya, G.: Inequalities. Math. Gazette 37(321), 236–236 (1953)

  36. Jarzynski, C.: Equilibrium free-energy differences from nonequilibrium measurements: a master-equation approach. Phys. Rev. E 56(5), 5018 (1997)


  37. Jarzynski, C.: Nonequilibrium equality for free energy differences. Phys. Rev. Lett. 78(14), 2690 (1997)


  38. Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. 106(4), 620 (1957)


  39. Kaniadakis, G., Scarfone, A.: A new one-parameter deformation of the exponential function. Phys. A Stat. Mech. Appl. 305(1–2), 69–75 (2002)


  40. Knoblauch, J., Jewson, J., Damoulas, T.: Generalized variational inference: three arguments for deriving new posteriors. arXiv preprint arXiv:1904.02063 (2019)

  41. Kolmogorov, A.N.: Sur la Notion de la Moyenne. G. Bardi, tip. della R. Accad. dei Lincei (1930)

  42. Lin, J.: Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 37(1), 145–151 (1991)


  43. Loaiza, G.I., Quiceno, H.R.: A Riemannian geometry in the \(q\)-exponential Banach manifold induced by \(q\)-divergences. In: Geometric Science of Information. First International Conference, GSI 2013, Paris, France, August 28–30, 2013. Proceedings, pp. 737–742. Springer, Berlin (2013)

  44. Loaiza, G.I., Quiceno, H.: A \(q\)-exponential statistical Banach manifold. J. Math. Anal. Appl. 398(2), 466–476 (2013)


  45. Masrani, V., Brekelmans, R., Bui, T., Nielsen, F., Galstyan, A., Steeg, G.V., Wood, F.: q-Paths: generalizing the geometric annealing path using power means. Uncertain. Artif. Intell. (2021)

  46. Murata, N., Takenouchi, T., Kanamori, T., Eguchi, S.: Information geometry of U-boost and Bregman divergence. Neural Comput. 16(7), 1437–1481 (2004)


  47. Naudts, J.: Estimators, escort probabilities, and phi-exponential families in statistical physics. arXiv preprint arXiv:math-ph/0402005 (2004)

  48. Naudts, J.: Generalised Thermostatistics. Springer, New York (2011)


  49. Naudts, J., Zhang, J.: Rho–tau embedding and gauge freedom in information geometry. Inf. Geom. 1(1), 79–115 (2018)


  50. Neal, R.M.: MCMC using Hamiltonian dynamics. In: Handbook of Markov chain Monte Carlo, p. 113 (2011)

  51. Neal, R.M.: Annealed importance sampling. Stat. Comput. 11(2), 125–139 (2001)


  52. Nguyen, X., Wainwright, M.J., Jordan, M.I.: Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 56(11), 5847–5861 (2010)


  53. Nielsen, F., Nock, R.: On Rényi and Tsallis entropies and divergences for exponential families. arXiv preprint arXiv:1105.3259 (2011)

  54. Nielsen, F.: An elementary introduction to information geometry. Entropy 22(10), 34 (2020)


  55. Nielsen, F., Boltz, S.: The Burbea–Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 57(8), 5455–5466 (2011)


  56. Nock, R., Cranko, Z., Menon, A.K., Qu, L., Williamson, R.C.: \(f\)-GANs in an information geometric nutshell. In: Advances in Neural Information Processing Systems (2017)

  57. Nock, R., Nielsen, F.: Fitting the smallest enclosing Bregman ball. In: European Conference on Machine Learning, pp. 649–656. Springer, New York (2005)

  58. Nowozin, S., Cseke, B., Tomioka, R.: \(f\)-GAN: training generative neural samplers using variational divergence minimization. Neural Inf. Process. Syst. 29, 11 (2016)


  59. Ogata, Y.: A Monte Carlo method for high dimensional integration. Numer. Math. 55(2), 137–157 (1989)


  60. Pfau, D.: A Generalized Bias-Variance Decomposition for Bregman Divergences. Unpublished manuscript (2013)

  61. Pistone, G., Sempi, C.: An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 4, 1543–1561 (1995)


  62. Poole, B., Ozair, S., Van Den Oord, A., Alemi, A., Tucker, G.: On variational bounds of mutual information. In: International Conference on Machine Learning, pp. 5171–5180 (2019)

  63. Rényi, A.: On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, Calif., pp. 547–561 (1961). https://projecteuclid.org/euclid.bsmsp/1200512181

  64. Rossky, P.J., Doll, J., Friedman, H.: Brownian dynamics as smart Monte Carlo simulation. J. Chem. Phys. 69(10), 4628–4633 (1978)


  65. Amari, S.: Integration of stochastic models by minimizing \(\alpha \)-divergence. Neural Comput. 19(10), 2780–2796 (2007)

  66. Amari, S.: Information Geometry and Its Applications, vol. 194. Springer, New York (2016)

  67. Sibson, R.: Information radius. Z. Wahrscheinlichkeitstheor. Verwandte Gebiete 14(2), 149–160 (1969)


  68. Syed, S., Romaniello, V., Campbell, T., Bouchard-Côté, A.: Parallel tempering on optimized paths. In: International Conference on Machine Learning (2021)

  69. Tishby, N., Pereira, F.C., Bialek, W.: The information bottleneck method. In: Allerton Conference on Communications, Control and Computing, pp. 368–377 (1999)

  70. Tsallis, C.: Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World. Springer, New York (2009)

  71. Tsallis, C.: Possible generalization of Boltzmann–Gibbs statistics. J. Stat. Phys. 52(1–2), 479–487 (1988)


  72. Van Erven, T., Harremos, P.: Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 60(7), 3797–3820 (2014)


  73. Vellal, A., Chakraborty, S., Xu, J.Q.: Bregman power k-means for clustering exponential family data. In: International Conference on Machine Learning, PMLR, pp. 22103–22119 (2022)

  74. Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), Citeseer, pp. 681–688 (2011)

  75. Wong, T.K.L., Zhang, J.: Tsallis and Rényi deformations linked via a new \(\lambda \)-duality. arXiv preprint arXiv:2107.11925 (2021)

  76. Xu, J., Lange, K.: Power k-means clustering. In: International Conference on Machine Learning, PMLR, pp. 6921–6931 (2019)

  77. Zhang, J.: Divergence function, duality, and convex analysis. Neural Comput. 16(1), 159–195 (2004)


  78. Zhang, J.: Nonparametric information geometry: from divergence function to referential-representational biduality on statistical manifolds. Entropy 15(12), 5384–5418 (2013)


  79. Zhang, J.: On monotone embedding in information geometry. Entropy 17(7), 4485–4499 (2015)


  80. Zhang, J., Matsuzoe, H.: Entropy, cross-entropy, relative entropy: deformation theory (a). Europhys. Lett. 134(1), 18001 (2021)


Author information

Corresponding author

Correspondence to Rob Brekelmans.

Ethics declarations

Conflict of interest

Frank Nielsen is currently a board member of the journal. He was not involved in the peer review or handling of the manuscript.

Additional information

Communicated by Jun Zhang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: the article was originally published without the “Data Availability” and “Conflict of Interest” statements.

Appendices

Summary of Appendix

In Appendix A, we review annealed importance sampling as an example MCMC technique. In Appendix B, we prove our main result (Theorem 1). We discuss parametric Bregman divergences and annealing paths between deformed exponential families from our representational perspective in Appendix C, where the \(\rho \)-affine property plays a crucial role. In Appendix D, we review the Eguchi relations and information-geometric structures induced by the rho–tau Bregman divergence and rho–tau Bregman information functionals (see Table 2). Finally, in Appendix E, we prove Theorem 2 showing that quasi-arithmetic paths in the \(\rho \)-representation are geodesics with respect to affine connections induced by the rho–tau Bregman divergence.

A Annealed importance sampling

We briefly present annealed importance sampling (AIS) [51] as a representative example of an MCMC method where the choice of annealing path can play a crucial role [34, 45]. AIS relies on similar insights as the Jarzynski equality in nonequilibrium thermodynamics [36, 37], and may be used to estimate (log) normalization constants or partition functions, or to sample from complex distributions.

More concretely, consider an initial distribution \(\pi _0({x}) \propto {\tilde{\pi }}_0({x})\) which is tractable to sample and is often chosen to have normalization constant \(\mathcal {Z}_0= 1\). We are interested in estimating the normalizing constant \(\mathcal {Z}_1 = \int {\tilde{\pi }}_1({x}) d{x}\) of a target distribution \(\pi _1({x}) \propto {\tilde{\pi }}_1({x})\), where only the unnormalized density \({\tilde{\pi }}_1\) is available. Since direct importance sampling with \(\pi _0({x})\) as the proposal may require prohibitive sample complexity to accurately estimate the normalization constant ratio \(\mathcal {Z}_1/\mathcal {Z}_0\) [13, 17], AIS decomposes the estimation problem into a sequence of easier subproblems using a path of intermediate distributions \(\{{\tilde{\pi }}_{\beta _t}({x})\}_{\beta _0=0}^{\beta _T=1}\) between the endpoints \({\tilde{\pi }}_0({x})\) and \({\tilde{\pi }}_1({x})\). Most often, the geometric averaging path is used,

$$\begin{aligned} {\tilde{\pi }}_{\beta _t} ({x}) = \frac{{{\tilde{\pi }}}_0({x})^{1-\beta _t} \, {{\tilde{\pi }}}_1( {x})^{\beta _t}}{\mathcal {Z}({\beta _t})} \quad \text {where} \quad \mathcal {Z}({\beta _t}) = \int {{\tilde{\pi }}}_0({x})^{1-\beta _t} \, {{\tilde{\pi }}}_1({x})^{\beta _t} d{x}. \end{aligned}$$
(49)

AIS proceeds by constructing a sequence of Markov transition kernels \({\mathcal {T}_t}({x}_{t+1}|{x}_{t})\) which leave \(\pi _{\beta _t}\) invariant, with \(\int \pi _{\beta _t}({x}_t) {\mathcal {T}_t}({x}_{t+1}|{x}_{t}) d{x}_t= \pi _{\beta _t} ({x}_{t+1})\). Commonly, this is achieved using kernels such as HMC or Langevin dynamics [50] which transform the samples, with Metropolis–Hastings accept-reject steps to ensure invariance. To interpret AIS as importance sampling in an extended state space [13, 51], we define the reverse kernel as \({\tilde{\mathcal {T}}_t}({x}_{t} | {x}_{t+1}) = \frac{\pi _{{\beta _t}}({x}_t){\mathcal {T}_t}({x}_{t+1} | {x}_{t}) }{\int \pi _{{\beta _t}}({x}_t) {\mathcal {T}_t}({x}_{t+1} | {x}_{t})d{x}_t}\). Using the invariance of \({\mathcal {T}_t}\), we observe that \(\pi _{{\beta _t}}({x}_{t+1}) {\tilde{\mathcal {T}}_t}({x}_{t} | {x}_{t+1}) = \pi _{{\beta _t}}({x}_t) {\mathcal {T}_t}({x}_{t+1} | {x}_{t})\).

To construct an estimator of \(\mathcal {Z}_1/\mathcal {Z}_0\) using AIS, we sample \({x}_0 \sim \pi _0({x})\), run the transition kernels in the forward direction to obtain samples \({x}_{1:T}\), and calculate the importance sampling weights along the path,

$$\begin{aligned} { w({x}_{0:T}) = \frac{{\tilde{\pi }}_1({x}_T) \prod \limits _{t=0}^{T-1} {\tilde{\mathcal {T}}_t}({x}_{t} | {x}_{t+1})}{{\tilde{\pi }}_0({x}_0) \prod \limits _{t=0}^{T-1} {\mathcal {T}_t}({x}_{t+1} | {x}_{t})} = \prod \limits _{t=1}^{T} \dfrac{{\tilde{\pi }}_{\beta _t}({x}_{t})}{{\tilde{\pi }}_{\beta _{t-1}}({x}_{t})} = \prod \limits _{t=1}^{T} \bigg ( \dfrac{{\tilde{\pi }}_1( {x}_{t})}{{\tilde{\pi }}_0({x}_{t})} \bigg )^{\beta _{t}-\beta _{t-1}}. } \end{aligned}$$
(50)

Note that we have used the above identity relating \({\mathcal {T}_t}\) and \({\tilde{\mathcal {T}}_t}\) in the second equality, and the definition of the geometric averaging path in the last equality.

Finally, it can be shown that \(w({x}_{0:T})\) provides an unbiased estimator of \(\mathcal {Z}_1/\mathcal {Z}_0\), with \(\mathbb {E}[w({x}_{0:T})] = \mathcal {Z}_1/\mathcal {Z}_0\) [51]. We can thus estimate the partition function ratio using the empirical average over K annealing chains, \(\mathcal {Z}_1/\mathcal {Z}_0 \approx \frac{1}{K} \sum _{k=1}^{K} w({x}_{0:T}^{(k)})\). We detail the complete AIS procedure in Alg. 1.
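To make the procedure concrete, the following is a minimal Python sketch of AIS along the geometric path of Eq. (49), using a toy one-dimensional Gaussian target and random-walk Metropolis–Hastings transition kernels; the specific densities, schedule, and step sizes are illustrative assumptions rather than settings from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Log of the endpoint densities: pi_0 = N(0, 1) is normalized (Z_0 = 1), while
# pi_1 is left unnormalized so that its normalizer Z_1 is the quantity to estimate.
def log_pi0(x):
    return -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)

def log_pi1_tilde(x):
    return -0.5 * ((x - 2.0) / 0.5) ** 2          # true Z_1 = sqrt(2*pi) * 0.5

def log_pi_beta(x, beta):
    # Unnormalized geometric annealing path of Eq. (49).
    return (1.0 - beta) * log_pi0(x) + beta * log_pi1_tilde(x)

def ais(num_chains=2000, num_steps=100, mh_steps=5, step_size=0.5):
    betas = np.linspace(0.0, 1.0, num_steps + 1)
    x = rng.standard_normal(num_chains)            # exact samples from pi_0
    log_w = np.zeros(num_chains)
    for t in range(1, num_steps + 1):
        # Accumulate the importance weights of Eq. (50) before moving the samples.
        log_w += (betas[t] - betas[t - 1]) * (log_pi1_tilde(x) - log_pi0(x))
        # Random-walk Metropolis-Hastings kernel leaving pi_{beta_t} invariant.
        for _ in range(mh_steps):
            proposal = x + step_size * rng.standard_normal(num_chains)
            log_accept = log_pi_beta(proposal, betas[t]) - log_pi_beta(x, betas[t])
            accept = np.log(rng.uniform(size=num_chains)) < log_accept
            x = np.where(accept, proposal, x)
    return log_w

log_w = ais()
print(np.mean(np.exp(log_w)))                      # estimate of Z_1 / Z_0
print(np.sqrt(2.0 * np.pi) * 0.5)                  # ground-truth Z_1 for this toy target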

AIS is considered among the gold standard methods for estimating normalization constants. Closely related MCMC methods involving path sampling [31] include Sequential Monte Carlo [21], which may involve resampling steps to prioritize higher-probability \({x}_t\), or parallel tempering [23], which runs T parallel sampling chains in order to obtain accurate samples from each \(\pi _{\beta _t}({x})\).

B Proof of Theorem 1

In the main text and below, we present and prove Theorem 1 in terms of scalar inputs and decomposable Bregman divergences. As we show in Appendix B.2, Theorem 3, a similar Bregman divergence-minimization interpretation of quasi-arithmetic means holds for vector-valued inputs, where the representation function is applied element-wise. Most commonly, vectorized divergences are constructed between parameter vectors \({\varvec{\theta }}\) of some (deformed) exponential family. However, we argue in Appendix C.1 that these cases are best understood using representations of unnormalized densities (as in Theorem 1) and the \(\rho \)-affine property of parametric families. Nevertheless, we provide a proof of Theorem 3 for completeness.

Theorem 1

(Rho–tau Bregman information) Consider a monotonic representation function \(\rho : {\mathcal {X}}_\rho \subset {\mathbb {R}}\mapsto {\mathcal {Y}}_\rho \subset {\mathbb {R}}\) and a convex function \(f: {\mathcal {Y}}_\rho \rightarrow \mathbb {R}\). Consider discrete mixture weights \({\varvec{\beta }}= \{\beta _i \}_{i=1}^N\) over N inputs \({\varvec{{\tilde{\pi }}}}(x) = \{ {\tilde{\pi }}_i(x) \}_{i=1}^N\), \({\tilde{\pi }}_i(x) \in {\mathcal {X}}_\rho \), with \(\sum _i \beta _i = 1\). Finally, assume the expected value \(\mu _{\rho }({\varvec{{\tilde{\pi }}}}, {\varvec{\beta }}) :=\sum _{i=1}^N \beta _i ~ \rho ({\tilde{\pi }}_i(x)) \in \textrm{ri}({\mathcal {Y}}_\rho )\) is in the relative interior of the range of \(\rho \) for all \(x \in {\mathcal {X}}\). Then, we have the following results,

  1. (i)

    For a given Bregman divergence \(D_{f}[\rho ({\tilde{\pi }}_a):\rho ({\tilde{\pi }}_b)]\) with generator \({\varPsi }_f\), the optimization

    $$\begin{aligned} {\mathcal {I}_{{f,\rho }}} \big ( {{\varvec{{\tilde{\pi }}}}, {\varvec{\beta }}} \big )&:=\min \limits _\mu \sum \limits _{i=1}^N \, \beta _i \, D_{f}\left[ \rho ({\tilde{\pi }}_i) : \rho (\mu ) \right] \, . \end{aligned}$$
    (16)

    has a unique minimizer given by the quasi-arithmetic mean with representation function \(\rho ({\tilde{\pi }})\)

    $$\begin{aligned} \mu ^*_{\rho }({\varvec{{\tilde{\pi }}}}, {\varvec{\beta }}) = \rho ^{-1}\left( \sum \limits _{i=1}^N \beta _i \, \rho \big ( {\tilde{\pi }}_i \big ) \right) = \mathop {\mathrm {arg\,min}}\limits \limits _{\mu } \sum \limits _{i=1}^N \, \beta _i \, D_{f}\left[ \rho ({\tilde{\pi }}_i) : \rho (\mu ) \right] . \end{aligned}$$
    (17)

    The arithmetic mean is recovered for \(\rho ({\tilde{\pi }}_i) = {\tilde{\pi }}_i\) and any f [7].

  2. (ii)

    At this minimizing argument \(\mu _\rho ^*\), the value of the expected divergence (or right-hand side of Eq. (16)) is called the rho–tau Bregman information and is equal to a gap in Jensen’s inequality for mixture weights \({\varvec{\beta }}\), inputs \(\rho ({\varvec{{\tilde{\pi }}}}) = \{ \rho ({\tilde{\pi }}_i) \}_{i=1}^N\), and the convex functional \({\varPsi }_f[\rho _{\tilde{\pi }}] = \int f\big ( \rho _{\tilde{\pi }}({x}) \big ) d{x}\),

    $$\begin{aligned} {\mathcal {I}_{{f,\rho }}} \big ( {{\varvec{{\tilde{\pi }}}}, {\varvec{\beta }}} \big ) = \sum \limits _{i=1}^N \beta _i \, {\varPsi _{f}}\big [ \rho ({{\tilde{\pi }}_i}) \big ] - {\varPsi _{f}}\big [ \rho (\mu ^*_{\rho }) \big ] . \end{aligned}$$
    (18)
  3. (iii)

    Using \(\mu \ne \mu ^*_{\rho }\) as the representative in Eq. (16), the suboptimality gap is a rho–tau Bregman divergence

    $$\begin{aligned} D_{f}\big [ \rho \big ( \mu _\rho ^* \big ) : \rho \big ( \mu \big )\big ] = \sum \limits _{i=1}^N \beta _i ~ D_{f}\left[ \rho \big ( {\tilde{\pi }}_i \big ) : \rho \big ( \mu \big ) \right] - {\mathcal {I}_{{f,\rho }}}\big ( {\varvec{{\tilde{\pi }}}}, {\varvec{\beta }} \big ) , \end{aligned}$$
    (19)

    where \({\mathcal {I}_{{f,\rho }}}\big ( {\varvec{{\tilde{\pi }}}}, {\varvec{\beta }} \big )\) is evaluated at \(\mu _\rho ^*\) as in Eq. (18).

Proof

(ii): We first show the optimal representative \(\mu ^{*} = \rho ^{-1}\big ( \sum _{i=1}^N \beta _i \, \rho ({\tilde{\pi }}_i)\big )\) yields a Jensen diversity in (ii), before proving this choice is the unique minimizing argument.

Expanding the expected divergence in Eq. (16) for \(\mu = {\mu }^*\), we have \(\sum _{i=1}^N \beta _i D_f [ \rho ({\tilde{\pi }}_i): \rho \big ( {\mu }^* \big ) ] = \sum _{i=1}^N \beta _i {\varPsi _{f}}[\rho ({\tilde{\pi }}_i)] - {\varPsi _{f}}[\rho ({\mu }^*)]\) \(- \int \big ( \sum _{i=1}^N \beta _i \rho ({\tilde{\pi }}_i) - \rho ({\mu }^*) \big ) \tau ({\mu }^*) d{x}\). Since \(\sum _i \beta _i \rho ({\tilde{\pi }}_i) = \rho ({\mu }^*)\), the final term cancels to yield

$$\begin{aligned} \sum \limits _{i=1}^N \beta _i D_f\left[ \rho ({\tilde{\pi }}_i) : \rho \big ( {\mu }^* \big ) \right]&= \sum \limits _{i=1}^N \beta _i {\varPsi _{f}}[\rho ({\tilde{\pi }}_i)] - {\varPsi _{f}}[\rho ({\mu }^*)]. \end{aligned}$$
(51)

(i): For any other representative \(\mu \), we write the difference in expected divergence and use Eq. (51) to simplify,

$$\begin{aligned} \sum \limits _{i=1}^N \beta _i \, D_{f}\left[ \rho ({\tilde{\pi }}_i) : \rho (\mu ) \right] - \sum \limits _{i=1}^N \beta _i \, D_{f}\left[ \rho ({\tilde{\pi }}_i) : \rho ({\mu }^*) \right]&= {\varPsi _{f}}[\rho ({\mu }^*)] - {\varPsi _{f}}[\rho (\mu )] - \int \Big ( \sum \limits _{i=1}^N \beta _i \rho ({\tilde{\pi }}_i) - \rho (\mu ) \Big ) \tau (\mu ) d{x}\end{aligned}$$
(52)
$$\begin{aligned}&= {\varPsi _{f}}[\rho ({\mu }^*)] - {\varPsi _{f}}[\rho (\mu )] - \int \Big ( \rho ({\mu }^*) - \rho (\mu ) \Big ) \tau (\mu ) d{x}\end{aligned}$$
(53)
$$\begin{aligned}&= D_{f}\big [ \rho \big ( {\mu }^* \big ) : \rho \big ( \mu \big ) \big ] \ge 0 , \end{aligned}$$
(54)

where we note that \(\rho ({\mu }^*) = \sum _{i=1}^N \beta _i \rho ({\tilde{\pi }}_i)\). The rho-tau divergence is minimized if and only if \(\rho ({\mu }^*) = \rho (\mu )\) [77], thus proving (i).

(iii): Finally, we can express the suboptimality gap in Eq. (52) or rho–tau Bregman divergence in Eq. (54) as the gap in a conjugate optimization. Considering the conjugate expansion of \({\varPsi _{f}}[\rho ({{\mu }^*})]\), we have

$$\begin{aligned} \hspace{-.2cm} {\varPsi _{f}}[\rho ({\mu }^*)]&= \sup \limits _{\tau (\mu )} \int \rho \big ( {{\mu }^*}(x) \big ) \tau (\mu (x)) d{x}- {\varPsi ^*_{f^*}}[\tau (\mu )]\nonumber \\&\ge \int \rho {\big ( {\mu }^*({x}) \big )} \tau \big ( \mu ({x}) \big ) d{x}- {\varPsi ^*_{f^*}}[\tau ({\mu })] \end{aligned}$$
(55)

for any choice of \(\tau ({\mu })\). This provides a lower bound on \({\varPsi _{f}}[\rho ({\mu }^*)]\), where the gap in the lower bound is the canonical form of the Bregman divergence. Indeed, substituting \({\varPsi ^*_{f^*}}[\tau ({\mu })] = \int \rho (\mu ({x})) \tau (\mu ({x})) d{x}- {\varPsi _{f}}[\rho ({\mu })]\) in Eq. (52), we have

$$\begin{aligned} D_f\big [&\rho \big ( \mu ^*({\varvec{{\tilde{\pi }}}},{\varvec{\beta }}) \big ) : \rho \big ( \mu \big )\big ] = {\varPsi _{f}}[\rho ({\mu }^*)] + {\varPsi ^*_{f^*}}[\tau ({\mu })] - \int \rho \big ( {\mu }^*({x}) \big ) \tau \big ( \mu ({x}) \big ) d{x}\nonumber \\&={\varPsi _{f}}[\rho ({\mu }^*)] - {\varPsi _{f}}[\rho (\mu )] - \int \Big ( \rho \big ( {\mu }^*({x}) \big ) - \rho \big ( \mu ({x}) \big )\Big ) \tau \big ( \mu ({x}) \big ) d{x}\, . \end{aligned}$$

\(\square \)
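As a concrete sanity check of part (i), the following minimal Python sketch takes \(\rho = \log \) and \(f = \exp \), so that \(D_f\) reduces to a scalar divergence of extended KL type, and verifies numerically that the expected divergence is minimized at the weighted geometric mean; the particular inputs and the use of scipy are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize_scalar

# Decomposable Bregman divergence for rho = log and f = exp:
# D_f[rho(a) : rho(b)] = f(rho(a)) - f(rho(b)) - (rho(a) - rho(b)) f'(rho(b))
#                      = a - b - b * log(a / b).
def bregman(a, b):
    return a - b - b * np.log(a / b)

pis = np.array([0.5, 2.0, 8.0])          # scalar density values pi_i(x) at a fixed x
betas = np.array([0.2, 0.3, 0.5])        # mixture weights summing to one

def expected_divergence(mu):
    return np.sum(betas * bregman(pis, mu))

# Quasi-arithmetic mean under rho = log, i.e. the weighted geometric mean.
mu_star = np.exp(np.sum(betas * np.log(pis)))
result = minimize_scalar(expected_divergence, bounds=(1e-3, 20.0), method='bounded')
print(mu_star, result.x)                 # the closed form and numerical minimizer agree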

B.1 Interpretations of Theorem 1(iii)

Conjugate optimizations which treat f-divergences as convex functions of one argument are popular for providing variational lower bounds on divergences [52, 62] or min-max optimizations for adversarial training [56, 58]. Note, however, that this proof provides a variational upper bound on the Bregman information, which includes the Jensen–Shannon divergence (Example 3) and mutual information (Banerjee et al. [7, Ex. 6]) as examples. To our knowledge, this upper bound has not been used extensively in the literature.

The equality in Eq. (19) can also be interpreted as a generalized bias-variance tradeoff for Bregman divergences [1, 60].

B.2 Rho–tau Bregman information with vector-valued inputs

A more standard setting is to consider a finite-dimensional Bregman divergence over a vector of inputs, such as the natural parameters \({\varvec{\theta }}\) of a (deformed) exponential family \({\tilde{\pi }}_{{\varvec{\theta }}}^{(q)}({x}) = g({x})\exp _q \{ \langle {\varvec{\theta }}, {\varvec{T}}({x})\rangle \}\). However, we argue that this setting is best captured in our representational framework (see Appendix C), using \(\rho ({\tilde{\pi }}_{{\varvec{\theta }}}^{(q)}(x)) = \log _q \frac{{\tilde{\pi }}_{{\varvec{\theta }}}^{(q)}(x)}{g(x)} = \langle {\varvec{\theta }}, {\varvec{T}}({x})\rangle = \sum _{j=1}^d \theta ^j T^j(x)\) and the \(\rho \)-linearity of the density with respect to the appropriate base measure.

Nevertheless, we would also like to extend Theorem 1 to hold for N vector-valued, d-dimensional inputs.

Theorem 3

Consider a collection of inputs \({\varvec{u}} = \{ {\varvec{u}}_i \}_{i=1}^N\) where \({\varvec{u}}_{i} = \{u_{i}^1,..., u_{i}^j,..., u_{i}^d \}_{j=1}^d\). In this case, consider applying the monotonic representation function \(\rho : {\mathcal {X}}_\rho ^d \subset \mathbb {R}^d \rightarrow {\mathcal {Y}}_\rho ^d \subset \mathbb {R}^d\) elementwise \(\rho ({\varvec{u}}_i) :=\{\rho (u_{i}^1),..., \rho (u_{i}^j),..., \rho (u_{i}^d) \}_{j=1}^d\). For a convex generating function \(F: {\mathcal {Y}}_\rho ^d \subset \mathbb {R}^d \rightarrow \mathbb {R}\), define the Bregman divergence as

$$\begin{aligned} D_F[\rho ({\varvec{u}}_a) : \rho ({\varvec{u}}_b)] = F\big ( \rho ({\varvec{u}}_a) \big ) - F\big ( \rho ({\varvec{u}}_b) \big ) - \langle \rho ({\varvec{u}}_a) - \rho ({\varvec{u}}_b), \nabla _{\rho } F\big ( \rho ({\varvec{u}}_b) \big ) \rangle \end{aligned}$$

where the inner product sums over dimensions \(1 \le j \le d\). Using an analogous definition of a conjugate representation as in Sect. 2.2, we have \(\tau ({\varvec{u}}) = \nabla _{\rho } F(\rho ({\varvec{u}}))\) with \(\tau (u^j)= \frac{\partial }{\partial (\rho ({\varvec{u}}))^j} F(\rho ({\varvec{u}})) \).

Finally, consider discrete mixture weights \({\varvec{\beta }}= \{\beta _i \}_{i=1}^N\) with \(\sum _i \beta _i = 1\), and assume the expected value \(\mu _{\rho }({\varvec{u}}, {\varvec{\beta }}) :=\sum _{i=1}^N \beta _i ~ \rho ({\varvec{u}}_i) \in \textrm{ri}({\mathcal {Y}}_\rho ^d)\) is in the relative interior of the range of \(\rho \). Then, we have the following results,

  1. (i)

    For a Bregman divergence with generator F, the optimization

    $$\begin{aligned} {\mathcal {I}_{{F,\rho }}} \big ( {{\varvec{u}}, {\varvec{\beta }}} \big )&:=\min \limits _\mu \sum \limits _{i=1}^N \, \beta _i \, D_F \left[ \rho ({\varvec{u}}_i) : \rho ({\varvec{\mu }}) \right] . \end{aligned}$$
    (56)

    has a unique minimizer given by the quasi-arithmetic mean with the representation function \(\rho \) applied elementwise

    $$\begin{aligned} {\varvec{\mu }}_\rho ^*({\varvec{u}}, {\varvec{\beta }}) = \rho ^{-1}\left( \sum \limits _{i=1}^N \beta _i \, \rho \big ( {\varvec{u}}_i \big ) \right) = \mathop {\mathrm {arg\,min}}\limits \limits _{{\varvec{\mu }}} \sum \limits _{i=1}^N \, \beta _i \, D_F \left[ \rho ({\varvec{u}}_i) : \rho ({\varvec{\mu }}) \right] . \end{aligned}$$

    The arithmetic mean is recovered for \(\rho ({\varvec{u}}_i) = {\varvec{u}}_i\) and any F [7].

  2. (ii)

    At this minimizing argument, the value of the expected divergence in Eq. (56) is called the rho–tau Bregman information and is equal to a gap in Jensen’s inequality for the convex function F, mixture weights \({\varvec{\beta }}\), and inputs \({\varvec{u}}= \{ {\varvec{u}}_i\}_{i=1}^N\),

    $$\begin{aligned} {\mathcal {I}_{{F,\rho }}} \big ( {{\varvec{u}}, {\varvec{\beta }}} \big ) = \sum \limits _{i=1}^N \beta _i \, F\big ( \rho ({\varvec{u}}_i) \big ) - F\big ( \rho ({\varvec{\mu }}_\rho ^*) \big ) . \end{aligned}$$
    (57)
  3. (iii)

    Using \({\varvec{\mu }} \ne {\varvec{\mu }}_\rho ^*({\varvec{u}}, {\varvec{\beta }})\) as the representative in Eq. (56), the suboptimality gap is a rho-tau Bregman divergence

    $$\begin{aligned} D_F\big [ \rho \big ( {\varvec{\mu }}_\rho ^* \big ) : \rho \big ( {\varvec{\mu }} \big )\big ]&= \sum \limits _{i=1}^N \beta _i ~ D_F\left[ \rho \big ( {\varvec{u}}_i \big ) : \rho \big ( {\varvec{\mu }} \big ) \right] - {\mathcal {I}_{{F,\rho }}}\big ( {\varvec{u}}, {\varvec{\beta }} \big ) . \end{aligned}$$
    (58)

Proof

(ii):   Again, we start by showing that the optimal representative \({\varvec{\mu }}_\rho ^{*} = \rho ^{-1}\big ( \sum _{i=1}^N \beta _i \, \rho ({\varvec{u}}_i)\big )\) yields a Jensen diversity in (ii). Expanding the expected divergence in Eq. (56) for \({\varvec{\mu }} = {\varvec{\mu }}_\rho ^{*}\), we have \(\sum _{i=1}^N \beta _i D_F [ \rho ({\varvec{u}}_i): \rho \big ( {\varvec{\mu }}_\rho ^* \big ) ] = \sum _{i=1}^N \beta _i F\big ( \rho ({\varvec{u}}_i) \big ) - F\big ( \rho ({\varvec{\mu }}_\rho ^*) \big )\) \(- \langle \sum _{i=1}^N \beta _i \rho ({\varvec{u}}_i) - \rho ({\varvec{\mu }}_\rho ^*), \tau ({\varvec{\mu }}_\rho ^*) \rangle \). Since \(\sum _i \beta _i \rho ({\varvec{u}}_i) = \rho ({\varvec{\mu }}_\rho ^*)\), the final term cancels to yield

$$\begin{aligned} \sum \limits _{i=1}^N \beta _i D_F\left[ \rho ({\varvec{u}}_i) : \rho \big ( {\varvec{\mu }}_\rho ^* \big ) \right]&= \sum _{i=1}^N \beta _i F\big ( \rho ({\varvec{u}}_i) \big ) - F\big ( \rho ({\varvec{\mu }}_\rho ^*) \big ) . \end{aligned}$$
(59)

(i, iii):   Writing the difference in expected divergence for a suboptimal representative \({\varvec{\mu }}\) and using Eq. (59), we have

$$\begin{aligned} \sum \limits _{i=1}^N \beta _i \, D_F\left[ \rho ({\varvec{u}}_i) : \rho ({\varvec{\mu }}) \right] - {\mathcal {I}_{{F,\rho }}} \big ( {{\varvec{u}}, {\varvec{\beta }}} \big )&= F\big ( \rho ({\varvec{\mu }}_\rho ^*) \big ) - F\big ( \rho ({\varvec{\mu }}) \big ) - \langle \rho ({\varvec{\mu }}_\rho ^*) - \rho ({\varvec{\mu }}), \tau ({\varvec{\mu }}) \rangle \nonumber \\&= D_F\big [ \rho \big ( {\varvec{\mu }}_\rho ^* \big ) : \rho \big ( {\varvec{\mu }} \big ) \big ] \ge 0 . \end{aligned}$$
(60)

The rho–tau divergence is minimized iff \(\rho ({\varvec{\mu }}_\rho ^*) = \rho ({\varvec{\mu }})\) [77], thus proving (i). \(\square \)

C Parametric Bregman divergence and annealing paths within (deformed) exponential families

Consider a q-exponential family with a d-dimensional natural parameter vector \({\varvec{\theta }}\in \varTheta \subset {\mathbb {R}}^d\), sufficient statistic vector \({\varvec{T}}({x})\), and base density \(g({x})\),

$$\begin{aligned} \pi ^{(q)}_{{\varvec{\theta }}}({x})&= \frac{1}{\mathcal {Z}_q({\varvec{\theta }})} g({x})\exp _q \{ \langle {\varvec{\theta }}, {\varvec{T}}({x})\rangle \} \qquad \nonumber \\ \text {where} \quad \mathcal {Z}_q({\varvec{\theta }})&= \int g({x})\exp _q \{ \langle {\varvec{\theta }}, {\varvec{T}}({x})\rangle \} d{x}. \end{aligned}$$
(61)

We let \(\tilde{\pi }^{(q)}_{{\varvec{\theta }}}({x}) = g({x})\exp _q \{ \langle {\varvec{\theta }}, {\varvec{T}}({x})\rangle \}\) denote the unnormalized density, often abbreviating to \({\tilde{\pi }}_{{\varvec{\theta }}}({x})\) for convenience.

From the convexity of \(\exp _q\), it can be shown that the normalization constant \(\mathcal {Z}_q({\varvec{\theta }})\) is a convex function of the parameters \({\varvec{\theta }}\), with first derivative

$$\begin{aligned} \frac{\partial \mathcal {Z}_q({\varvec{\theta }})}{\partial \theta ^j}&= \int g({x})[1 + (1-q) {\varvec{\theta }}\cdot {\varvec{T}}({x})]^{\frac{q}{1-q}} \cdot T^j({x}) d{x}\nonumber \\ {}&= \int g({x})^{1-q} {\tilde{\pi }}_{{\varvec{\theta }}}({x})^{q} \cdot T^j({x}) d{x}\end{aligned}$$
(62)

Parametric Interpretation of Amari \(\alpha \)-divergence We now show that the Bregman divergence induced by \(\frac{1}{q} \mathcal {Z}_q({\varvec{\theta }})\), for \(q > 0\), corresponds to the Amari \(\alpha \)-divergence between parametric unnormalized densities,

$$\begin{aligned}&D_{\frac{1}{q} \mathcal {Z}_q}[{\varvec{\theta }}^{\prime }: {\varvec{\theta }}] = \frac{1}{q} \mathcal {Z}_q({\varvec{\theta }}^{\prime }) - \frac{1}{q} \mathcal {Z}_q({\varvec{\theta }}) - \langle \nabla \frac{1}{q} \mathcal {Z}_q({\varvec{\theta }}), {\varvec{\theta }}^{\prime }- {\varvec{\theta }}\rangle \nonumber \\&\qquad =\frac{1}{q} \mathcal {Z}_q({\varvec{\theta }}^{\prime }) - \frac{1}{q} \mathcal {Z}_q({\varvec{\theta }}) - \frac{1}{q}\int g({x})^{1-q} {\tilde{\pi }}_{{\varvec{\theta }}}({x})^{q} \Big ( \langle {\varvec{\theta }}^{\prime }, {\varvec{T}}({x})\rangle - \langle {\varvec{\theta }}, {\varvec{T}}({x})\rangle \Big ) d{x}\nonumber \\&\qquad \overset{(1)}{=}\ \frac{1}{q} \mathcal {Z}_q({\varvec{\theta }}^{\prime }) - \frac{1}{q} \mathcal {Z}_q({\varvec{\theta }}) - \frac{1}{q} \int g({x})^{1-q} {\tilde{\pi }}_{{\varvec{\theta }}}({x})^{q} \Big ( \log _q \frac{\tilde{\pi }_{{\varvec{\theta }}^{\prime }}({x})}{g({x})} - \log _q \frac{{\tilde{\pi }}_{{\varvec{\theta }}}({x})}{g({x})} \Big ) d{x}\nonumber \\&\qquad = \frac{1}{q} \mathcal {Z}_q({\varvec{\theta }}^{\prime }) - \frac{1}{q} \mathcal {Z}_q({\varvec{\theta }}) - \frac{1}{q (1-q)} \int g({x})^{1-q} {\tilde{\pi }}_{{\varvec{\theta }}}({x})^{q} \left( \Big ( \frac{\tilde{\pi }_{{\varvec{\theta }}^{\prime }}({x})}{g({x})} \Big )^{1-q} - \Big ( \frac{{\tilde{\pi }}_{{\varvec{\theta }}}({x})}{g({x})} \Big )^{1-q} \right) d{x}\nonumber \\&\qquad = \frac{1}{q} \mathcal {Z}_q({\varvec{\theta }}^{\prime }) - \frac{1}{q} \mathcal {Z}_q({\varvec{\theta }}) - \frac{1}{q (1-q)} \int \tilde{\pi }_{{\varvec{\theta }}^{\prime }}({x})^{1-q} {\tilde{\pi }}_{{\varvec{\theta }}}({x})^{q} d{x}+ \frac{1}{q(1-q)} \int {\tilde{\pi }}_{{\varvec{\theta }}}({x}) d{x}\nonumber \\&\qquad = \frac{1}{q}\int \tilde{\pi }_{{\varvec{\theta }}^{\prime }}({x}) d{x}+ \frac{1}{1-q} \int {\tilde{\pi }}_{{\varvec{\theta }}}({x}) d{x}- \frac{1}{q(1-q)} \int \tilde{\pi }_{{\varvec{\theta }}^{\prime }}({x})^{1-q} {\tilde{\pi }}_{{\varvec{\theta }}}({x})^{q} d{x}\nonumber \\&\qquad = D_{A}^{(q)}[\tilde{\pi }_{{\varvec{\theta }}^{\prime }} : {\tilde{\pi }}_{{\varvec{\theta }}} ] \end{aligned}$$
(63)

where in (1) we use the fact that \(\log _q \frac{{\tilde{\pi }}_{{\varvec{\theta }}}({x})}{g({x})} = \langle {\varvec{\theta }}, {\varvec{T}}({x})\rangle \).

For \(q=1\) and the exponential family, we have \(\mathcal {Z}({\varvec{\theta }}) = \int {\tilde{\pi }}_{{\varvec{\theta }}}({x}) d{x}\) and \(\frac{\partial \mathcal {Z}({\varvec{\theta }})}{\partial \theta ^j} = \int {\tilde{\pi }}_{{\varvec{\theta }}}({x}) T^j({x}) d{x}\), which leads to the Bregman divergence

$$\begin{aligned} D_{Z}[{\varvec{\theta }}^{\prime }: {\varvec{\theta }}]&= D_{\text {KL}}[{\tilde{\pi }}_{{\varvec{\theta }}}({x}) : {\tilde{\pi }}_{{\varvec{\theta }}^{\prime }}({x})]. \end{aligned}$$

By contrast, the divergence generated by the log partition function \(\log \mathcal {Z}({\varvec{\theta }})\) yields the KL divergence between normalized distributions. Using \(\frac{\partial }{\partial \theta ^j} \log \mathcal {Z}({\varvec{\theta }}) = \frac{1}{\mathcal {Z}({\varvec{\theta }})} \int {\tilde{\pi }}_{{\varvec{\theta }}}({x})T^j({x}) d{x}= \int \pi _{{\varvec{\theta }}}({x}) T^j({x}) d{x}\), we recover the well-known result [66]

$$\begin{aligned} D_{\log \mathcal {Z}}[{\varvec{\theta }}^{\prime }: {\varvec{\theta }}]&= D_{\text {KL}}[\pi _{{\varvec{\theta }}}({x}) : \pi _{{\varvec{\theta }}^{\prime }}({x})]. \end{aligned}$$

C.1 Parametric divergence in \(\rho = \log _q\) representation

Alternatively, we may view the parametric divergence \(D_{\frac{1}{q} \mathcal {Z}_q}[{\varvec{\theta }}^{\prime }: {\varvec{\theta }}]\) as a decomposable divergence in the \(\log _q\)-representation \(\rho ({\tilde{\pi }}_{{\varvec{\theta }}}^{(q)}) = \log _q \frac{{\tilde{\pi }}_{{\varvec{\theta }}}^{(q)}(x)}{g(x)}\), where

$$\begin{aligned} \begin{aligned} \rho ({\tilde{\pi }}_{{\varvec{\theta }}}^{(q)})&= \log _q \frac{{\tilde{\pi }}_{{\varvec{\theta }}}^{(q)}({x})}{g({x})}= \langle {\varvec{\theta }}, {\varvec{T}}({x})\rangle \qquad f(\rho ) = \frac{1}{q} \left( \exp _q\{\rho \}-\rho -1 \right) \qquad \\ {\varPsi _{f}}\big [ \rho _{{\tilde{\pi }}} \big ]&= - \frac{1}{q} \int \langle {\varvec{\theta }}, {\varvec{T}}({x})\rangle d{x}+ \frac{1}{q} \mathcal {Z}_q({\varvec{\theta }}) - \frac{1}{q}. \end{aligned} \end{aligned}$$
(64)

Note, the order of the \(\alpha \)-divergence is set by the deformation parameter q in the definition of the representation function \(\rho _q\) or q-exponential family. In Sect. 4.2 Example 11, we have used this Bregman divergence minimization to interpret the q-paths between arbitrary endpoints from a parametric perspective.

This interpretation also suggests that the form of the deformed family in Eq. (61), particularly its \(\log _q\)-linearity in \({\varvec{\theta }}\), is sufficient to derive the parametric divergence as a non-parametric rho-tau divergence using Eq. (64).

Due to this generality of the non-parametric perspective, we advocate viewing parametric annealing paths within the same q-exponential family through the lens of Theorem 1 instead of the vector-valued perspective in Theorem 3. Indeed, when annealing between parametric deformed exponential family endpoints in Example 15, the quasi-arithmetic mean in the \(\rho = \log _q\) representation (using Theorem 1) suggests linear or arithmetic mixing of the natural parameters. This interpretation is more natural, and analogous to the exponential family case, compared to taking the quasi-arithmetic mean of the parameter vectors \({\varvec{\theta }}\) directly (using Theorem 3).

C.2 Annealing paths between (deformed) exponential family endpoint densities

The above Bregman divergences can be used to analyze annealing paths in the special case where the endpoint densities \({\tilde{\pi }}_{{\varvec{\theta }}_0}\) and \({\tilde{\pi }}_{{\varvec{\theta }}_1}\) belong to the same (deformed) exponential family in Eq. (61).

Example 15

(Annealing within (deformed) exponential families) Due to the \(\rho \)-affine property of deformed exponential families, it is natural to consider the \(\rho ({\tilde{\pi }}) = \log _q {\tilde{\pi }}\) path within the \(\exp _q\) family.

Ignoring the normalization constant, the unnormalized density with respect to \(g({x})\) is linear in \({\varvec{\theta }}\) after applying the \(\rho (\tilde{\pi }_{{\varvec{\theta }}}) = \log _q \frac{\tilde{\pi }_{{\varvec{\theta }}}({x})}{g({x})}\) representation function, with \(\log _q \frac{{\tilde{\pi }}_{{\varvec{\theta }}}({x})}{g({x})} = \langle {\varvec{\theta }}, {\varvec{T}}({x})\rangle .\) Since the quasi-arithmetic mean also has this \(\rho \)-affine property (Eq. 5), we can see that the q-path or geometric path between (deformed) exponential family endpoints \({\tilde{\pi }}_{{\varvec{\theta }}_0}({x})\) and \({\tilde{\pi }}_{{\varvec{\theta }}_1}({x})\) is simply a linear interpolation in the natural parameters

$$\begin{aligned} {\varvec{\theta }}_\beta&= (1-\beta ) \, {\varvec{\theta }}_0 + \beta \, {\varvec{\theta }}_1 = \mathop {\mathrm {arg\,min}}\limits \limits _{{\varvec{\theta }}_r} ~ (1-\beta ) D_{\frac{1}{q}\mathcal {Z}_q}[{\varvec{\theta }}_0 : {\varvec{\theta }}_r] + \beta ~ D_{\frac{1}{q}\mathcal {Z}_q}[{\varvec{\theta }}_1 : {\varvec{\theta }}_r]. \end{aligned}$$

which recovers the KL divergence and the standard exponential family for \(q=1\).
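For instance, for univariate Gaussian endpoints \(\pi _{{\varvec{\theta }}_0} = \mathcal {N}(m_0, \sigma _0^2)\) and \(\pi _{{\varvec{\theta }}_1} = \mathcal {N}(m_1, \sigma _1^2)\) with sufficient statistics \({\varvec{T}}({x}) = (x, x^2)\) and natural parameters \({\varvec{\theta }}= (m/\sigma ^2, \, -1/(2\sigma ^2))\), the linear interpolation \({\varvec{\theta }}_\beta = (1-\beta ) {\varvec{\theta }}_0 + \beta \, {\varvec{\theta }}_1\) amounts to precision-weighted averaging,

$$\begin{aligned} \frac{1}{\sigma _\beta ^2} = \frac{1-\beta }{\sigma _0^2} + \frac{\beta }{\sigma _1^2}, \qquad \frac{m_\beta }{\sigma _\beta ^2} = (1-\beta ) \frac{m_0}{\sigma _0^2} + \beta \, \frac{m_1}{\sigma _1^2}, \end{aligned}$$

so the geometric path between Gaussian endpoints remains Gaussian, with interpolated precision.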

Example 16

(Moment averaging path of Grosse et al. [34]) In the case of the standard exponential family,

$$\begin{aligned} \begin{aligned} \pi _{{\varvec{\theta }}}({x})&= g({x})\exp \{ \langle {\varvec{\theta }}, {\varvec{T}}({x})\rangle - \psi ({\varvec{\theta }}) \} \,\, \text {with} \,\, \psi ({\varvec{\theta }}) = \log \int g({x})\exp \{ \langle {\varvec{\theta }}, {\varvec{T}}({x})\rangle \} d{x}, \end{aligned} \end{aligned}$$

Grosse et al. [34] propose the moment averaging path, which uses the dual parameter mapping \(\rho ({\varvec{\theta }}) = {\varvec{\eta }}({\varvec{\theta }}) = \mathbb {E}_{\pi _{{\varvec{\theta }}}}\left[ {\varvec{T}}({x})\right] \) as a representation function for the quasi-arithmetic mean,

$$\begin{aligned} {\varvec{\eta }}({\varvec{\theta }}_\beta )&= (1-\beta ) \, {\varvec{\eta }}({\varvec{\theta }}_0) + \beta \, {\varvec{\eta }}({\varvec{\theta }}_1) =\mathop {\mathrm {arg\,min}}\limits \limits _{{\varvec{\eta }}_r} (1-\beta ) D_{\psi ^*}[{\varvec{\eta }}_0 : {\varvec{\eta }}_r] + \beta ~ D_{\psi ^*}[{\varvec{\eta }}_1 : {\varvec{\eta }}_r] \end{aligned}$$

for an appropriate dual divergence based on the dual of the log partition function \(\psi ^*({\varvec{\eta }}({\varvec{\theta }})) = D_{KL}[\pi _{{\varvec{\theta }}}(x): g(x)]\) (see [34]). While Grosse et al. [34] show performance gains using the moment averaging path, additional sampling procedures may be required to find \({\varvec{\theta }}_{\beta }\) via the inverse mapping \({\varvec{\eta }}^{-1}(\cdot )\).
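For univariate Gaussian endpoints, for example, averaging the moments \({\varvec{\eta }}= (\mathbb {E}[x], \mathbb {E}[x^2])\) yields

$$\begin{aligned} m_\beta = (1-\beta ) \, m_0 + \beta \, m_1, \qquad \sigma _\beta ^2 = (1-\beta ) \, \sigma _0^2 + \beta \, \sigma _1^2 + \beta (1-\beta )(m_1 - m_0)^2, \end{aligned}$$

so that, in contrast to the geometric path above, the intermediate variance is inflated according to the separation of the endpoint means.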

D Information geometry of rho–tau divergences

We next review results from [78] describing the statistical manifolds induced by rho–tau Bregman information functionals. We summarize using our notation in Table 2.

Eguchi relations The seminal Eguchi relations [26, 27] describe the statistical manifold structure \((\mathcal {M}, g, \nabla , \nabla ^*)\) induced by a divergence \(D[{\pi _a}:{\pi _b}]\). We first consider a manifold of parametric densities \(\mathcal {M}\) represented by a coordinate system \({\varvec{\theta }}(\pi ): \mathcal {M}_{{\varvec{\theta }}} \mapsto \varTheta \subset \mathbb {R}^N\), with \(\partial _i = \frac{\partial }{\partial \theta ^i}\) as a basis for the tangent space. The Riemannian metric is written \(g_{ij}({\varvec{\theta }}) = \langle \partial _i, \partial _j \rangle \), while the affine connection or covariant derivative is expressed using the scalar Christoffel symbols \(\varGamma _{ij,k}({\varvec{\theta }}) = \langle \nabla _{\partial _i} \partial _j, \partial _k \rangle \) [4, 54]. For a given divergence, taking the second and third order differentials yields the following metric and conjugate pair of affine connections

$$\begin{aligned} g_{ij}({\varvec{\theta }}) = - ( {\partial _i})_{{\pi _{{\varvec{\theta }}_a}}} ( {\partial _j} )_{\pi _{{\varvec{\theta }}_b}}&D[{\pi _{{\varvec{\theta }}_a}}:{\pi _{{\varvec{\theta }}_b}}] \Big |_{{\pi _{{\varvec{\theta }}_a}}={\pi _{{\varvec{\theta }}_b}}} \end{aligned}$$
(65)
$$\begin{aligned} \varGamma _{ij,k}({\varvec{\theta }}) = - ( {\partial _i})_{\pi _{{\varvec{\theta }}_a}}( {\partial _j})_{\pi _{{\varvec{\theta }}_a}}( {\partial _k})_{\pi _{{\varvec{\theta }}_b}}&D[{\pi _{{\varvec{\theta }}_a}}:{\pi _{{\varvec{\theta }}_b}}] \Big |_{{\pi _{{\varvec{\theta }}_a}}={\pi _{{\varvec{\theta }}_b}}} \end{aligned}$$
(66)
$$\begin{aligned} \varGamma ^*_{ij,k}({\varvec{\theta }}) = - ( {\partial _i})_{\pi _{{\varvec{\theta }}_b}}( {\partial _j})_{\pi _{{\varvec{\theta }}_b}}( {\partial _k})_{\pi _{{\varvec{\theta }}_a}}&D[{\pi _{{\varvec{\theta }}_a}}:{\pi _{{\varvec{\theta }}_b}}] \Big |_{{\pi _{{\varvec{\theta }}_a}}={\pi _{{\varvec{\theta }}_b}}} \end{aligned}$$
(67)

where \(({\partial _j})_{{\pi _{{\varvec{\theta }}_a}}}\) indicates partial differentiation with respect to the parameter \(\theta _{a}^j\) with index j of the first argument.

Statistical manifold from rho–tau divergence Following Zhang [77, 78], viewing \(D_{f,\rho }^{(\beta )}[{\tilde{\pi }}_0: {\tilde{\pi }}_1] :=\frac{1}{\beta (1-\beta )} \, {\mathcal {I}_{{f,\rho }}} \big ( {{\varvec{{\tilde{\pi }}}}, {\varvec{\beta }}} \big )\) in Eq. (22) as a divergence functional yields the following Riemannian metric and primal affine connection (expressed using the Christoffel symbols \(\varGamma _{ij,k}({\varvec{\theta }})\)),

$$\begin{aligned} \hspace{-.2cm} g_{ij}({\varvec{\theta }})&= \int \frac{\partial \rho _{\pi _{{\varvec{\theta }}}}({x})}{\partial \theta ^i} \frac{\partial \tau _{\pi _{{\varvec{\theta }}}}({x})}{\partial \theta ^j} d{x}= \int {\rho ^{\prime }_{\pi _{{\varvec{\theta }}}}({x}) \tau ^{\prime }_{\pi _{{\varvec{\theta }}}}({x})}\frac{\partial \pi _{{\varvec{\theta }}}({x})}{\partial \theta ^i}\frac{\partial \pi _{{\varvec{\theta }}}({x})}{\partial \theta ^j} d{x}, \end{aligned}$$
(68)
$$\begin{aligned} \varGamma _{ij,k}({\varvec{\theta }})&= {\int {\rho ^{\prime }_{\pi _{{\varvec{\theta }}}} \tau ^{\prime }_{\pi _{{\varvec{\theta }}}}} \left( \frac{\partial ^2 \pi _{{\varvec{\theta }}}}{\partial \theta ^i \partial \theta ^j} \frac{\partial \pi _{{\varvec{\theta }}}}{\partial \theta ^k} - \alpha \big ({x}; \rho ,\tau ,\beta \big ) \frac{\partial \pi _{{\varvec{\theta }}}}{\partial \theta ^i }\frac{\partial \pi _{{\varvec{\theta }}}}{\partial \theta ^j } \frac{\partial \pi _{{\varvec{\theta }}}}{\partial \theta ^k} \right) d{x}} \nonumber \\&\qquad \text {where}\,\, \alpha \big ({x}; \rho ,\tau ,\beta \big ) = -(1-\beta ) \frac{\tau ^{\prime \prime }_{\pi _{{\varvec{\theta }}}}({x})}{\tau ^{\prime }_{\pi _{{\varvec{\theta }}}}({x})} - \beta \frac{\rho ^{\prime \prime }_{\pi _{{\varvec{\theta }}}}({x})}{\rho ^{\prime }_{\pi _{{\varvec{\theta }}}}({x})} . \end{aligned}$$
(69)
Table 2 Dual pair of affine connections induced by divergence functions considered in Sect. 4, where \(\alpha (\beta ,\rho ,\tau )\) refers to Eq. (69) for primal and dual connections \(\varGamma ({\tilde{\pi }})\) and \(\varGamma ^*({\tilde{\pi }})\). The factor \(\rho ^{\prime }({\tilde{\pi }})\tau ^{\prime }({\tilde{\pi }})\) also specifies the Riemannian metric in Eq. (68)

Since our exposition in Sects. 2.2 and 4 considers a nonparametric manifold of arbitrary unnormalized densities, we also recall the nonparametric analogues of Eqs. (68) and (69) from Zhang [78] (see footnote 3). For tangent vectors \(u({x}), v({x}), w({x})\) (such that \(\int u({x}) d{x}= 0\)) at a point \({\tilde{\pi }}({x})\),

$$\begin{aligned} g_{u,v}({{\tilde{\pi }}})&= \langle u, v \rangle =\int \rho ^{\prime }_{{\tilde{\pi }}}{({x})} \tau ^{\prime }_{{\tilde{\pi }}}{({x})} \, u({x}) v({x}) d{x} \end{aligned}$$
(70)
$$\begin{aligned} \varGamma ^{(\beta )}_{wu,v}({\tilde{\pi }})&= \langle \nabla ^{(\beta )}_w u, v \rangle \nonumber \\&= \int \rho ^{\prime }_{{\tilde{\pi }}}{({x})} \tau ^{\prime }_{{\tilde{\pi }}}{({x})} \Big ( \big ( d_w u({x}) \big ) v({x}) -\alpha \big ({x}; \rho ,\tau ,\beta \big ) u({x}) w({x}) v({x}) \Big ) d{x}. \nonumber \\&\qquad \text {where}\,\, \alpha \big ({x}; \rho ,\tau ,\beta \big ) = -(1-\beta ) \frac{\tau ^{\prime \prime }_{{\tilde{\pi }}}({x})}{\tau ^{\prime }_{{\tilde{\pi }}}({x})} - \beta \frac{\rho ^{\prime \prime }_{{\tilde{\pi }}}({x})}{\rho ^{\prime }_{{\tilde{\pi }}}({x})} . \end{aligned}$$
(71)

where \(d_w u\) is the directional derivative of u in the direction of w and \(\rho ^{\prime }_{{\tilde{\pi }}}{({x})} = \rho ^{\prime }\big ( {\tilde{\pi }}({x}) \big )\). The parametric expression above can be recovered using, for example, \(u({x}) = \frac{\partial }{\partial \theta ^i}\pi _{{\varvec{\theta }}}({x})\).

Riemannian metrics To recover the Fisher–Rao metric, we may consider the \(\rho ({\tilde{\pi }}) = \log {\tilde{\pi }}\) and \(\tau ({\tilde{\pi }}) = {\tilde{\pi }}\) representations, which yield \(\rho ^\prime ({\tilde{\pi }}) \tau ^\prime ({\tilde{\pi }}) = {\tilde{\pi }}^{-1}\) as desired. However, the Fisher–Rao metric may also be recovered using the representations \(\rho ({\tilde{\pi }}) = \log _q({\tilde{\pi }})\), \(\tau ({\tilde{\pi }}) = \log _{1-q} {\tilde{\pi }}\) used to derive the \(\alpha \)-divergence in Example 5 (see e.g. Nielsen [54, Sect. 3.12]). Finally, the Jensen–Shannon divergence and Zhang’s \((\beta , q)\) divergence also induce the Fisher–Rao metric, since their outer integration factor is \(\rho ^\prime ({\tilde{\pi }})\tau ^\prime ({\tilde{\pi }}) = {\tilde{\pi }}^{-1}\), while the metric for the Beta divergence integrates \(\rho ^\prime ({\tilde{\pi }})\tau ^\prime ({\tilde{\pi }}) = {\tilde{\pi }}^{-q}\) and thus may be referred to as a ‘deformed’ metric [49, 80].
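Concretely, substituting \(\rho ^{\prime }({\tilde{\pi }}) \tau ^{\prime }({\tilde{\pi }}) = {\tilde{\pi }}^{-1}\) into Eq. (68) for a normalized parametric family gives

$$\begin{aligned} g_{ij}({\varvec{\theta }}) = \int \frac{1}{\pi _{{\varvec{\theta }}}({x})} \frac{\partial \pi _{{\varvec{\theta }}}({x})}{\partial \theta ^i} \frac{\partial \pi _{{\varvec{\theta }}}({x})}{\partial \theta ^j} d{x}= \mathbb {E}_{\pi _{{\varvec{\theta }}}}\left[ \frac{\partial \log \pi _{{\varvec{\theta }}}({x})}{\partial \theta ^i} \, \frac{\partial \log \pi _{{\varvec{\theta }}}({x})}{\partial \theta ^j}\right] , \end{aligned}$$

which is the Fisher information matrix.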

Affine connections Recall that the standard \(\alpha \)-connection [3, 4] is given by \(\alpha (x;\rho ,\tau ,\beta ) = \alpha \cdot \pi _{{\varvec{\theta }}}({x})^{-1}\). From Table 2, we see that the \(\alpha \)-connection with \(\alpha = \beta \) is induced from either the Amari \(\alpha \)-divergence or the Jensen–Shannon divergence, which are the rho–tau Bregman informations corresponding to the KL divergences in either direction. When treating the \(\alpha \)-divergence as a rho–tau Bregman divergence (as in Example 5) instead of a Bregman information (as in Example 4), we see that the order of the \(\alpha \)-connection is set by the representation parameter \(\alpha = q\). This mirrors our observations in Sect. 4.1. Finally, note that the Beta divergence (as a rho–tau Bregman divergence) induces the \(\alpha \)-connections of order q and 0, instead of 1 and 0 for the KL divergence or q and \(1-q\) for the Amari \(\alpha \)-divergence.

The divergence functionals derived from the rho–tau Bregman information for either the Beta, \(\alpha \), or Amari–Cichocki \((q,\lambda )\) divergences induce \(\alpha \)-connections which interpolate between the endpoint values based on the mixture parameter \(\beta \).

Limiting behavior of Amari-\(\alpha \) and Beta-divergences Finally, in Table 3, we recall the limiting behavior of the \(\alpha \) divergences as \(q\rightarrow 0\) or \(q \rightarrow 1\), and the Beta divergence as \(q \rightarrow 1\) or \(q \rightarrow 2\). While these families of divergences agree in their limiting behavior as \(q \rightarrow 1\), the Beta divergence recovers either the Euclidean Bregman divergence or Itakura–Saito divergence for \(q = 0\) and \(q \rightarrow 2\) respectively.
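For reference, and up to the multiplicative conventions used in the main text, the scalar integrands of these limiting divergences take the standard forms \(a \log \frac{a}{b} - a + b\) (KL), \(\tfrac{1}{2}(a-b)^2\) (Euclidean), and \(\frac{a}{b} - \log \frac{a}{b} - 1\) (Itakura–Saito), integrated over \({x}\) in the decomposable setting.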

Table 3 Limiting behavior in q for Amari \(\alpha \) (\(\tau \)-deformed) and Beta-divergences (\(\tau \)-id) as rho–tau Bregman divergences.

E Geodesics for the rho–tau Bregman divergence

In this section, we show that the quasi-arithmetic mixture path in the \(\rho ({\tilde{\pi }})\) representation of densities is a geodesic with respect to the primal connection induced by the rho-tau divergence \(D_{f}[\rho ({\tilde{\pi }}_1):\rho ({\tilde{\pi }}_0)]\). Recall from Zhang [78, Sect. 2.3 (Eq. 88)] that the \(\alpha \)-connection (or covariant derivative) associated with the rho–tau Bregman divergence has the form

$$\begin{aligned} \nabla ^{(\alpha )}_{\dot{\gamma }} \dot{\gamma }&= (d_{\dot{\gamma }} \dot{\gamma })_{\gamma _t} + \frac{d}{d\gamma } \big ( \alpha \log \rho ^{\prime }(\gamma _t) + (1-\alpha ) \log \tau ^{\prime }(\gamma _t) \big ) \cdot \dot{\gamma }^2 \nonumber \\&= (d_{\dot{\gamma }} \dot{\gamma })_{\gamma _t} + \left( \alpha \frac{ \rho ^{\prime \prime }(\gamma _t)}{\rho ^{\prime }(\gamma _t)} + (1-\alpha ) \frac{ \tau ^{\prime \prime }(\gamma _t)}{\tau ^{\prime }(\gamma _t)} \right) \cdot \dot{\gamma }^2 \end{aligned}$$
(72)

where \(\dot{\gamma } = \frac{d\gamma }{dt}\) and the parameter \(\alpha \) plays the role of the convex combination or mixture parameter \(\beta \). In this section, we use the notation \(\nabla ^{(\alpha )}\) to represent the affine connection, instead of the Christoffel symbol notation from e.g. Eq. (71), with \(\varGamma ^{(\alpha )}_{wu,v}({\tilde{\pi }}) = \langle \nabla ^{(\alpha )}_w u, v \rangle \).

We are interested in the Bregman divergence and \(\nabla ^{(1)}\) connection for \(\alpha =1\). Using Eq. (5), we need to show that the geodesic equation \( \nabla ^{(1)}_{\dot{\gamma }} \dot{\gamma } = 0\) holds [54, Sect. 3.12] for curves which are linear in the \(\rho \)-representation.

Theorem 2

(Geodesics for Rho-Tau Bregman Divergence) The curve \(\gamma _t = \rho ^{-1} \big ( (1-t) \rho ({\tilde{\pi }}_0) + t \, \rho ({\tilde{\pi }}_1) \big )\) (with time derivative \(\dot{\gamma _t} = \frac{d}{dt}\gamma _t\)) is auto-parallel with respect to the primal affine connection \(\nabla ^{(1)}\) induced by the rho-tau Bregman divergence \(D_{f}[\rho ({{{\tilde{\pi }}_a}}):\rho ({{{\tilde{\pi }}_b}})]\) for \(\beta = 1\). In other words, the following geodesic equation holds

$$\begin{aligned} \nabla ^{(1)}_{\dot{\gamma }_t} \dot{\gamma }_t = 0. \end{aligned}$$
(23)

The \(\rho \)-representation of the unnormalized density thus provides an affine coordinate system for the geometry induced by \(D_{f}[\rho ({{{\tilde{\pi }}_a}}):\rho ({{{\tilde{\pi }}_b}})]\).

Proof

We simplify each of the terms in the geodesic equation, where we rewrite the desired geodesic equation in Eq. (23) to match Eq. 88 of Zhang [78],

$$\begin{aligned} \nabla ^{(1)}_{\dot{\gamma }} \dot{\gamma } = d_{\dot{\gamma }} \dot{\gamma } + \big (\dot{\gamma }\big )^{2} \cdot \left( \frac{d}{d\gamma } \log {\rho ^{\prime }(\gamma )} \right) = 0 \end{aligned}$$
(73)

First, note the particularly simple expression for \(\frac{d\rho (\gamma _t)}{dt} = \rho ({\tilde{\pi }}_1) - \rho ({\tilde{\pi }}_0)\) given the definition \(\rho (\gamma (t)) = (1-t) \rho ({\tilde{\pi }}_0) + t \, \rho ({\tilde{\pi }}_1)\). Noting the chain rule \(\frac{d\rho _t}{dt} = \frac{d\rho (\gamma _t)}{d\gamma } \frac{d\gamma _t}{dt}\), we can rearrange to obtain an expression for \(\dot{\gamma }(t)\)

$$\begin{aligned} \dot{\gamma }(t) = \frac{d\gamma (t)}{dt} = \left( \frac{d\rho (\gamma _t)}{d\gamma }\right) ^{-1} \frac{d\rho (\gamma _t)}{dt} = \left( \frac{d\rho (\gamma _t)}{d\gamma }\right) ^{-1} \left( \rho ({\tilde{\pi }}_1) - \rho ({\tilde{\pi }}_0) \right) \end{aligned}$$
(74)

Taking the directional derivative \(d_{\dot{\gamma }}\dot{\gamma } = \frac{d\dot{\gamma }}{d\gamma }\cdot \dot{\gamma }\),

$$\begin{aligned} d_{\dot{\gamma _t}}\dot{\gamma _t}&= \dot{\gamma _t} \cdot \frac{d}{d\gamma } \left[ \left( \frac{d\rho (\gamma _t)}{d\gamma }\right) ^{-1} \left( \rho ({\tilde{\pi }}_1) - \rho ({\tilde{\pi }}_0) \right) \right] \nonumber \\&= - \dot{\gamma _t}\left( \rho ({\tilde{\pi }}_1) - \rho ({\tilde{\pi }}_0) \right) \frac{d\rho (\gamma _t)}{d\gamma }^{-2} \frac{d^2\rho (\gamma _t)}{d\gamma ^2} \end{aligned}$$
(75)

Rewriting the final term in Eq. (23), we have

$$\begin{aligned} \frac{d}{d\gamma } \big ( \log \rho ^{\prime }(\gamma _t) \big )&= \frac{d}{d\gamma } \big ( \log \frac{d \rho (\gamma _t)}{d\gamma } \big ) = \left( \frac{d\rho (\gamma _t)}{d\gamma }\right) ^{-1} \frac{d^2 \rho (\gamma _t) }{d\gamma ^2}. \end{aligned}$$
(76)

Putting it all together, we have

$$\begin{aligned} \nabla ^{(1)}_{\dot{\gamma _t}} \dot{\gamma _t}&= \frac{d \dot{\gamma _t}}{d\gamma } \cdot \dot{\gamma _t} + \big (\dot{\gamma _t} \big )^{2} \cdot \frac{d}{d\gamma } \big ( \log \rho ^{\prime }(\gamma _t) \big )\nonumber \\&= -\dot{\gamma _t} \left( \rho ({\tilde{\pi }}_1) - \rho ({\tilde{\pi }}_0) \right) \frac{d\rho (\gamma _t)}{d\gamma }^{-2} \frac{d^2\rho (\gamma _t)}{d\gamma ^2}\nonumber \\&\quad + (\dot{\gamma _t})^2 \left( \frac{d\rho (\gamma _t)}{d\gamma }\right) ^{-1} \frac{d^2 \rho (\gamma _t) }{d\gamma ^2} \end{aligned}$$
(77)

Noting that \(\dot{\gamma }_t = \left( \frac{d\rho (\gamma _t)}{d\gamma }\right) ^{-1} \left( \rho ({\tilde{\pi }}_1) - \rho ({\tilde{\pi }}_0) \right) \) from Eq. (74), the two terms cancel,

$$\begin{aligned} \nabla ^{(1)}_{\dot{\gamma _t}} \dot{\gamma _t}&= -\dot{\gamma _t} \left( \rho ({\tilde{\pi }}_1) - \rho ({\tilde{\pi }}_0) \right) \left( \frac{d\rho (\gamma _t)}{d\gamma }\right) ^{-2} \frac{d^2\rho (\gamma _t)}{d\gamma ^2} + \dot{\gamma _t} \left( \rho ({\tilde{\pi }}_1) - \rho ({\tilde{\pi }}_0) \right) \left( \frac{d\rho (\gamma _t)}{d\gamma }\right) ^{-2} \frac{d^2 \rho (\gamma _t) }{d\gamma ^2} = 0, \end{aligned}$$

which proves the proposition. \(\square \)
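In the scalar setting, the geodesic equation (73) reads \(\ddot{\gamma }_t + \frac{\rho ^{\prime \prime }(\gamma _t)}{\rho ^{\prime }(\gamma _t)} \, \dot{\gamma }_t^2 = 0\). The following symbolic Python sketch (assuming \(\rho = \log \) and scalar endpoint values a and b, purely for illustration) verifies this identity for the \(\rho \)-linear curve.

import sympy as sp

t, a, b = sp.symbols('t a b', positive=True)
rho, rho_inv = sp.log, sp.exp                     # representation function and its inverse

# rho-linear (quasi-arithmetic) curve between endpoint values a and b.
gamma = rho_inv((1 - t) * rho(a) + t * rho(b))
gamma_dot = sp.diff(gamma, t)
gamma_ddot = sp.diff(gamma, t, 2)

# d/dgamma log rho'(gamma) = rho''(gamma) / rho'(gamma), evaluated along the curve.
u = sp.symbols('u', positive=True)
log_rho_prime_deriv = sp.diff(sp.log(sp.diff(rho(u), u)), u).subs(u, gamma)

geodesic_residual = gamma_ddot + log_rho_prime_deriv * gamma_dot**2
print(sp.simplify(geodesic_residual))             # prints 0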


About this article


Cite this article

Brekelmans, R., Nielsen, F. Variational representations of annealing paths: Bregman information under monotonic embedding. Info. Geo. (2024). https://doi.org/10.1007/s41884-023-00129-6
