Sum-of-Squares Relaxations for Information Theory and Variational Inference

Abstract

We consider extensions of the Shannon relative entropy, referred to as f-divergences. Three classical related computational problems are typically associated with these divergences: (a) estimation from moments, (b) computing normalizing integrals, and (c) variational inference in probabilistic models. These problems are related to one another through convex duality, and they all have many applications throughout data science; we aim for computationally tractable approximation algorithms that preserve properties of the original problem, such as potential convexity or monotonicity. In order to achieve this, we derive a sequence of convex relaxations for computing these divergences from non-centered covariance matrices associated with a given feature vector: starting from the typically intractable optimal lower bound, we consider an additional relaxation based on “sums-of-squares”, which is now computable in polynomial time as a semidefinite program. We also provide computationally more efficient relaxations based on spectral information divergences from quantum information theory. For all of the tasks above, beyond proposing new relaxations, we derive tractable convex optimization algorithms, and we present illustrations on multivariate trigonometric polynomials and functions on the Boolean hypercube.

Notes

  1. Note that our notation \(D(p\Vert q)\) ignores the dependence on f.

  2. Note that using an outer approximation in Eq. (10) leads to a relaxation per se while in the dual view presented here, we obtain a strengthening due to replacing non-negative functions by the smaller set of sums-of-squares. Throughout the paper, we will use the term “relaxation” in all cases.

  3. Note that \(\Sigma _p\) and \(\Sigma _q\) do depend on \(\varphi \), but we omit this dependence from the notation.

  4. In this section, we do not need to impose that \(\Vert T \varphi (x)\Vert = 1\) for all \(x \in \mathcal {X}\), but we will in subsequent sections.

  5. A similar formulation can be derived for \(D^{\textrm{OPT}}\) using \(\Gamma \) instead of \(\widehat{\Gamma }\).

  6. Matlab code to reproduce all experiments can be downloaded from www.di.ens.fr/~fbach/fdiv_quantum_var.zip.

References

  1. Syed Mumtaz Ali and Samuel D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131–142, 1966.

  2. Shun-ichi Amari and Atsumi Ohara. Geometry of \(q\)-exponential family of probability distributions. Entropy, 13(6):1170–1185, 2011.

  3. Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(1):714–751, 2017.

  4. Francis Bach. Information theory with kernel methods. IEEE Transactions on Information Theory, 2022.

  5. Francis Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. In International Conference on Machine Learning, pages 1355–1362, 2012.

  6. Francis Bach and Alessandro Rudi. Exponential convergence of sum-of-squares hierarchies for trigonometric polynomials. SIAM Journal on Optimization, 33(3):2137–2159, 2023.

  7. Aharon Ben-Tal and Marc Teboulle. Penalty functions and duality in stochastic programming via \(\varphi \)-divergence functionals. Mathematics of Operations Research, 12(2):224–240, 1987.

  8. Dimitris Bertsimas, Xuan Vinh Doan, and Jean-Bernard Lasserre. Approximating integrals of multivariate exponentials: A moment approach. Operations Research Letters, 36(2):205–210, 2008.

  9. Rajendra Bhatia. Matrix Analysis, volume 169. Springer Science & Business Media, 2013.

  10. Andrew Blake, Pushmeet Kohli, and Carsten Rother. Markov random fields for vision and image processing. MIT Press, 2011.

  11. Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

  12. Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the Symposium on Discrete algorithms, pages 968–977, 2009.

  13. Michel Broniatowski and Amor Keziou. Minimization of \(\varphi \)-divergences on sets of signed measures. Studia Scientiarum Mathematicarum Hungarica, 43(4):403–442, 2006.

  14. Jean-François Cardoso. Dependence, correlation and Gaussianity in independent component analysis. Journal of Machine Learning Research, 4:1177–1203, 2003.

  15. Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120–145, 2011.

  16. Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In Conference on Uncertainty in Artificial Intelligence, pages 109–116, 2010.

  17. Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 1999.

  18. David Cruz-Uribe and C. J. Neugebauer. Sharp error bounds for the trapezoidal rule and Simpson’s rule. Journal of Inequalities in Pure and Applied Mathematics, 3(4), 2002.

  19. Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.

  20. Monroe D. Donsker and S. R. Srinivasa Varadhan. Asymptotic evaluation of certain Markov process expectations for large time-III. Communications on Pure and Applied Mathematics, 29(4):389–461, 1976.

  21. Bogdan Dumitrescu. Positive Trigonometric Polynomials and Signal Processing Applications, volume 103. Springer, 2007.

  22. Kun Fang and Hamza Fawzi. The sum-of-squares hierarchy on the sphere and applications in quantum information theory. Mathematical Programming, 190(1):331–360, 2021.

  23. Hamza Fawzi and Omar Fawzi. Defining quantum divergences via convex optimization. Quantum, 5:387, 2021.

  24. Walter Gautschi. Numerical Analysis. Springer Science & Business Media, 2011.

  25. Luis Gonzalo Sanchez Giraldo, Murali Rao, and Jose C. Principe. Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory, 61(1):535–548, 2014.

  26. Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.

  27. William W. Hager. Minimizing a quadratic over a sphere. SIAM Journal on Optimization, 12(1):188–208, 2001.

  28. Frank Hansen and Gert Kjærgård Pedersen. Jensen’s inequality for operators and Löwner’s theorem. Mathematische Annalen, 258(3):229–241, 1982.

  29. Fumio Hiai and Milán Mosonyi. Different quantum \(f\)-divergences and the reversibility of quantum operations. Reviews in Mathematical Physics, 29(07):1750023, 2017.

  30. Johannes Jahn. Introduction to the Theory of Nonlinear Optimization. Springer, 2020.

  31. Michael I. Jordan and Martin J. Wainwright. Semidefinite relaxations for approximate inference on graphs with cycles. Advances in Neural Information Processing Systems, 16, 2003.

  32. James E. Kelley, Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960.

  33. Jean-Bernard Lasserre. An explicit exact SDP relaxation for nonlinear 0–1 programs. In International Conference on Integer Programming and Combinatorial Optimization, pages 293–303. Springer, 2001.

  34. Jean-Bernard Lasserre. Moments, Positive Polynomials and their Applications, volume 1. World Scientific, 2010.

  35. Monique Laurent. A comparison of the Sherali-Adams, Lovász-Schrijver, and Lasserre relaxations for 0–1 programming. Mathematics of Operations Research, 28(3):470–496, 2003.

  36. Steffen L. Lauritzen. Graphical Models, volume 17. Clarendon Press, 1996.

  37. Friedrich Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.

  38. Friedrich Liese and Igor Vajda. \(f\)-divergences: sufficiency, deficiency and testing of hypotheses. Advances in Inequalities from Probability Theory and Statistics, pages 131–173, 2008.

  39. David G. Luenberger. Optimization by vector space methods. John Wiley & Sons, 1997.

  40. Keiji Matsumoto. A new quantum version of \(f\)-divergence. In Nagoya Winter Workshop: Reality and Measurement in Algebraic Quantum Theory, pages 229–273. Springer, 2015.

  41. Tom Minka. Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research Ltd, 2005.

  42. Ilya Mironov. Rényi differential privacy. In Computer Security Foundations Symposium, pages 263–275, 2017.

  43. Kevin P. Murphy. Machine Learning: a Probabilistic Perspective. MIT Press, 2012.

  44. Yurii Nesterov and Arkadii Nemirovskii. Interior-point Polynomial Algorithms in Convex Programming. SIAM, 1994.

  45. XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. On surrogate loss functions and \(f\)-divergences. The Annals of Statistics, 37(2):876–904, 2009.

  46. XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

  47. Ryan O’Donnell. Analysis of Boolean functions. Cambridge University Press, 2014.

  48. Anthony O’Hagan. Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 29(3):245–260, 1991.

  49. Pablo A. Parrilo. Semidefinite programming relaxations for semialgebraic problems. Mathematical Programming, 96(2):293–320, 2003.

  50. Antoine Picard-Weibel and Benjamin Guedj. On change of measure inequalities for \(f\)-divergences. Technical Report 2202.05568, arXiv, 2022.

  51. Yury Polyanskiy and Yihong Wu. Information Theory: From Coding to Learning. Cambridge University Press, 2023.

  52. Christian P. Robert and George Casella. Monte Carlo Statistical Methods, volume 2. Springer, 1999.

  53. Ralph Tyrell Rockafellar. Convex Analysis. Princeton University Press, 2015.

  54. Paul Rubenstein, Olivier Bousquet, Josip Djolonga, Carlos Riquelme, and Ilya O. Tolstikhin. Practical and consistent estimation of \(f\)-divergences. Advances in Neural Information Processing Systems, 32, 2019.

  55. Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems, 28, 2015.

  56. Igal Sason. On \(f\)-divergences: Integral representations, local behavior, and inequalities. Entropy, 20(5):383, 2018.

  57. Claus Scheiderer. Sums of squares on real algebraic surfaces. Manuscripta Mathematica, 119:395–410, 2006.

  58. Lucas Slot and Monique Laurent. Sum-of-squares hierarchies for binary polynomial optimization. Mathematical Programming, pages 1–40, 2022.

  59. Gabor Szegö. Orthogonal Polynomials. American Mathematical Society Colloquium Publications, 1975.

  60. Marco Tomamichel. Quantum Information Processing with Finite Resources: Mathematical Foundations, volume 5. Springer, 2015.

  61. Leslie G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2):189–201, 1979.

  62. Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., 2008.

  63. David V. Widder. The Stieltjes transform. Transactions of the American Mathematical Society, 43(1):7–60, 1938.

Acknowledgements

The author would like to thank Omar Fawzi for discussions related to quantum information divergences, Adrien Taylor and Justin Carpentier for discussions on optimization algorithms, as well as David Holzmüller for providing clarifying comments. The comments of the anonymous reviewers were greatly appreciated. We acknowledge support from the French government under the management of the Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), as well as from the European Research Council (Grant SEQUOIA 724063).

Corresponding author

Correspondence to Francis Bach.

Additional information

Communicated by Jim Renegar.

Appendices

Appendix A: Newton Method for Computing F

In order to compute \(F(v,w) = \sup _{r \geqslant 0} \frac{ r v + w - f(r)}{r+1}\) in Sect. 2.3, we fix \((v,w)\in \mathbb{R}^2\) and define \(\varphi (r) = \frac{ r v + w - f(r)}{r+1}\), which is twice differentiable, with \(\varphi '(r) = \frac{1}{(r+1)^2} \big [ v-w - \big ( (r+1)f'(r) - f(r) \big ) \big ]\). Since the function \(r \mapsto (r+1)f'(r) - f(r) \) has derivative \(r \mapsto (r+1)f''(r)\), which is strictly positive, it is strictly increasing. The derivative \(\varphi '\) therefore has at most one zero, and we aim to solve the equation \( (r+1)f'(r) - f(r) = v - w\).

We now focus on the special case \(f(t) = t \log t - t + 1\), where \( \psi (r) = (r+1)f'(r) - f(r) = r - 1 + \log r\) maps \(\mathbb{R}_+\) onto \(\mathbb{R}\), so that the equation above has a unique solution. To compute this solution, for \(v-w \geqslant 0\) we iterate Newton's method [24, Section 4.8], \(r \leftarrow r + \frac{v-w - \psi (r)}{\psi '(r) } = \frac{2 - \log (r) + v-w}{ 1 + 1/r}\), for 5 iterations, which reaches machine precision; for \(v-w \leqslant 0\) we instead iterate Newton's method on the logarithm of r, \( \log r \leftarrow \log r + \frac{v-w - \psi (r)}{r \psi '(r) } = \frac{ 1 + v-w + ( \log r -1) r}{ 1 + r} \), also for 5 iterations.
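For concreteness, here is a minimal NumPy sketch of this Newton scheme (the paper's reference implementation, per the footnote above, is in Matlab); the initialization \(r = 1\) and the helper name are illustrative choices, not taken from the paper.

```python
import numpy as np

def F(v, w, n_iter=5):
    """Compute F(v, w) = sup_{r >= 0} (r v + w - f(r)) / (r + 1)
    for f(t) = t log t - t + 1, via the two Newton iterations above.
    Minimal sketch; r = 1 is an illustrative initialization."""
    r = 1.0
    if v - w >= 0:
        # Newton's method on psi(r) = r - 1 + log r = v - w
        for _ in range(n_iter):
            r = (2.0 - np.log(r) + v - w) / (1.0 + 1.0 / r)
    else:
        # Newton's method in the variable log r
        for _ in range(n_iter):
            r = np.exp((1.0 + v - w + (np.log(r) - 1.0) * r) / (1.0 + r))
    # value of (r v + w - f(r)) / (r + 1) at the computed stationary point
    return (r * v + w - (r * np.log(r) - r + 1.0)) / (r + 1.0)
```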

Appendix B: Decomposition of Operator-Convex Functions

We have the following particular cases from Sect. 2.

  • \(\alpha \)-divergences: \(f(t) =\frac{1}{\alpha (\alpha -1)} \big [t^\alpha - \alpha t + (\alpha -1) \big ] = \frac{1}{\alpha } \frac{\sin (\alpha -1)\pi }{(\alpha -1)\pi } (t-1)^2 \int _0^{+\infty } \frac{1}{t+\lambda } \frac{ \lambda ^\alpha d\lambda }{(1+\lambda )^2}\) for \(\alpha \in (-1,2)\). Other representations exist for \(\alpha =-1\) and \(\alpha =2\) (see below), but other cases are not operator-convex.

  • KL divergence (\(\alpha =1\)): \(f (t) = t \log t -t + 1 = \int _0^{+\infty } \frac{(t-1)^2}{t+\lambda } \frac{ \lambda d\lambda }{(\lambda +1)^2}\).

  • Reverse KL divergence (\(\alpha =0)\): \(f(t) = - \log t + t - 1= \int _0^{+\infty } \frac{(t-1)^2}{t+\lambda } \frac{ d\lambda }{(\lambda +1)^2}\).

  • Pearson \(\chi ^2\) divergence (\(\alpha =2\)): \(f(t) = \frac{1}{2} (t-1)^2\) is operator convex.

  • Reverse Pearson \(\chi ^2\) divergence (\(\alpha =-1\)): \(f(t) = \frac{1}{2} \big ( \frac{1}{t} + t \big ) - 1 = \frac{1}{2} \frac{(t-1)^2}{t}\) is operator convex, with \(d\nu (\lambda )\) proportional to a Dirac at \(\lambda = 0\).

  • Le Cam distance: \(f(t) = \frac{(t-1)^2}{t+1}\) is operator convex with \(d\nu (\lambda )\) proportional to a Dirac at \(\lambda = 1\).

  • Jensen–Shannon divergence: \(f(t) =2 t \log \frac{2t}{t+1} + 2 \log \frac{2}{t+1} =2 t \log t - 2(t+1) \log (t+1) + 2(t+1) \log 2 = 2t \log t - 4 \frac{t+1}{2}\log \frac{t+1}{2} \) is operator convex, as it can be written \( f(t) = 2(t-1)^2 \int _0^{+\infty } \big ( \frac{1}{t+\lambda } - \frac{1}{t+1+2\lambda } \big ) \frac{\lambda d\lambda }{(1+\lambda )^2} \), which leads to \( f(t) =2 (t-1)^2 \int _0^{+\infty } \frac{1}{t+\lambda } \frac{\lambda d\lambda }{(1+\lambda )^2} - 2 (t-1)^2 \int _1^{+\infty } \frac{1}{t+\lambda } \frac{(\lambda -1) d\lambda }{(1+\lambda )^2}\).
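As a quick numerical sanity check of the KL representation in the list above, one can compare the closed form of f with its integral representation using standard quadrature. A minimal SciPy sketch (not part of the paper):

```python
import numpy as np
from scipy.integrate import quad

def f_kl(t):
    # f(t) = t log t - t + 1
    return t * np.log(t) - t + 1.0

def f_kl_integral(t):
    # (t - 1)^2 * int_0^infty  1/(t + lam) * lam / (1 + lam)^2  dlam
    integrand = lambda lam: (t - 1.0) ** 2 / (t + lam) * lam / (1.0 + lam) ** 2
    value, _ = quad(integrand, 0.0, np.inf)
    return value

for t in [0.25, 0.5, 2.0, 5.0]:
    print(t, f_kl(t), f_kl_integral(t))  # the last two columns should agree
```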

Appendix C: Proofs of Lemmas 1 and 2

In this section, we prove Lemma 1 from Sect. 7, taken from [40, Section 9.1] and reproduced here for completeness, together with the expression of the maximizers. We also prove its extension, Lemma 2.

We start with the dual formulation:

$$\begin{aligned} &\sup _{M,N \in \mathbb{H}_d } \ \mathop{\mathrm{tr}}[ MA ] + \mathop{\mathrm{tr}}[ NB] \quad \text{such that}\quad \forall r \geqslant 0, \ r M + N \preccurlyeq f(r) I \\ &\quad = \inf _{\Lambda \ \mathbb{H}_d^+\text{-valued measure on}\ \mathbb{R}_+} \int _0^{+\infty }\! f(r) \mathop{\mathrm{tr}}\big[ d\Lambda (r) \big] \quad \text{such that}\quad \int _0^{+\infty }\! d\Lambda (r) = B \quad \text{and}\quad \int _0^{+\infty } \! r \, d\Lambda (r) = A, \end{aligned}$$

which can be solved in closed form.

Indeed, given the eigendecomposition \(B^{-1/2} A B^{-1/2} = \sum _{i =1}^d \lambda _i u_i u_i^*\), we consider \(\Lambda = \sum _{i =1}^d B^{1/2} u_i u_i^*B^{1/2} \delta _{\lambda _i}\), where \(\delta _{\lambda _i}\) is the Dirac measure at \(\lambda _i\); this yields a feasible measure \(\Lambda \), with objective equal to \(\sum _{i=1}^d f(\lambda _i) \mathop{\mathrm{tr}}\big[B^{1/2} u_i u_i^*B^{1/2}\big] = \mathop{\mathrm{tr}}\big[ B f \big ( B^{-1/2}AB^{-1/2} \big ) \big]\). Thus the infimum is at most \( \mathop{\mathrm{tr}}\big[ B f \big ( B^{-1/2}A B^{-1/2} \big ) \big]\).

The other direction is a direct consequence of the operator Jensen inequality [28]: for any feasible measure \(\Lambda \) approximated by an empirical measure \(\sum _{i =1}^m M_i \delta _{r_i}\), with \(M_i \succcurlyeq 0\), we have \(\sum _{i =1}^m ( M_i^{1/2} B^{-1/2} )^*( M_i^{1/2} B^{-1/2} ) = I\), and thus

$$\begin{aligned} \int _{\mathbb{R}_+} \! f(r) \, d\Lambda (r) &= B^{1/2} \Big (\sum _{i =1}^m ( M_i^{1/2} B^{-1/2} )^* f(r_i I) ( M_i^{1/2} B^{-1/2} ) \Big ) B^{1/2} \\ &\succcurlyeq B^{1/2} f \Big (\sum _{i =1}^m ( M_i^{1/2} B^{-1/2} )^* (r_i I) ( M_i^{1/2} B^{-1/2} ) \Big ) B^{1/2} \\ &= B^{1/2} f \big ( B^{-1/2}A B^{-1/2} \big ) B^{1/2}. \end{aligned}$$

The lower bound follows by letting the number m of Diracs go to infinity, so as to tightly approximate any feasible measure \(\Lambda \).

In order to obtain the maximizers M and N, we simply notice that they are the gradients of the function \((A,B) \mapsto \mathop{\mathrm{tr}}\big[ B f \big ( B^{-1/2}A B^{-1/2} \big ) \big]\) with respect to A and B. Using gradients of spectral functions, we thus get

$$\begin{aligned} M^*= \sum _{i,j=1}^d \frac{f(\lambda _i)-f(\lambda _j)}{\lambda _i - \lambda _j} u_i^\top B u_j \cdot B^{-1/2} u_i u_j^*B^{-1/2}, \end{aligned}$$

with the convention that for \(\lambda _i = \lambda _j\), \( \frac{f(\lambda _i)-f(\lambda _j)}{\lambda _i - \lambda _j} = f'(\lambda _i)\). To obtain \(N^*\), we simply consider the identity \(\mathop { \mathrm tr}\big [ B f \big ( B^{-1/2}A B^{-1/2} \big ) \big ] = \mathop { \mathrm tr}\big [ A g \big ( A^{-1/2}B A^{-1/2} \big ) \big ]\), for \(g(t) = t f(1/t)\), and use the corresponding formula.
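The closed-form value and the formula for \(M^*\) above translate directly into code. A minimal NumPy sketch (not from the paper), written for the KL generator \(f(t) = t \log t - t + 1\) and positive definite A, B; the function names are illustrative:

```python
import numpy as np

def f_kl(t):
    return t * np.log(t) - t + 1.0

def df_kl(t):
    return np.log(t)

def value_and_M(A, B):
    """Return tr[B f(B^{-1/2} A B^{-1/2})] and the dual variable M*
    computed from the eigendecomposition and first divided differences of f."""
    wB, VB = np.linalg.eigh(B)
    B_half_inv = VB @ np.diag(wB ** -0.5) @ VB.T            # B^{-1/2}
    C = B_half_inv @ A @ B_half_inv
    lam, U = np.linalg.eigh((C + C.T) / 2)                  # B^{-1/2} A B^{-1/2} = U diag(lam) U^T
    value = np.trace(B @ (U * f_kl(lam)) @ U.T)             # tr[B f(C)]
    # divided differences (f(lam_i) - f(lam_j)) / (lam_i - lam_j), with f'(lam_i) on the diagonal
    L1, L2 = np.meshgrid(lam, lam, indexing="ij")
    diff = L1 - L2
    close = np.abs(diff) < 1e-12
    D = np.where(close, df_kl(L1), (f_kl(L1) - f_kl(L2)) / np.where(close, 1.0, diff))
    # M* = B^{-1/2} U ( D .* (U^T B U) ) U^T B^{-1/2}
    M = B_half_inv @ U @ (D * (U.T @ B @ U)) @ U.T @ B_half_inv
    return value, M
```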

Proof of Lemma 2

We simply need to modify slightly the proof of the lemma above, replacing \(\int _0^{+\infty }\! f(r) \mathop{\mathrm{tr}}\big[ d\Lambda (r) \big] \) by \(\int _0^{+\infty }\! f(r) \mathop{\mathrm{tr}}\big[ V d\Lambda (r) \big]\); all inequalities are preserved because \(V \succcurlyeq 0\).

Appendix D: Computing Integrals

Beyond computing the divergences themselves and the associated log-partition functions, a third task is related to f-divergences. In this section, we consider the task of computing \( \int _{\mathcal {X}} f^*( h(x)) dq(x)\), where q is a finite positive measure on \(\mathcal {X}\), \(f^*\) is the Fenchel conjugate of f, and \(h: \mathcal {X}\rightarrow \mathbb{R}\) is an arbitrary function (such that the integral is finite). The difference with computing log-partition functions in Eq. (24) is minor, and we thus extend only a few results from Sect. 8 and Sect. 9, most without proofs, as they follow the same lines as the results for f-partition functions.

For \(f(t) = t \log t - t + 1\), we have \(f^*(u) =e^u - 1\), and we thus aim at estimating integrals of exponential functions, a classical task in probabilistic modelling (see [43, 62] and references therein), which, up to a logarithm, is the same as computing the log-partition function; the two tasks differ, however, for other functions f.

This computational task can be classically related to f-divergences by Fenchel duality as we have:

$$\begin{aligned} \int _{\mathcal {X}} f^*( h(x)) dq(x) = \sup _{p\ \mathrm{positive \ measure \ on}\ \mathcal {X}}\ \int _\mathcal {X}h(x) dp(x) - D(p\Vert q), \end{aligned}$$

where the only difference with Eq. (23) is that p is not assumed to sum to one. Below, we show that for functions h(x) which are quadratic forms in \(\varphi (x)\), we can replace \(D(p\Vert q)\) by the lower bound defined above and obtain a computable upper bound on the integral.
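As a concrete instance of this duality (not spelled out above), take the KL case \(f(t) = t \log t - t + 1\), so that \(f^*(u) = e^u - 1\), and consider p with a density \(r(x) = dp/dq(x)\) with respect to q; the supremum can then be computed pointwise in x:

$$\begin{aligned} \sup _{p} \ \int _\mathcal{X} h(x) \, dp(x) - D(p\Vert q) = \int _\mathcal{X} \sup _{r \geqslant 0} \big[ r \, h(x) - f(r) \big] \, dq(x) = \int _\mathcal{X} \big( e^{h(x)} - 1 \big) \, dq(x), \end{aligned}$$

with the inner supremum attained at \(r = e^{h(x)}\).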

We also have the representation corresponding to Eq. (25), that will be useful later:

$$\begin{aligned} \int _\mathcal{X} f^*(h(x)) \, dq(x) = \inf _{ w: \mathcal{X}\rightarrow \mathbb{R}} \ - \int _\mathcal{X} w(x) \, dq(x) \quad \text{such that}\quad \forall x \in \mathcal{X}, \ \forall r \geqslant 0, \quad r h(x) + w(x) \leqslant f(r). \end{aligned}$$
(33)

1.1 Related Work

There exist many ways of estimating integrals, in particular on compact sets in small dimensions, where various quadrature rules, such as the trapezoidal or Simpson's rule, can be applied to compute integrals based on function evaluations, with well-defined convergence rates [18]. In higher dimensions, still based on function evaluations, Bayes–Hermite quadrature rules [48] and the related kernel quadrature rules [5, 16] come with precise convergence rates linking the approximation error and the number of function evaluations [3]. An alternative in our context is Monte Carlo integration from samples from q [52], with convergence rate in \(O(1/\sqrt{n})\) from n function evaluations.
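As a point of comparison for the relaxations below, the Monte Carlo baseline just mentioned is straightforward to implement when q is a probability measure one can sample from. A minimal NumPy sketch (the callables and the example integrand are illustrative, not from the paper):

```python
import numpy as np

def mc_integral(h, f_star, sample_q, n=100_000, seed=0):
    """Plain Monte Carlo estimate of int f*(h(x)) dq(x), for q a probability
    measure that can be sampled from; O(1/sqrt(n)) convergence in n samples."""
    rng = np.random.default_rng(seed)
    X = sample_q(rng, n)                               # n samples from q
    vals = f_star(np.array([h(x) for x in X]))
    return vals.mean(), vals.std() / np.sqrt(n)        # estimate and standard error

# KL case: f(t) = t log t - t + 1, so f*(u) = e^u - 1; q = standard Gaussian on R^2
estimate, stderr = mc_integral(
    h=lambda x: -np.dot(x, x),
    f_star=lambda u: np.exp(u) - 1.0,
    sample_q=lambda rng, n: rng.standard_normal((n, 2)),
)
```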

In this paper, we follow [8] and consider computing integrals given a specific knowledge of the integrand, here of the form \(f^*(h(x))\), where h is a known quadratic form in a feature vector \(\varphi (x)\). While we also use a sum-of-squares approach as in [8], we rely on different tools (link with f-divergences and partition functions rather than integration by parts).

1.2 Relaxations

In order to compute integrals, we simply use the same technique but without the constraint that measures sum to one, that is, without the constraint that \(\mathop { \mathrm tr}[A]=1\). Starting from Eq. (33), we get, with \(B = \Sigma _q\):

$$\begin{aligned} \widetilde{C}_q(H) &= \int _\mathcal{X} f^*\big ( \varphi (x)^* H \varphi (x) \big ) \, dq(x) = \sup _{ p\in \mathcal{M}_+(\mathcal{X})} \ \int _\mathcal{X} \varphi (x)^* H \varphi (x) \, dp(x) - D(p\Vert q) \\ &\leqslant \sup _{ A \in \mathcal{C} } \ \mathop{\mathrm{tr}}[ HA] - D^{\textrm{OPT}}(A\Vert B) = \widetilde{C}_q^{\textrm{OPT}} (H) \\ &= \inf _{N \in \mathbb{H}_d } \ - \mathop{\mathrm{tr}}[N B ] \quad \text{such that}\quad \forall r \geqslant 0, \quad \Gamma ( rH + N) \leqslant f(r). \end{aligned}$$

Note that we only have an inequality here because we are not optimizing over q. We then get two computable relaxations by considering \(D^{\textrm{SOS}}(A\Vert B) \) and \(D^{\textrm{QT}}(A\Vert B) \) instead of \(D^{\textrm{OPT}}(A\Vert B) \), with the respective formulations:

$$\begin{aligned} \widetilde{C}_q^{\textrm{SOS}} (H) &= \inf _{N \in \mathbb{H}_d } \ - \mathop{\mathrm{tr}}[N B ] \quad \text{such that}\quad \forall r \geqslant 0, \ f(r) U - rH - N \in \widehat{\mathcal{C}}^* \\ \widetilde{C}_q^{\textrm{QT}} (H) &= \sup _{A \in \mathbb{H}_d } \ \inf _{V \succcurlyeq 0, \ V - I\in \mathcal{V}^\perp } \ \mathop{\mathrm{tr}}[A H ] - \mathop{\mathrm{tr}}\big[ B^{1/2} V B^{1/2} f \big (B^{-1/2}A B^{-1/2} \big ) \big]. \end{aligned}$$

Dual formulations and algorithms can then easily be derived.

Appendix E: Dual Formulations

In this appendix, we present dual variational formulations for most of the formulations proposed in the main paper. We only consider the relaxations of f-divergences; formulations for f-partition functions can be derived similarly.

1.1 f-Divergences

We can consider the Lagrangian dual of Eq. (3), by introducing a Lagrange multiplier \(\lambda \) for the infinite-dimensional constraint \(\forall x \in \mathcal {X}, \forall r \geqslant 0, \ r v(x)+w(x) \leqslant f(r)\) in the form of a positive finite measure \(\lambda \) on \(\mathcal {X}\times \mathbb {{R}}_+\) [30]. We then obtain, using strong duality for equality constraints [39, Section 8.6]:

$$\begin{aligned} D(p\Vert q) &= \inf _{ \lambda \in \mathcal{M}_+( \mathcal{X}\times \mathbb{R}_+) } \ \sup _{ v,w: \mathcal{X}\rightarrow \mathbb{R}} \ \int _\mathcal{X} v(x) \, dp(x) + \int _\mathcal{X} w(x) \, dq(x) + \int _\mathcal{X}\int _{\mathbb{R}_+} \big[ f(r) - r v(x) - w(x) \big] \, d\lambda (x,r) \\ &= \inf _{\lambda \in \mathcal{M}_+( \mathcal{X}\times \mathbb{R}_+) } \ \int _\mathcal{X}\int _{\mathbb{R}_+} f(r) \, d\lambda (x,r) \quad \text{such that}\quad \int _{\mathbb{R}_+} \! d\lambda (\cdot ,r)=dq(\cdot ) \quad \text{and}\quad \int _{\mathbb{R}_+} \! r \, d\lambda (\cdot ,r)=dp(\cdot ), \end{aligned}$$
(34)

with the two constraints resulting from the maximization with respect to v and w.
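For intuition, when p has a density \(dp/dq\) with respect to q, one feasible \(\lambda \) in Eq. (34) is the deterministic coupling \(d\lambda (x,r) = dq(x)\, \delta _{dp/dq(x)}(dr)\): it satisfies both marginal constraints and yields

$$\begin{aligned} \int _\mathcal{X}\int _{\mathbb{R}_+} f(r) \, d\lambda (x,r) = \int _\mathcal{X} f\Big( \frac{dp}{dq}(x) \Big) \, dq(x) = D(p\Vert q), \end{aligned}$$

so the infimum in Eq. (34) is attained at this choice.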

1.2 Optimal Relaxation of f-Divergences (\(D^{\textrm{OPT}}\))

We can also formulate Eq. (15) as a minimization problem using Lagrangian duality, akin to Eq. (34). This requires introducing a Lagrange multiplier for the constraint \(\forall r \geqslant 0, \ \Gamma (rM+N) \leqslant f(r)\), which is equivalent to \( \forall r \geqslant 0, \ f(r) - rM-N \in \widehat{\mathcal {C}}^*\), and thus leads to a \({\mathcal {C}}\)-valued finite measure on \(\mathbb{R}_+\) [30], to get:

$$\begin{aligned} D^{\textrm{OPT}} ( A \Vert B ) &= \inf _{\Lambda \ \mathcal{C}\text{-valued measure on}\ \mathbb{R}_+} \ \sup _{M,N \in \mathbb{H}_d} \ \mathop{\mathrm{tr}}[ AM ] + \mathop{\mathrm{tr}}[BN] + \int _0^{+\infty }\! \mathop{\mathrm{tr}}\big[ d\Lambda (r) \big( f(r) - rM - N \big) \big] \\ &= \inf _{\Lambda \ \mathcal{C}\text{-valued measure on}\ \mathbb{R}_+} \int _0^{+\infty }\! f(r) \mathop{\mathrm{tr}}\big[ d\Lambda (r) \big] \quad \text{such that}\quad \int _0^{+\infty }\! d\Lambda (r)= B \quad \text{and}\quad \int _0^{+\infty } \! r \, d\Lambda (r) = A. \end{aligned}$$
(35)

1.3 SOS Relaxation of f-Divergences (\(D^{\textrm{SOS}}\))

We can also get a formulation for Eq. (17), akin to Eqs. (34) and (35). This requires introducing a Lagrange multiplier for the constraint \( \forall r \geqslant 0, \ f(r) U - rM-N \in \widehat{\mathcal {C}}^*\), which is a \(\widehat{\mathcal {C}}\)-valued finite measure on \(\mathbb{R}_+\) [30], to get:

$$\begin{aligned} D^{\textrm{SOS}} ( A \Vert B ) &= \inf _{\Lambda \ \widehat{\mathcal{C}}\text{-valued measure on}\ \mathbb{R}_+} \ \sup _{M,N \in \mathbb{H}_d} \ \mathop{\mathrm{tr}}[ AM ] + \mathop{\mathrm{tr}}[ BN] + \int _0^{+\infty }\! \mathop{\mathrm{tr}}\big[ d\Lambda (r) \big( f(r) - rM - N \big) \big] \\ &= \inf _{\Lambda \ \widehat{\mathcal{C}}\text{-valued measure on}\ \mathbb{R}_+} \int _0^{+\infty }\! f(r) \mathop{\mathrm{tr}}\big[ d\Lambda (r) \big] \quad \text{such that}\quad \int _0^{+\infty }\! d\Lambda (r) = B \quad \text{and}\quad \int _0^{+\infty } \! r \, d\Lambda (r) = A, \end{aligned}$$
(36)

as maximizing out \(M,N \in \mathbb {H}_d\) introduces linear constraints.

1.4 Quantum Relaxations of f-Divergences (\(D^{\textrm{QT}}\))

Equation (21) also has a primal–dual form as

$$\begin{aligned} D^{\textrm{QT}}(A\Vert B) &= \sup _{V \succcurlyeq 0, \ V - I\in \mathcal{V}^\perp } \ \inf _{\Lambda \ \mathbb{H}_d^+\text{-valued measure on}\ \mathbb{R}_+} \int _0^{+\infty }\! f(r) \mathop{\mathrm{tr}}\big[ d\Lambda (r) V \big] \\ &\qquad \text{such that}\quad \int _0^{+\infty }\! d\Lambda (r) = B \quad \text{and}\quad \int _0^{+\infty } \! r \, d\Lambda (r) = A. \end{aligned}$$
(37)

We also have a formulation akin to Eq. (36), that is, adding a Lagrange multiplier \(\Sigma \in {\mathcal {V}}\) in Eq. (37) for the constraint \(I-V \in {\mathcal {V}}^\perp \):

$$\begin{aligned} D^{\textrm{QT}}(A\Vert B) &= \inf _{\Sigma \in \mathcal{V}, \ \Lambda \ \mathbb{H}_d^+\text{-valued measure on}\ \mathbb{R}_+} \int _0^{+\infty }\! f(r) \mathop{\mathrm{tr}}\big[ d\Lambda (r) \big] + \mathop{\mathrm{tr}}[ \Sigma ] \\ &\qquad \text{such that}\quad \int _0^{+\infty }\! d\Lambda (r) = B, \quad \int _0^{+\infty } \! r \, d\Lambda (r) = A, \quad \text{and}\quad \int _0^{+\infty }\! f(r) \, d\Lambda (r) \preccurlyeq \Sigma , \end{aligned}$$

which shows the additional relaxation compared to \(D^{\textrm{SOS}}(A\Vert B)\), for which \(\Lambda \) is a measure (almost everywhere) valued in \({\mathcal {V}}\), while here it is only in \(\mathbb {H}_d^+\).
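As a numerical illustration of the \(\mathbb{H}_d^+\)-valued-measure formulation (the inner problem of Eq. (37) with V = I), one can discretize \(\Lambda \) over a finite grid of values of r: restricting the support can only increase the infimum, and, by the construction of Appendix C, a grid containing the eigenvalues of \(B^{-1/2} A B^{-1/2}\) recovers \(\mathop{\mathrm{tr}}\big[ B f \big( B^{-1/2} A B^{-1/2} \big) \big]\) when f is operator convex. A minimal CVXPY sketch (not from the paper; the grid and function names are illustrative):

```python
import numpy as np
import cvxpy as cp

def f_kl(t):
    return t * np.log(t) - t + 1.0

def discretized_inner_problem(A, B, n_grid=50):
    """Grid-discretized version of: minimize int f(r) tr[dLambda(r)] over
    H_d^+-valued measures with int dLambda = B and int r dLambda = A
    (inner problem of Eq. (37) with V = I), for f(t) = t log t - t + 1
    and real symmetric positive definite A, B."""
    d = A.shape[0]
    wB, VB = np.linalg.eigh(B)
    B_half_inv = VB @ np.diag(wB ** -0.5) @ VB.T
    C = B_half_inv @ A @ B_half_inv
    lam, U = np.linalg.eigh((C + C.T) / 2)
    grid = np.unique(np.concatenate([np.linspace(0.05, 5.0, n_grid), lam]))
    Lams = [cp.Variable((d, d), PSD=True) for _ in grid]
    constraints = [sum(Lams) == B,
                   sum(r * L for r, L in zip(grid, Lams)) == A]
    objective = cp.Minimize(sum(f_kl(r) * cp.trace(L) for r, L in zip(grid, Lams)))
    prob = cp.Problem(objective, constraints)
    prob.solve()
    reference = np.trace(B @ (U * f_kl(lam)) @ U.T)   # tr[B f(B^{-1/2} A B^{-1/2})]
    return prob.value, reference                      # should agree up to solver accuracy
```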

Cite this article

Bach, F. Sum-of-Squares Relaxations for Information Theory and Variational Inference. Found Comput Math (2024). https://doi.org/10.1007/s10208-024-09651-0
