Abstract
We consider extensions of the Shannon relative entropy, referred to as f-divergences. Three classical related computational problems are typically associated with these divergences: (a) estimation from moments, (b) computing normalizing integrals, and (c) variational inference in probabilistic models. These problems are related to one another through convex duality, and all of them have many applications throughout data science; we aim for computationally tractable approximation algorithms that preserve properties of the original problem, such as potential convexity or monotonicity. In order to achieve this, we derive a sequence of convex relaxations for computing these divergences from non-centered covariance matrices associated with a given feature vector: starting from the typically non-tractable optimal lower bound, we consider an additional relaxation based on “sums-of-squares”, which is now computable in polynomial time as a semidefinite program. We also provide computationally more efficient relaxations based on spectral information divergences from quantum information theory. For all of the tasks above, beyond proposing new relaxations, we derive tractable convex optimization algorithms, and we present illustrations on multivariate trigonometric polynomials and functions on the Boolean hypercube.
Notes
Note that our notation \(D(p\Vert q)\) ignores the dependence on f.
Note that using an outer approximation in Eq. (10) leads to a relaxation per se, while in the dual view presented here, we obtain a strengthening due to replacing nonnegative functions by the smaller set of sums-of-squares. Throughout the paper, we will use the term “relaxation” in all cases.
Note that \(\Sigma _p\) and \(\Sigma _q\) do depend on \(\varphi \), but we omit this dependence from the notation.
In this section, we do not need to impose that \(\Vert T \varphi (x)\Vert = 1\) for all \(x \in \mathcal {X}\), but we will in subsequent sections.
A similar formulation can be derived for \(D^{\textrm{OPT}}\) using \(\Gamma \) instead of \(\widehat{\Gamma }\).
Matlab code to reproduce all experiments can be downloaded from www.di.ens.fr/~fbach/fdiv_quantum_var.zip.
References
Syed Mumtaz Ali and Samuel D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131–142, 1966.
Shun-ichi Amari and Atsumi Ohara. Geometry of \(q\)-exponential family of probability distributions. Entropy, 13(6):1170–1185, 2011.
Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(1):714–751, 2017.
Francis Bach. Information theory with kernel methods. IEEE Transactions on Information Theory, 2022.
Francis Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. In International Conference on Machine Learning, pages 1355–1362, 2012.
Francis Bach and Alessandro Rudi. Exponential convergence of sum-of-squares hierarchies for trigonometric polynomials. SIAM Journal on Optimization, 33(3):2137–2159, 2023.
Aharon Ben-Tal and Marc Teboulle. Penalty functions and duality in stochastic programming via \(\varphi \)-divergence functionals. Mathematics of Operations Research, 12(2):224–240, 1987.
Dimitris Bertsimas, Xuan Vinh Doan, and Jean-Bernard Lasserre. Approximating integrals of multivariate exponentials: A moment approach. Operations Research Letters, 36(2):205–210, 2008.
Rajendra Bhatia. Matrix Analysis, volume 169. Springer Science & Business Media, 2013.
Andrew Blake, Pushmeet Kohli, and Carsten Rother. Markov Random Fields for Vision and Image Processing. MIT Press, 2011.
Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the Symposium on Discrete Algorithms, pages 968–977, 2009.
Michel Broniatowski and Amor Keziou. Minimization of \(\varphi \)-divergences on sets of signed measures. Studia Scientiarum Mathematicarum Hungarica, 43(4):403–442, 2006.
JeanFrançois Cardoso. Dependence, correlation and Gaussianity in independent component analysis. Journal of Machine Learning Research, 4:1177–1203, 2003.
Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120–145, 2011.
Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In Conference on Uncertainty in Artificial Intelligence, pages 109–116, 2010.
Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 1999.
David Cruz-Uribe and C. J. Neugebauer. Sharp error bounds for the trapezoidal rule and Simpson’s rule. Journal of Inequalities in Pure and Applied Mathematics, 3(4), 2002.
Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
Monroe D. Donsker and S. R. Srinivasa Varadhan. Asymptotic evaluation of certain Markov process expectations for large time, III. Communications on Pure and Applied Mathematics, 29(4):389–461, 1976.
Bogdan Dumitrescu. Positive Trigonometric Polynomials and Signal Processing Applications, volume 103. Springer, 2007.
Kun Fang and Hamza Fawzi. The sumofsquares hierarchy on the sphere and applications in quantum information theory. Mathematical Programming, 190(1):331–360, 2021.
Hamza Fawzi and Omar Fawzi. Defining quantum divergences via convex optimization. Quantum, 5:387, 2021.
Walter Gautschi. Numerical Analysis. Springer Science & Business Media, 2011.
Luis Gonzalo Sanchez Giraldo, Murali Rao, and Jose C. Principe. Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory, 61(1):535–548, 2014.
Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
William W. Hager. Minimizing a quadratic over a sphere. SIAM Journal on Optimization, 12(1):188–208, 2001.
Frank Hansen and Gert Kjærgård Pedersen. Jensen’s inequality for operators and Löwner’s theorem. Mathematische Annalen, 258(3):229–241, 1982.
Fumio Hiai and Milán Mosonyi. Different quantum \(f\)-divergences and the reversibility of quantum operations. Reviews in Mathematical Physics, 29(07):1750023, 2017.
Johannes Jahn. Introduction to the Theory of Nonlinear Optimization. Springer, 2020.
Michael I. Jordan and Martin J. Wainwright. Semidefinite relaxations for approximate inference on graphs with cycles. Advances in Neural Information Processing Systems, 16, 2003.
James E. Kelley, Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960.
Jean-Bernard Lasserre. An explicit exact SDP relaxation for nonlinear 0–1 programs. In International Conference on Integer Programming and Combinatorial Optimization, pages 293–303. Springer, 2001.
Jean-Bernard Lasserre. Moments, Positive Polynomials and their Applications, volume 1. World Scientific, 2010.
Monique Laurent. A comparison of the Sherali-Adams, Lovász-Schrijver, and Lasserre relaxations for 0–1 programming. Mathematics of Operations Research, 28(3):470–496, 2003.
Steffen L. Lauritzen. Graphical Models, volume 17. Clarendon Press, 1996.
Friedrich Liese and Igor Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
Friedrich Liese and Igor Vajda. \(f\)-divergences: sufficiency, deficiency and testing of hypotheses. Advances in Inequalities from Probability Theory and Statistics, pages 131–173, 2008.
David G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, 1997.
Keiji Matsumoto. A new quantum version of \(f\)-divergence. In Nagoya Winter Workshop: Reality and Measurement in Algebraic Quantum Theory, pages 229–273. Springer, 2015.
Tom Minka. Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research Ltd, 2005.
Ilya Mironov. Rényi differential privacy. In Computer Security Foundations Symposium, pages 263–275, 2017.
Kevin P. Murphy. Machine Learning: a Probabilistic Perspective. MIT Press, 2012.
Yurii Nesterov and Arkadii Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994.
XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. On surrogate loss functions and \(f\)-divergences. The Annals of Statistics, 37(2):876–904, 2009.
XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
Ryan O’Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.
Anthony O’Hagan. Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 29(3):245–260, 1991.
Pablo A. Parrilo. Semidefinite programming relaxations for semialgebraic problems. Mathematical Programming, 96(2):293–320, 2003.
Antoine Picard-Weibel and Benjamin Guedj. On change of measure inequalities for \(f\)-divergences. Technical Report 2202.05568, arXiv, 2022.
Yury Polyanskiy and Yihong Wu. Information Theory: From Coding to Learning. Cambridge University Press, 2023.
Christian P. Robert and George Casella. Monte Carlo Statistical Methods, volume 2. Springer, 1999.
Ralph Tyrell Rockafellar. Convex Analysis. Princeton University Press, 2015.
Paul Rubenstein, Olivier Bousquet, Josip Djolonga, Carlos Riquelme, and Ilya O. Tolstikhin. Practical and consistent estimation of \(f\)divergences. Advances in Neural Information Processing Systems, 32, 2019.
Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems, 28, 2015.
Igal Sason. On \(f\)-divergences: Integral representations, local behavior, and inequalities. Entropy, 20(5):383, 2018.
Claus Scheiderer. Sums of squares on real algebraic surfaces. Manuscripta Mathematica, 119:395–410, 2006.
Lucas Slot and Monique Laurent. Sumofsquares hierarchies for binary polynomial optimization. Mathematical Programming, pages 1–40, 2022.
Gabor Szegö. Orthogonal Polynomials. American Mathematical Society Colloquium Publications, 1975.
Marco Tomamichel. Quantum Information Processing with Finite Resources: Mathematical Foundations, volume 5. Springer, 2015.
Leslie G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2):189–201, 1979.
Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., 2008.
David V. Widder. The Stieltjes transform. Transactions of the American Mathematical Society, 43(1):7–60, 1938.
Acknowledgements
The author would like to thank Omar Fawzi for discussions related to quantum information divergences, Adrien Taylor and Justin Carpentier for discussions on optimization algorithms, as well as David Holzmüller for providing clarifying comments. The comments of the anonymous reviewers were greatly appreciated. We acknowledge support from the French government under the management of the Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), as well as from the European Research Council (Grant SEQUOIA 724063).
Communicated by Jim Renegar.
Appendices
Appendix A: Newton Method for Computing F
In order to compute \(F(v,w) = \sup _{r \geqslant 0} \frac{ r v + w - f(r)}{r+1}\) in Sect. 2.3, we fix \((v,w)\in \mathbb {{R}}^2\) and define \(\varphi (r) = \frac{ r v + w - f(r)}{r+1}\), which is twice differentiable, with \(\varphi '(r) = \frac{1}{(r+1)^2} \big [ v-w - \big ( (r+1)f'(r) - f(r) \big ) \big ]\). Since the function \(r \mapsto (r+1)f'(r) - f(r) \) has derivative \(r \mapsto (r+1)f''(r)\), which is strictly positive, it is strictly increasing. The derivative \(\varphi '\) thus has at most one zero, and we aim to solve the equation \( (r+1)f'(r) - f(r) = v - w\).
We now focus on the special case \(f(t) = t \log t - t + 1\), for which \( \psi (r) = (r+1)f'(r) - f(r) = r - 1 + \log r\) has full range on \(\mathbb {{R}}_+\), so that the equation above has a unique solution. In order to compute this solution, for \(v-w \geqslant 0\), we iterate Newton’s method [24, Section 4.8] \(r \leftarrow r + \frac{v-w - \psi (r)}{\psi '(r) } = \frac{2 - \log (r) + v-w}{ 1 + 1/r}\) for 5 iterations to reach machine precision, while for \(v-w \leqslant 0\) we iterate Newton’s method on the logarithm of r, \( \log r \leftarrow \log r + \frac{v-w - \psi (r)}{r \psi '(r) } = \frac{ 1 + v-w + ( \log r - 1) r}{ 1 + r} \), also for 5 iterations.
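As an illustration, the two recursions can be sketched in Python (our sketch, not the paper's Matlab routine; the starting points \(r=1\) and \(\log r = 0\) and a fixed iteration count are our choices):

```python
import math

def f(t):
    # KL generator: f(t) = t log t - t + 1
    return t * math.log(t) - t + 1

def F(v, w, iters=10):
    """Compute F(v, w) = sup_{r >= 0} (r v + w - f(r)) / (r + 1)
    by solving psi(r) = r - 1 + log r = v - w with Newton's method."""
    c = v - w
    if c >= 0:
        r = 1.0
        for _ in range(iters):
            # Newton step on r: r <- (2 - log r + c) / (1 + 1/r)
            r = (2 - math.log(r) + c) / (1 + 1 / r)
    else:
        s = 0.0  # s = log r; Newton step on log r keeps the iterate positive
        for _ in range(iters):
            s = (1 + c + (s - 1) * math.exp(s)) / (1 + math.exp(s))
        r = math.exp(s)
    return (r * v + w - f(r)) / (r + 1)
```

In practice a handful of iterations suffices, as each step is a Newton update on a strictly increasing function.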
Appendix B: Decomposition of Operator-Convex Functions
We have the following particular cases from Sect. 2.

- \(\alpha \)-divergences: \(f(t) =\frac{1}{\alpha (\alpha -1)} \big [t^\alpha - \alpha t + (\alpha -1) \big ] = \frac{1}{\alpha } \frac{\sin (\alpha -1)\pi }{(\alpha -1)\pi } (t-1)^2 \int _0^{+\infty } \frac{1}{t+\lambda } \frac{ \lambda ^\alpha d\lambda }{(1+\lambda )^2}\) for \(\alpha \in (-1,2)\). Other representations exist for \(\alpha =-1\) and \(\alpha =2\) (see below), but other cases are not operator-convex.
- KL divergence (\(\alpha =1\)): \(f (t) = t \log t - t + 1 = \int _0^{+\infty } \frac{(t-1)^2}{t+\lambda } \frac{ \lambda d\lambda }{(\lambda +1)^2}\).
- Reverse KL divergence (\(\alpha =0)\): \(f(t) = - \log t + t - 1= \int _0^{+\infty } \frac{(t-1)^2}{t+\lambda } \frac{ d\lambda }{(\lambda +1)^2}\).
- Pearson \(\chi ^2\) divergence (\(\alpha =2\)): \(f(t) = \frac{1}{2} (t-1)^2\) is operator convex.
- Reverse Pearson \(\chi ^2\) divergence (\(\alpha =-1\)): \(f(t) = \frac{1}{2} \big ( \frac{1}{t} + t \big ) - 1 = \frac{1}{2} \frac{(t-1)^2}{t}\) is operator convex, with \(d\nu (\lambda )\) proportional to a Dirac at \(\lambda = 0\).
- Le Cam distance: \(f(t) = \frac{(t-1)^2}{t+1}\) is operator convex, with \(d\nu (\lambda )\) proportional to a Dirac at \(\lambda = 1\).
- Jensen–Shannon divergence: \(f(t) =2 t \log \frac{2t}{t+1} + 2 \log \frac{2}{t+1} =2 t \log t - 2(t+1) \log (t+1) + 2(t+1) \log 2 = 2t \log t - 4 \frac{t+1}{2}\log \frac{t+1}{2} \) is operator convex, as it can be written \( f(t) = 2(t-1)^2 \int _0^{+\infty } \big ( \frac{1}{t+\lambda } - \frac{1}{t+1+2\lambda } \big ) \frac{\lambda d\lambda }{(1+\lambda )^2} \), which leads to \( f(t) =2 (t-1)^2 \int _0^{+\infty } \frac{1}{t+\lambda } \frac{\lambda d\lambda }{(1+\lambda )^2} - 2 (t-1)^2 \int _1^{+\infty } \frac{1}{t+\lambda } \frac{(\lambda -1) d\lambda }{(1+\lambda )^2}\).
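These representations can be checked numerically; for instance, the KL case \(t \log t - t + 1 = \int _0^{+\infty } \frac{(t-1)^2}{t+\lambda } \frac{\lambda \, d\lambda }{(\lambda +1)^2}\) can be verified with standard one-dimensional quadrature (a sanity check of ours, assuming `scipy` is available):

```python
import math
from scipy.integrate import quad

def kl_generator(t):
    # f(t) = t log t - t + 1
    return t * math.log(t) - t + 1

def kl_integral_rep(t):
    # f(t) = int_0^inf (t-1)^2 / (t+lam) * lam / (1+lam)^2 dlam
    integrand = lambda lam: (t - 1) ** 2 / (t + lam) * lam / (1 + lam) ** 2
    value, _ = quad(integrand, 0, math.inf)
    return value
```

The integrand decays like \(\lambda ^{-2}\), so adaptive quadrature on \([0,+\infty )\) handles it without any change of variables.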
Appendix C: Proofs of Lemmas 1 and 2
In this section, we prove Lemma 1 from Sect. 7, taken from [40, Section 9.1] and reproduced here for completeness, together with the expression of the maximizers. We also prove the extension, Lemma 2.
We start from the dual formulation:
which can be solved in closed form.
Indeed, given the eigendecomposition \(B^{-1/2} A B^{-1/2} = \sum _{i =1}^d \lambda _i u_i u_i^*\), we consider \(\Lambda = \sum _{i =1}^d B^{1/2} u_i u_i^*B^{1/2} \delta _{\lambda _i}\), where \(\delta _{\lambda _i}\) is the Dirac measure at \(\lambda _i\); this yields a feasible measure \(\Lambda \), with objective equal to \(\sum _{i=1}^d f(\lambda _i) \mathop { \mathrm tr}\big [B^{1/2} V B^{1/2} u_i u_i^*\big ] = \mathop { \mathrm tr}\big [ B^{1/2} V B^{1/2} f \big ( B^{-1/2}AB^{-1/2} \big ) \big ]\). Thus the infimum is at most \( \mathop { \mathrm tr}\big [ B^{1/2} V B^{1/2} f \big ( B^{-1/2}A B^{-1/2} \big ) \big ]\).
The other direction is a direct consequence of the operator Jensen inequality [28]: for any feasible measure \(\Lambda \) approximated by an empirical measure \(\sum _{i =1}^m M_i \delta _{r_i}\), with \(M_i \succcurlyeq 0\), we have \(\sum _{i =1}^m ( M_i^{1/2} B^{-1/2} )^*( M_i^{1/2} B^{-1/2} ) = I\), and thus
The lower bound follows by letting the number m of Diracs go to infinity to tightly approximate any feasible matrix-valued measure \(\Lambda \).
In order to obtain the minimizers M and N, we simply notice that they are the gradients of the function \((A,B) \mapsto \mathop { \mathrm tr}\big [ B f \big ( B^{-1/2}A B^{-1/2} \big ) \big ]\) with respect to A and B. We thus get, using gradients of spectral functions:
with the convention that for \(\lambda _i = \lambda _j\), \( \frac{f(\lambda _i)-f(\lambda _j)}{\lambda _i - \lambda _j} = f'(\lambda _i)\). To obtain \(N^*\), we simply consider the identity \(\mathop { \mathrm tr}\big [ B f \big ( B^{-1/2}A B^{-1/2} \big ) \big ] = \mathop { \mathrm tr}\big [ A g \big ( A^{-1/2}B A^{-1/2} \big ) \big ]\) for \(g(t) = t f(1/t)\), and use the corresponding formula.
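As a concrete numerical check (ours, for the case \(V = I\)), the closed form \(\mathop { \mathrm tr}\big [ B f \big ( B^{-1/2}A B^{-1/2} \big ) \big ]\) can be evaluated by eigendecomposition; for commuting (diagonal) A and B it reduces to the classical f-divergence \(\sum _i q_i f(p_i/q_i)\):

```python
import numpy as np

def quantum_f_divergence(A, B, f):
    """Evaluate tr[ B f(B^{-1/2} A B^{-1/2}) ] for positive definite A, B,
    applying f spectrally (the V = I case of the lemma)."""
    w, U = np.linalg.eigh(B)
    B_inv_half = U @ np.diag(w ** -0.5) @ U.T
    M = B_inv_half @ A @ B_inv_half
    M = (M + M.T) / 2  # symmetrize against round-off
    lam, V = np.linalg.eigh(M)
    fM = V @ np.diag(f(lam)) @ V.T
    return float(np.trace(B @ fM))
```

Since f is nonnegative on \(\mathbb {{R}}_+\) for the divergences considered here, the value is always nonnegative, as the trace of a product of positive semidefinite matrices.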
Proof of Lemma 2
We simply need to modify slightly the proof of the lemma above, replacing \(\int _0^{+\infty }\! f(r) \mathop { \mathrm tr}\big [ d\Lambda (r) \big ] \) by \(\int _0^{+\infty }\! f(r) \mathop { \mathrm tr}\big [ V d\Lambda (r) \big ]\); all inequalities are preserved because \(V \succcurlyeq 0\).
Appendix D: Computing Integrals
Beyond computing the divergences themselves and the associated log-partition functions, a third task is related to f-divergences. In this section, we consider the task of computing \( \int _{\mathcal {X}} f^*( h(x)) dq(x)\), where q is a finite positive measure on \(\mathcal {X}\), \(f^*\) is the Fenchel conjugate of f, and \(h: \mathcal {X}\rightarrow \mathbb {{R}}\) is an arbitrary function (such that the integral is finite). The difference with computing log-partition functions in Eq. (24) is minor, and we thus extend only a few results from Sects. 8 and 9, most without proofs, as they follow the same lines as the results for f-partition functions.
For \(f(t) = t \log t - t + 1\), we have \(f^*(u) =e^u - 1\), and we thus aim at estimating integrals of exponential functions, a classical task in probabilistic modelling (see [43, 62] and references therein), which up to a logarithm is the same as computing the log-partition function; the two tasks differ, however, for other functions f.
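The conjugate \(f^*(u) = e^u - 1\) can be recovered numerically from the definition \(f^*(u) = \sup _{t \geqslant 0} \{ u t - f(t) \}\); a small self-contained check (ours):

```python
import math

def f(t):
    # KL generator: f(t) = t log t - t + 1
    return t * math.log(t) - t + 1

def fenchel_conjugate(u, grid):
    # f*(u) = sup_t { u t - f(t) }, approximated on a finite grid of t values
    return max(u * t - f(t) for t in grid)
```

The supremum is attained at \(t = e^u\) (setting the derivative \(u - \log t\) to zero), so the grid must cover that point.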
This computational task can be classically related to f-divergences by Fenchel duality, as we have:
where the only difference with Eq. (23) is that p is not assumed to sum to one. Below, we show that for functions h(x) which are quadratic forms in \(\varphi (x)\), we can replace \(D(p\Vert q)\) by the lower bound defined above, and obtain a computable upper bound on the integral.
We also have the representation corresponding to Eq. (25), that will be useful later:
1.1 Related Work
There exist many ways of estimating integrals, in particular over compact sets in small dimensions, where various quadrature rules, such as the trapezoidal or Simpson’s rule, can be applied to compute integrals from function evaluations, with well-defined convergence rates [18]. In higher dimensions, still based on function evaluations, Bayes-Hermite quadrature rules [48] and the related kernel quadrature rules [5, 16] come with precise convergence rates linking approximation error to the number of function evaluations [3]. An alternative in our context is Monte Carlo integration from samples from q [52], with convergence rate in \(O(1/\sqrt{n})\) from n function evaluations.
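As a point of comparison (our toy example, not from the paper), plain Monte Carlo applies directly to \(\int _{\mathcal {X}} f^*(h(x))\, dq(x)\); here q is taken to be a standard Gaussian and \(h(x) = ax\), so that with \(f^*(u) = e^u - 1\) the integral has the closed form \(e^{a^2/2} - 1\):

```python
import numpy as np

def mc_integral(h, fstar, n, rng):
    """Monte Carlo estimate of int f*(h(x)) dq(x) for q = N(0, 1),
    with O(1/sqrt(n)) error from n function evaluations."""
    x = rng.standard_normal(n)
    return float(np.mean(fstar(h(x))))

rng = np.random.default_rng(0)
a = 0.5
estimate = mc_integral(lambda x: a * x, lambda u: np.exp(u) - 1, 200_000, rng)
exact = np.exp(a ** 2 / 2) - 1  # Gaussian moment generating function minus one
```

The estimator is unbiased, and its standard error decays as \(1/\sqrt{n}\), in contrast to the relaxation-based bounds of the next subsection, which exploit the structure of h.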
In this paper, we follow [8] and consider computing integrals given a specific knowledge of the integrand, here of the form \(f^*(h(x))\), where h is a known quadratic form in a feature vector \(\varphi (x)\). While we also use a sum-of-squares approach as in [8], we rely on different tools (link with f-divergences and partition functions rather than integration by parts).
1.2 Relaxations
In order to compute integrals, we simply use the same technique but without the constraint that measures sum to one, that is, without the constraint that \(\mathop { \mathrm tr}[A]=1\). Starting from Eq. (33), we get, with \(B = \Sigma _q\):
Note that we only have an inequality here because we are not optimizing over q. We then get two computable relaxations by considering \(D^{\textrm{SOS}}(A\Vert B) \) and \(D^{\textrm{QT}}(A\Vert B) \) instead of \(D^{\textrm{OPT}}(A\Vert B) \), with the respective formulations:
Dual formulations and algorithms can then easily be derived.
Appendix E: Dual Formulations
In this appendix, we present dual variational formulations for most of the formulations proposed in the main paper. We only consider the relaxations of f-divergences; formulations for f-partition functions can be derived similarly.
1.1 f-Divergences
We can consider the Lagrangian dual of Eq. (3), by introducing a Lagrange multiplier \(\lambda \) for the infinite-dimensional constraint \(\forall x \in \mathcal {X}, \forall r \geqslant 0, \ r v(x)+w(x) \leqslant f(r)\) in the form of a positive finite measure \(\lambda \) on \(\mathcal {X}\times \mathbb {{R}}_+\) [30]. We then obtain, using strong duality for equality constraints [39, Section 8.6]:
with the two constraints resulting from the maximization with respect to v and w.
1.2 Optimal Relaxation of f-Divergences (\(D^{\textrm{OPT}}\))
We can also formulate Eq. (15) as a minimization problem; we can use Lagrangian duality, akin to Eq. (34). This requires introducing a Lagrange multiplier for the constraint \(\forall r \geqslant 0, \ \Gamma (rM+N) \leqslant f(r)\), which is equivalent to \( \forall r \geqslant 0, \ f(r) - rM - N \in \widehat{\mathcal {C}}^*\), and which leads to a \({\mathcal {C}}\)-valued finite measure on \(\mathbb {{R}}_+\) [30], to get:
1.3 SOS Relaxation of f-Divergences (\(D^{\textrm{SOS}}\))
We can also get a formulation for Eq. (17), akin to Eqs. (34) and (35). This requires introducing a Lagrange multiplier for the constraint \( \forall r \geqslant 0, \ f(r) U - rM - N \in \widehat{\mathcal {C}}^*\), which is a \(\widehat{\mathcal {C}}\)-valued finite measure on \(\mathbb {{R}}_+\) [30], to get:
as maximizing out \(M,N \in \mathbb {H}_d\) introduces linear constraints.
1.4 Quantum Relaxations of f-Divergences (\(D^{\textrm{QT}}\))
Equation (21) also has a primal–dual form as
We also have a formulation akin to Eq. (36), that is, adding a Lagrange multiplier \(\Sigma \in {\mathcal {V}}\) in Eq. (37) for the constraint \(I - V \in {\mathcal {V}}^\perp \):
which shows the additional relaxation compared to \(D^{\textrm{SOS}}(A\Vert B)\), for which \(\Lambda \) is a measure (almost everywhere) valued in \({\mathcal {V}}\), while here it is only in \(\mathbb {H}_d^+\).
Bach, F. Sum-of-Squares Relaxations for Information Theory and Variational Inference. Found Comput Math (2024). https://doi.org/10.1007/s10208-024-09651-0
Keywords
 Information theory
 Probabilistic variational inference
 Sum-of-squares
 Entropy
 f-divergences
 Semidefinite programming
 Quantum entropies