Matrix versions of the Hellinger distance

Abstract

On the space of positive definite matrices, we consider distance functions of the form \(d(A,B)=\left[ \mathrm{tr}\mathcal {A}(A,B)-\mathrm{tr}\mathcal {G}(A,B)\right] ^{1/2},\) where \(\mathcal {A}(A,B)\) is the arithmetic mean and \(\mathcal {G}(A,B)\) is one of the different versions of the geometric mean. When \(\mathcal {G}(A,B)=A^{1/2}B^{1/2}\) this distance is \(\Vert A^{1/2}-B^{1/2}\Vert _2,\) and when \(\mathcal {G}(A,B)=(A^{1/2}BA^{1/2})^{1/2}\) it is the Bures–Wasserstein metric. We study two other cases: \(\mathcal {G}(A,B)=A^{1/2}(A^{-1/2}BA^{-1/2})^{1/2}A^{1/2},\) the Pusz–Woronowicz geometric mean, and \(\mathcal {G}(A,B)=\exp \big (\frac{\log A+\log B}{2}\big ),\) the log Euclidean mean. With these choices, \(d(A,B)\) is no longer a metric, but it turns out that \(d^2(A,B)\) is a divergence. We establish some (strict) convexity properties of these divergences. We obtain characterisations of barycentres of m positive definite matrices with respect to these distance measures. One of these leads to a new interpretation of a power mean introduced by Lim and Palfia, as a barycentre. The other uncovers interesting relations between the log Euclidean mean and relative entropy.
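For concreteness, the four choices of \(\mathcal {G}\) and the associated quantity \(d^2(A,B)=\mathrm{tr}\,\mathcal {A}(A,B)-\mathrm{tr}\,\mathcal {G}(A,B)\) can be evaluated numerically. The sketch below is ours, not from the paper; it assumes NumPy, and the helper `apply_pd` (a hypothetical name) applies a scalar function to a symmetric positive definite matrix through its eigendecomposition.

```python
import numpy as np

def apply_pd(A, f):
    # apply a scalar function to a symmetric positive definite matrix
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.T

sq = lambda A: apply_pd(A, np.sqrt)

rng = np.random.default_rng(0)
def random_pd(n):
    X = rng.standard_normal((n, n))
    return X @ X.T + n * np.eye(n)   # well-conditioned SPD matrix

A, B = random_pd(3), random_pd(3)
arith = 0.5 * (A + B)                # arithmetic mean

inv_sq_A = apply_pd(A, lambda w: 1.0 / np.sqrt(w))   # A^{-1/2}
means = {
    "A^{1/2}B^{1/2}":    sq(A) @ sq(B),
    "Bures-Wasserstein": sq(sq(A) @ B @ sq(A)),
    "Pusz-Woronowicz":   sq(A) @ sq(inv_sq_A @ B @ inv_sq_A) @ sq(A),
    "log Euclidean":     apply_pd(0.5 * (apply_pd(A, np.log)
                                         + apply_pd(B, np.log)), np.exp),
}
for name, G in means.items():
    d2 = np.trace(arith) - np.trace(G)
    print(f"{name:20s} d^2 = {d2:.6f}")
```

In each case \(d^2(A,B)\geqslant 0\): for the first mean this follows from the Cauchy–Schwarz inequality for traces, and for the log Euclidean mean from the Golden–Thompson inequality combined with the first case.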


Change history

  • 11 October 2019

    Theorem 9 in our paper [1] is wrong. The statement should be replaced by the following.

References

  1. Abatzoglou, T.J.: Norm derivatives on spaces of operators. Math. Ann. 239, 129–135 (1979)
  2. Agueh, M., Carlier, G.: Barycenters in the Wasserstein space. SIAM J. Math. Anal. 43, 904–924 (2011)
  3. Aiken, J.G., Erdos, J.A., Goldstein, J.A.: Unitary approximation of positive operators. Illinois J. Math. 24, 61–72 (1980)
  4. Amari, S.: Information Geometry and its Applications. Springer, Tokyo (2016)
  5. Ando, T.: Concavity of certain maps on positive definite matrices and applications to Hadamard products. Linear Algebra Appl. 26, 203–241 (1979)
  6. Ando, T., Li, C.-K., Mathias, R.: Geometric means. Linear Algebra Appl. 385, 305–334 (2004)
  7. Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM J. Matrix Anal. Appl. 29, 328–347 (2007)
  8. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)
  9. Barbaresco, F.: Innovative tools for radar signal processing based on Cartan’s geometry of SPD matrices and information geometry. In: IEEE Radar Conference, Rome (2008)
  10. Bauschke, H.H., Borwein, J.M.: Legendre functions and the method of random Bregman projections. J. Convex Anal. 4, 27–67 (1997)
  11. Bauschke, H.H., Borwein, J.M.: Joint and separate convexity of the Bregman distance. Stud. Comput. Math. 8, 23–36 (2001)
  12. Bengtsson, I., Zyczkowski, K.: Geometry of Quantum States: An Introduction to Quantum Entanglement. Cambridge University Press, Cambridge (2006)
  13. Bhagwat, K.V., Subramanian, R.: Inequalities between means of positive operators. Math. Proc. Camb. Philos. Soc. 83, 393–401 (1978)
  14. Bhatia, R.: Matrix Analysis. Springer, New York (1997)
  15. Bhatia, R.: Positive Definite Matrices. Princeton University Press, Princeton (2007)
  16. Bhatia, R.: The Riemannian mean of positive matrices. In: Nielsen, F., Bhatia, R. (eds.) Matrix Information Geometry, pp. 35–51. Springer, Heidelberg (2013)
  17. Bhatia, R., Grover, P.: Norm inequalities related to the matrix geometric mean. Linear Algebra Appl. 437, 726–733 (2012)
  18. Bhatia, R., Jain, T., Lim, Y.: On the Bures–Wasserstein distance between positive definite matrices. Expos. Math. (2018). https://doi.org/10.1016/j.exmath.2018.01.002
  19. Bhatia, R., Jain, T., Lim, Y.: Strong convexity of sandwiched entropies and related optimization problems. Rev. Math. Phys. 30, 1850014 (2018)
  20. Carlen, E.A., Lieb, E.H.: A Minkowski type trace inequality and strong subadditivity of quantum entropy. Adv. Math. Sci. AMS Transl. 180, 59–68 (1999)
  21. Carlen, E.A., Lieb, E.H.: A Minkowski type trace inequality and strong subadditivity of quantum entropy. II. Convexity and concavity. Lett. Math. Phys. 83, 107–126 (2008)
  22. Chebbi, Z., Moakher, M.: Means of Hermitian positive-definite matrices based on the log-determinant \(\alpha \)-divergence function. Linear Algebra Appl. 436, 1872–1889 (2012)
  23. Dhillon, I.S., Tropp, J.A.: Matrix nearness problems with Bregman divergences. SIAM J. Matrix Anal. Appl. 29, 1120–1146 (2004)
  24. Fletcher, P., Joshi, S.: Riemannian geometry for the statistical analysis of diffusion tensor data. Signal Process. 87, 250–262 (2007)
  25. Hiai, F., Mosonyi, M., Petz, D., Beny, C.: Quantum f-divergences and error correction. Rev. Math. Phys. 23, 691–747 (2011)
  26. Jencová, A.: Geodesic distances on density matrices. J. Math. Phys. 45, 1787–1794 (2004)
  27. Jencová, A., Ruskai, M.B.: A unified treatment of convexity of relative entropy and related trace functions with conditions for equality. Rev. Math. Phys. 22, 1099–1121 (2010)
  28. Lim, Y., Palfia, M.: Matrix power means and the Karcher mean. J. Funct. Anal. 262, 1498–1514 (2012)
  29. Modin, K.: Geometry of matrix decompositions seen through optimal transport and information geometry. J. Geom. Mech. 9, 335–390 (2017)
  30. Nielsen, F., Bhatia, R. (eds.): Matrix Information Geometry. Springer, Heidelberg (2013)
  31. Nielsen, F., Boltz, S.: The Burbea–Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 57, 5455–5466 (2011)
  32. Pitrik, J., Virosztek, D.: On the joint convexity of the Bregman divergence of matrices. Lett. Math. Phys. 105, 675–692 (2015)
  33. Pusz, W., Woronowicz, S.L.: Functional calculus for sesquilinear forms and the purification map. Rep. Math. Phys. 8, 159–170 (1975)
  34. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
  35. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Springer, Berlin (1998)
  36. Sra, S.: Positive definite matrices and the \(S\)-divergence. Proc. Am. Math. Soc. 144, 2787–2797 (2016)
  37. Takatsu, A.: Wasserstein geometry of Gaussian measures. Osaka J. Math. 48, 1005–1026 (2011)


Acknowledgements

The authors thank F. Hiai and S. Sra for helpful comments and references, and the anonymous referee for a careful reading of the manuscript. The first author is grateful to INRIA and Ecole polytechnique, Palaiseau for visits that facilitated this work, and to CSIR(India) for the award of a Bhatnagar Fellowship.

Author information


Correspondence to Tanvi Jain.

Appendices

Appendix A. Proof of Lemma 5

We make a variation of the proof of Theorem 3.12 in [10], dealing with a related problem (the minimisation of \(\Phi \) over a closed convex set).

Since \(\varphi \) is of Legendre type, Theorem 3.7(iii) of [10] shows that for all \(a\in {\text {int}}{\text {dom}}\varphi \), the map \(x\mapsto \Phi (x,a)\) is coercive, meaning that \(\mathrm{lim}_{\Vert x\Vert \rightarrow \infty } \Phi (x,a)=+\infty \). A sum of coercive functions is coercive, and so the map

$$\begin{aligned} \Psi (x):= \sum _{j=1}^m\frac{1}{m}\Phi (x,a_j) \end{aligned}$$

is coercive. The infimum of a coercive lower-semicontinuous function on a closed non-empty set is attained, so there is an element \(\bar{x}\in {\text {clo}}{\text {int}}{\text {dom}}\varphi \) such that \(\inf _{x\in {\text {clo}}{\text {int}}{\text {dom}}\varphi } \Psi (x)=\Psi (\bar{x})<+\infty \). Suppose that \(\bar{x}\) belongs to the boundary of \({\text {int}}{\text {dom}}\varphi \). Let us fix an arbitrary \(z\in {\text {int}}{\text {dom}}\varphi \), and let \(g(t) :=\Psi ((1-t)\bar{x}+t z)\), defined for \(t\in [0,1)\). We have

$$\begin{aligned} g'(t)=\langle \nabla \varphi ((1-t)\bar{x}+t z)-\sum _{j=1}^m \frac{1}{m} \nabla \varphi (a_j), z-\bar{x}\rangle . \end{aligned}$$

Using property (iv) of the definition of Legendre type functions, we obtain that \(\mathrm{lim}_{t\rightarrow 0^+}g'(t)=-\infty \), which entails that \(g(t)<g(0)=\Psi ({\bar{x}})\) for t small enough. Since \((1-t)\bar{x}+t z\in {\text {int}}{\text {dom}}\varphi \) for all \(t\in (0,1)\), this contradicts the optimality of \({\bar{x}}\). So \({\bar{x}}\in {\text {int}}{\text {dom}}\varphi \), which proves Lemma 5.

Appendix B. Examples

In the last statement of Theorem 6, dealing with tracial convex functions, we required \(\varphi \) to be differentiable and strictly convex on \(\mathbb {P}\). In the second statement, dealing with the non tracial case, we made a stronger assumption, requiring \(\varphi \) to be of Legendre type. We now give an example showing that the Legendre condition cannot be dispensed with. To this end, it is convenient to construct first an example showing the tightness of Lemma 5.

Need for the Legendre condition in Lemma 5

Let us fix \(N>3\), let \(e=(1,1)^\top \in \mathbb {R}^2\),

$$\begin{aligned} L=\left( \begin{array}{cc} N-1 &{}\quad -2 \\ -2 &{}\quad N-1 \end{array}\right) \end{aligned}$$
(59)

and consider the affine transformation \(g(x)=e+Lx\). Let \( a = (N,0)^\top \), \( b = (0 , N)^\top \), and

$$\begin{aligned}&{\bar{a}}:= g^{-1}( a)= \frac{1}{N^2-2N-3}\left( \begin{array}{c} N^2-2N-1 \\ N-1 \end{array} \right) , \\&{\bar{b}}:=g^{-1}(b)= \frac{1}{N^2-2N-3} \left( \begin{array}{c}N-1 \\ N^2-2N-1 \end{array} \right) . \end{aligned}$$

Observe that \({\bar{a}}, {\bar{b}}\in \mathbb {R}_{++}^2\) since \(N>3\).
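The closed forms for \({\bar{a}}\) and \({\bar{b}}\) can be checked numerically; the sketch below (ours, assuming NumPy and the representative value \(N=4\)) solves \(g(x)=a\) and \(g(x)=b\) directly.

```python
import numpy as np

N = 4.0                      # any N > 3 works
e = np.ones(2)
L = np.array([[N - 1, -2.0], [-2.0, N - 1]])
a, b = np.array([N, 0.0]), np.array([0.0, N])

# g(x) = e + Lx, so g^{-1}(y) = L^{-1}(y - e)
abar = np.linalg.solve(L, a - e)
bbar = np.linalg.solve(L, b - e)

den = N**2 - 2*N - 3
assert np.allclose(abar, np.array([N**2 - 2*N - 1, N - 1]) / den)
assert np.allclose(bbar, abar[::-1])             # b-bar is a-bar with coordinates swapped
assert (abar > 0).all() and (bbar > 0).all()     # both lie in R^2_{++} since N > 3
```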

Consider now, for \(p>1\), the map \(\varphi (x):=\Vert x\Vert _p^p=|x_1|^p+|x_2|^p\) defined on \(\mathbb {R}^2\) and \({\bar{\varphi }}(x)=\varphi (g(x))\). Observe that \(\varphi \) is strictly convex and differentiable. Let \({\bar{\Phi }}\) denote the Bregman divergence associated with \({\bar{\varphi }}\), and let \({\bar{\Psi }}(x):= \frac{1}{2}({\bar{\Phi }}(x,{\bar{a}})+{\bar{\Phi }}(x,{\bar{b}}))\). We claim that 0 is the unique point of minimum of \({\bar{\Psi }}\) over \(\mathbb {R}_{+}^2\). Indeed,

$$\begin{aligned} \nabla {\bar{\Psi }}(x)&=L^\top (\nabla \varphi (g(x)))-\frac{1}{2} \Big (L^\top (\nabla \varphi ( a))+ L^\top (\nabla \varphi ( b))\Big ), \end{aligned}$$

from which we obtain

$$\begin{aligned} \nabla {\bar{\Psi }}(0)&= L(p(1 -N^{p-1}/2) e)=(N-3)p(1-N^{p-1}/2)e. \end{aligned}$$

It follows that \(\nabla {\bar{\Psi }}(0)\in \mathbb {R}_{++}^2\) if \(p>1\) is chosen close enough to 1, so that \(1-N^{p-1}/2>0\). Then, since \({\bar{\Psi }}\) is convex, we have

$$\begin{aligned} {\bar{\Psi }}(x)-{\bar{\Psi }}(0)\geqslant \langle \nabla {\bar{\Psi }}(0),x\rangle >0, \quad \text { for all } x\in \mathbb {R}_{+}^2{\setminus }\{0\} \end{aligned}$$
(60)

showing the claim.
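The scalar example admits a direct numerical check. The sketch below (ours, not part of the original argument) takes the representative values \(N=4\) and \(p=1.2\), which satisfy \(N>3\) and \(1-N^{p-1}/2>0\), and verifies both the formula for \(\nabla {\bar{\Psi }}(0)\) and the claim that 0 minimises \({\bar{\Psi }}\) on the quadrant.

```python
import numpy as np

N, p = 4.0, 1.2                      # N > 3, and 1 - N^(p-1)/2 > 0
e = np.ones(2)
L = np.array([[N - 1, -2.0], [-2.0, N - 1]])
a, b = np.array([N, 0.0]), np.array([0.0, N])
abar, bbar = np.linalg.solve(L, a - e), np.linalg.solve(L, b - e)

phi = lambda x: np.sum(np.abs(x) ** p)           # |x_1|^p + |x_2|^p
def grad_phi(x):
    x = np.where(np.abs(x) < 1e-9, 0.0, x)       # clamp rounding noise at 0
    return p * np.sign(x) * np.abs(x) ** (p - 1)

g = lambda x: e + L @ x
grad_phibar = lambda x: L @ grad_phi(g(x))       # chain rule; L is symmetric

def breg(x, u):                                  # Bregman divergence of phi∘g
    return phi(g(x)) - phi(g(u)) - grad_phibar(u) @ (x - u)

Psi = lambda x: 0.5 * (breg(x, abar) + breg(x, bbar))

grad0 = grad_phibar(np.zeros(2)) - 0.5 * (grad_phibar(abar) + grad_phibar(bbar))
assert np.allclose(grad0, (N - 3) * p * (1 - N**(p - 1) / 2) * e)
assert (grad0 > 0).all()                         # gradient at 0 lies in R^2_{++}

rng = np.random.default_rng(0)
for x in rng.uniform(0.0, 10.0, size=(200, 2)):  # 0 minimises Psi on the quadrant
    assert Psi(x) > Psi(np.zeros(2))
```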

Consider now the modification \(\hat{\varphi }\) of \(\bar{\varphi }\), so that \(\hat{\varphi }(x)=\bar{\varphi }(x)\) for \(x\in \mathbb {R}_{+}^2\), and \(\hat{\varphi }(x)=+\infty \) otherwise. The function \(\hat{\varphi }\) is strictly convex, lower-semicontinuous, and differentiable on the interior of its domain, but not of Legendre type, and the conclusion of Lemma 5 does not apply to it.

The geometric intuition leading to this example is described in the figure.

Figure: The example illustrated. The point u is the unconstrained minimum of the sum of Bregman divergences \(\Psi (x):=\Phi (x,a)+\Phi (x,b)\) associated with \(\varphi (x)=x_1^p+x_2^p\), here \(p=1.2\). Level curves of \(\Psi \) are shown. The minimum of \(\Psi \) on the simplicial cone C is at the unit vector e. An affine change of variables sending C to the standard quadrant, and a lift to the cone of positive semidefinite matrices, lead to Proposition 11.

Need for the Legendre condition in Theorem 6

We next construct an example showing that the Legendre condition in the second statement of Theorem 6 cannot be dispensed with. Observe that the inverse of the linear operator L in (59) is given by

$$\begin{aligned} L^{-1} = \frac{1}{N^2-2N-3}\left( \begin{array}{c@{\quad }c} N-1 &{} 2 \\ 2 &{} N-1 \end{array}\right) . \end{aligned}$$

In particular, it is a nonnegative matrix.

We set \(\tau =\left( {\begin{matrix}0&{}1\\ 1&{} 0\end{matrix}}\right) \), and consider the “quantum” analogue of L, i.e.

$$\begin{aligned} T(X)=(N-1)X- 2 \tau X \tau . \end{aligned}$$

Then,

$$\begin{aligned} T^{-1}(X)= \frac{1}{N^2-2N-3}\big ((N-1)X+ 2\tau X\tau \big ) \end{aligned}$$

is a completely positive map leaving \(\mathbb {P}\) invariant. The analogue of the map g is

$$\begin{aligned} G(X) =I + T(X) \end{aligned}$$

where I denotes the identity matrix.
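One can verify numerically that the stated formula for \(T^{-1}\) inverts T, and that \(T^{-1}\) is a positive map (it is a sum of the maps \(X\mapsto cX\) and \(X\mapsto c\,\tau X\tau \) with \(c>0\)). A quick sketch, ours, with the representative value \(N=4\):

```python
import numpy as np

N = 4.0                      # any N > 3
tau = np.array([[0.0, 1.0], [1.0, 0.0]])

T    = lambda X: (N - 1) * X - 2 * tau @ X @ tau
Tinv = lambda X: ((N - 1) * X + 2 * tau @ X @ tau) / (N**2 - 2*N - 3)

rng = np.random.default_rng(0)
for _ in range(100):
    Y = rng.standard_normal((2, 2))
    H = Y + Y.T                                   # Hermitian test matrix
    assert np.allclose(Tinv(T(H)), H)             # Tinv really inverts T
    P = Y @ Y.T                                   # positive semidefinite
    assert np.linalg.eigvalsh(Tinv(P)).min() >= -1e-12   # Tinv(P) stays PSD
```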

We now consider the map \(\varphi (X):= \Vert X\Vert _p^p={\text {tr}}(|X|^p)\) defined on the space of Hermitian matrices. The function \(\varphi \) is differentiable and strictly convex, still assuming that \(p>1\). We set \({\bar{A}}:= {\text {diag}}({\bar{a}})\in \mathbb {P}\), \({\bar{B}}:={\text {diag}}({\bar{b}})\in \mathbb {P}\), and now define \({\bar{\Phi }}\) to be the Bregman divergence associated with \({\bar{\varphi }}:= \varphi \circ G\). Let

$$\begin{aligned} {\bar{\Psi }}(X):= \frac{1}{2}\Big ( {\bar{\Phi }}(X,{\bar{A}})+ {\bar{\Phi }}(X,{\bar{B}}) \Big ). \end{aligned}$$

We then have the following result.

Proposition 11

The minimum of the function \({\bar{\Psi }}\) on the closure of \(\mathbb {P}\) is achieved at the point 0. Moreover, the equation

$$\begin{aligned} \nabla \bar{\varphi }(X)=\frac{1}{2}(\nabla \bar{\varphi }(\bar{A}) + \nabla \bar{\varphi }(\bar{B})) \end{aligned}$$
(61)

has no solution X in \(\mathbb {P}\). \(\square \)

Proof

From [3] (Theorem 2.1) or [1] (Theorem 2.3), we have

$$\begin{aligned} \frac{\mathrm{d}}{\mathrm{d}t}\Big |_{t=0} {\text {tr}}\,|X+tY|^p = p\, {\text {Re}}\, {\text {tr}}\, |X|^{p-1}U^*Y \end{aligned}$$

where \(X=U|X|\) is the polar decomposition of X. In particular, if X is diagonal and positive semidefinite,

$$\begin{aligned} \nabla \varphi (X) = pX^{p-1}. \end{aligned}$$

Then, by a computation similar to the one in the scalar case above, we obtain

$$\begin{aligned} \nabla {\bar{\Psi }}(0) = (N-3)p(1-N^{p-1}/2)I \in \mathbb {P}. \end{aligned}$$

We conclude, as in (60), that

$$\begin{aligned} {\bar{\Psi }}(X)-{\bar{\Psi }}(0)\geqslant \langle \nabla {\bar{\Psi }}(0),X\rangle >0, \quad \text { for all } X\in {\text {clo}}\mathbb {P}{\setminus }\{0\}, \end{aligned}$$

where now \(\langle \cdot ,\cdot \rangle \) is the Frobenius scalar product on the space of Hermitian matrices. It follows that 0 is the unique point of minimum of \({\bar{\Psi }}\) on \({\text {clo}}\mathbb {P}\).

Moreover, if Eq. (61) had a solution \(X\in \mathbb {P}\), the first-order optimality condition for the minimisation of the function \({\bar{\Psi }}\) over \(\mathbb {P}\) would be satisfied, showing that \({\bar{\Psi }}(Y)\geqslant {\bar{\Psi }}(X)\) for all \(Y\in \mathbb {P}\), and by density, \({\bar{\Psi }}(0)\geqslant {\bar{\Psi }}(X)\), contradicting the fact that 0 is the unique point of minimum of \({\bar{\Psi }}\) over \({\text {clo}}\mathbb {P}\). \(\square \)
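The matrix computation in the proof can also be checked numerically. The sketch below (ours, with the representative values \(N=4\), \(p=1.2\)) uses the gradient formula \(\nabla \varphi (X)=pX^{p-1}\) for positive semidefinite X and the self-adjointness of T, so that \(\nabla {\bar{\varphi }}(X)=T(\nabla \varphi (G(X)))\).

```python
import numpy as np

N, p = 4.0, 1.2          # N > 3 and 1 - N^(p-1)/2 > 0, as in the text
tau = np.array([[0.0, 1.0], [1.0, 0.0]])
I2 = np.eye(2)

T = lambda X: (N - 1) * X - 2 * tau @ X @ tau    # self-adjoint map
G = lambda X: I2 + T(X)

def phi(X):                                      # tr |X|^p for Hermitian X
    return np.sum(np.abs(np.linalg.eigvalsh(X)) ** p)

def grad_phi_psd(X):                             # p X^(p-1) for PSD X
    w, V = np.linalg.eigh(X)
    w = np.where(w < 1e-9, 0.0, w)               # clamp rounding noise at the boundary
    return (V * (p * w ** (p - 1))) @ V.T

den = N**2 - 2*N - 3
abar = np.array([N**2 - 2*N - 1, N - 1]) / den
Abar, Bbar = np.diag(abar), np.diag(abar[::-1])

grad_phibar = lambda X: T(grad_phi_psd(G(X)))    # chain rule; T self-adjoint

def breg(X, A):                                  # Bregman divergence of phi∘G
    return phi(G(X)) - phi(G(A)) - np.sum(grad_phibar(A) * (X - A))

Psi = lambda X: 0.5 * (breg(X, Abar) + breg(X, Bbar))

Z = np.zeros((2, 2))
grad0 = grad_phibar(Z) - 0.5 * (grad_phibar(Abar) + grad_phibar(Bbar))
expected = (N - 3) * p * (1 - N**(p - 1) / 2) * I2
assert np.allclose(grad0, expected)              # matches the formula in the proof

rng = np.random.default_rng(1)
for _ in range(100):                             # 0 minimises Psi on clo(P)
    Y = rng.standard_normal((2, 2))
    assert Psi(Y @ Y.T) > Psi(Z)
```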


Cite this article

Bhatia, R., Gaubert, S. & Jain, T. Matrix versions of the Hellinger distance. Lett Math Phys 109, 1777–1804 (2019). https://doi.org/10.1007/s11005-019-01156-0


Keywords

  • Geometric mean
  • Matrix divergence
  • Bregman divergence
  • Relative entropy
  • Strict convexity
  • Barycentre

Mathematics Subject Classification

  • 15B48
  • 49K35
  • 94A17
  • 81P45