Some Universal Insights on Divergences for Statistics, Machine Learning and Artificial Intelligence

  • Chapter
Geometric Structures of Information

Part of the book series: Signals and Communication Technology ((SCT))

Abstract

Dissimilarity quantifiers such as divergences (e.g. Kullback–Leibler information, relative entropy) and distances between probability distributions are widely used in statistics, machine learning, information theory and adjacent artificial intelligence (AI). In contrast, some applications within these fields deal with divergences between other kinds of real-valued functions and vectors. For a broad readership, we present a corresponding unifying framework which – by its nature as a "structure on structures" – also qualifies as a basis for similarity-based multistage AI and more humanlike (robustly generalizing) machine learning. Furthermore, we discuss some specificities, subtleties and pitfalls that arise when, for example, one "moves away" from the probability context. Several subcases and examples are given, including a new approach to obtaining parameter estimators in continuous models which is based on noisy divergence minimization.
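
To fix ideas, here is a minimal numerical sketch (our own illustration, not code from the chapter; the helper name kl_divergence and the clipping constant are assumptions) of how one such divergence, the Kullback–Leibler information, quantifies the dissimilarity between two discrete probability vectors:

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler information D(p||q) = sum_i p_i * log(p_i / q_i)
    for discrete probability vectors p, q; clipping by eps is only a
    numerical device to avoid log(0) in this illustration."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # > 0, since p differs from q
print(kl_divergence(p, p))  # = 0, a divergence vanishes exactly when p = q
print(kl_divergence(q, p))  # generally differs from kl_divergence(p, q)

The lack of symmetry visible in the last line is one reason to read such a quantity as a directed degree of dissimilarity "from p to q" (cf. footnote 2 below).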

Notes

  1.

    See e.g. Weller-Fahy et al. [93].

  2.

    Alternatively, one can think of d(p, q) as the degree of proximity from p to q.

  3.

    Measurable.

  4.

    In a probabilistic approach rather than a chaos-theoretic approach.

  5.

    Where one of them may e.g. stem from training data.

  6.

    This means that there exists an \(N \in \mathscr {F}\) with \(\lambda [N]=0\) (where the empty set \(N = \emptyset \) is allowed) such that for all \(x \in \mathscr {X}\backslash N\) one has (say) \(p(x) \in ]-\infty ,\infty [\).

  7.

    As an example, let \(\mathscr {X}= \mathbb {R}\), \(\lambda = \lambda _{L}\) be the Lebesgue measure (and hence, except for rare cases, the integral turns into a Riemann integral) and ; since this qualifies as a probability density and thus is a possible candidate for in Sect. 3.3.1.2 below.

  8.

    Respectively working with canonical space representation and \(Y : = id\).

  9.

    As a side remark, let us mention here that in the special case of continuously differentiable strictly log-convex divergence generator \(\phi \), one can construct divergences which are tighter than (38) respectively (39), see Stummer and Kißlinger [82]; in a finite discrete space and for differentiable exponentially concave divergence generator \(\phi \), a similar tightening (called L-divergence) can be found in Pal and Wong [66, 67].

  10.

    The first, second and third integral in (41) can be interpreted, respectively, as the divergence contribution of the function (support) overlap, of one part of the function non-overlap (e.g. describing "extreme outliers"), and of the other part of the function non-overlap (e.g. describing "extreme inliers"); a schematic of this three-part split is sketched at the end of this list.

  11.

    This can be interpreted analogously to footnote 10.

  12.

    That is, the properties (D1) and (D2) (respectively (D2); respectively (D1), (D2) and (D3)) are satisfied.

  13.

    E.g. applying the divergence (46) for \(\alpha \in \mathbb {R}\backslash \{0,1\}\), the sum-entry appears, which can be viewed as penalty for the cell x being empty of data observations (“intrinsic empty-cell-penalty”); for divergence (60), the penalty is .

  14.

    In several situations, such a conversion can appear in a natural way; e.g. an institution may generate/collect data of "continuous value" but, for reasons of confidentiality (information asymmetry), release them to external data analysts only as group frequencies; a minimal numerical sketch of estimation from such group frequencies is given at the end of this list.

  15.

    In an encompassing way, the part (a) reflects a measure-theoretic “plug-in” version of decomposable pseudo-divergences \(D: (\mathscr {P}^{meas,\lambda _1} \cup \mathscr {P}^{meas,\lambda _2}) \otimes \mathscr {P}^{meas,\lambda _1} \mapsto \mathbb {R}\), where \(\mathscr {P}^{meas,\lambda _1}\) is a family of mutually equivalent nonnegative measures of the form \(\mathfrak {P}[\bullet ] := \mathfrak {P}^{\mathbbm {1} \cdot \lambda _{1}}[\bullet ] : = \int _{\bullet } \mathbbm {p}(x) \, \mathrm {d}\lambda _{1}(x)\), \(\mathscr {P}^{meas,\lambda _2}\) is a family of nonnegative measures of the form \(\overline{\mathfrak {P}}[\bullet ]:= \overline{\mathfrak {P}}^{\mathbbm {1} \cdot \lambda _{2}}[\bullet ] : = \int _{\bullet } \mathbbm {q}(x) \, \mathrm {d}\lambda _{2}(x)\) such that any \(\mathfrak {P} \in \mathscr {P}^{meas,\lambda _1}\) is not equivalent to any \(\overline{\mathfrak {P}} \in \mathscr {P}^{meas,\lambda _2}\), and (101) is replaced with \(D(\mathfrak {P},\mathfrak {Q})=\mathfrak {D}^{0}(\mathfrak {P})+\mathfrak {D}^{1}(\mathfrak {Q}) +\int _{\mathscr {X}} \rho _{\mathfrak {Q}}(x) \, \mathrm {d}\mathfrak {P}(x) \quad \text {for all }\mathbbm {P} \in \mathfrak {P} \in \mathscr {P}^{meas,\lambda _1}\cup \mathscr {P}^{meas,\lambda _2}, \mathfrak {Q} \in \mathscr {P}^{meas,\lambda _2}\); cf. Vajda [90], Broniatowski and Vajda [18], Broniatowski et al. [19]; part (b) is new.
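
For orientation: schematically (our own assumed illustration, not a formula from the chapter; the concrete integrands, indicated by dots, are exactly those appearing in (41)), the three-part split of footnote 10 separates such a divergence according to where the two functions p and q are strictly positive,

$$\begin{aligned} D \;=\; \underbrace{\int_{\{p>0,\,q>0\}} (\cdots) \, \mathrm{d}\lambda}_{\text{(support-)overlap part}} \;+\; \underbrace{\int_{\{p>0,\,q=0\}} (\cdots) \, \mathrm{d}\lambda}_{\text{non-overlap part (outlier-type)}} \;+\; \underbrace{\int_{\{p=0,\,q>0\}} (\cdots) \, \mathrm{d}\lambda}_{\text{non-overlap part (inlier-type)}} \end{aligned}$$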

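Complementing footnote 14, here is a minimal numerical sketch of estimating a parameter from such masked group frequencies by minimizing a divergence between empirical and model cell probabilities. All concrete choices below (normal location model, equidistant cells, the Kullback–Leibler divergence, SciPy's bounded scalar minimizer) are our own illustrative assumptions, not prescriptions from the chapter:

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
raw = rng.normal(loc=1.5, scale=1.0, size=2000)      # "continuous-value" raw data

# The data holder only releases cell frequencies (information asymmetry).
edges = np.linspace(-3.0, 6.0, 19)                    # 18 equidistant cells
counts, _ = np.histogram(raw, bins=edges)
p_emp = counts / counts.sum()                         # empirical cell probabilities

def model_cell_probs(theta):
    """Cell probabilities of an N(theta, 1) model under the same grouping."""
    q = np.diff(norm.cdf(edges, loc=theta, scale=1.0))
    return q / q.sum()                                # renormalize mass lost in the tails

def grouped_kl(theta, eps=1e-12):
    """Kullback-Leibler divergence between empirical and model cell probabilities."""
    p = np.clip(p_emp, eps, None)
    q = np.clip(model_cell_probs(theta), eps, None)
    return float(np.sum(p * np.log(p / q)))

res = minimize_scalar(grouped_kl, bounds=(-2.0, 5.0), method="bounded")
print("minimum-divergence estimate of the location parameter:", res.x)

Empty cells are merely clipped away numerically here; by contrast, the divergences discussed in the chapter carry an intrinsic empty-cell penalty (cf. footnote 13).
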
References

  1. Amari, S.-I.: Information Geometry and Its Applications. Springer, Japan (2016)

  2. Amari, S.-I., Karakida, R., Oizumi, M.: Information geometry connecting Wasserstein distance and Kullback-Leibler divergence via the entropy-relaxed transportation problem. Info. Geo. (2018). https://doi.org/10.1007/s41884-018-0002-8

  3. Amari, S.-I., Nagaoka, H.: Methods of Information Geometry. Oxford University Press, Oxford (2000)

  4. Ali, M.S., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 28, 131–140 (1966)

  5. Al Mohamad, D.: Towards a better understanding of the dual representation of phi divergences. Stat. Papers (2016). https://doi.org/10.1007/s00362-016-0812-5

  6. Avlogiaris, G., Micheas, A., Zografos, K.: On local divergences between two probability measures. Metrika 79, 303–333 (2016)

  7. Avlogiaris, G., Micheas, A., Zografos, K.: On testing local hypotheses via local divergence. Stat. Methodol. 31, 20–42 (2016)

  8. Ay, N., Jost, J., Le, H.V., Schwachhöfer, L.: Information Geometry. Springer, Berlin (2017)

  9. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)

  10. Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C.: Robust and efficient estimation by minimizing a density power divergence. Biometrika 85(3), 549–559 (1998)

  11. Basu, A., Lindsay, B.G.: Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Ann. Inst. Stat. Math. 46(4), 683–705 (1994)

  12. Basu, A., Mandal, A., Martin, N., Pardo, L.: Robust tests for the equality of two normal means based on the density power divergence. Metrika 78, 611–634 (2015)

  13. Basu, A., Shioya, H., Park, C.: Statistical Inference: The Minimum Distance Approach. CRC Press, Boca Raton (2011)

  14. Birkhoff, G.D.: A set of postulates for plane geometry, based on scale and protractor. Ann. Math. 33(2), 329–345 (1932)

  15. Boissonnat, J.-D., Nielsen, F., Nock, R.: Bregman Voronoi diagrams. Discret. Comput. Geom. 44(2), 281–307 (2010)

  16. Broniatowski, M., Keziou, A.: Minimization of \(\phi \)-divergences on sets of signed measures. Stud. Sci. Math. Hungar. 43, 403–442 (2006)

  17. Broniatowski, M., Keziou, A.: Parametric estimation and tests through divergences and the duality technique. J. Multiv. Anal. 100(1), 16–36 (2009)

  18. Broniatowski, M., Vajda, I.: Several applications of divergence criteria in continuous families. Kybernetika 48(4), 600–636 (2012)

  19. Broniatowski, M., Toma, A., Vajda, I.: Decomposable pseudodistances in statistical estimation. J. Stat. Plan. Inf. 142, 2574–2585 (2012)

  20. Buckland, M.K.: Information as thing. J. Am. Soc. Inf. Sci. 42(5), 351–360 (1991)

  21. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning and Games. Cambridge University Press, Cambridge (2006)

  22. Chhogyal, K., Nayak, A., Sattar, A.: On the KL divergence of probability mixtures for belief contraction. In: Hölldobler,S., et al. (eds.) KI 2015: Advances in Artificial Intelligence. Lecture Notes in Artificial Intelligence, vol. 9324, pp. 249–255. Springer International Publishing (2015)

  23. Cliff, O.M., Prokopenko, M., Fitch, R.: An information criterion for inferring coupling in distributed dynamical systems. Front. Robot. AI 3(71). https://doi.org/10.3389/frobt.2016.00071 (2016)

  24. Cliff, O.M., Prokopenko, M., Fitch, R.: Minimising the Kullback-Leibler divergence for model selection in distributed nonlinear systems. Entropy 20(51). https://doi.org/10.3390/e20020051 (2018)

  25. Collins, M., Schapire, R.E., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. Mach. Learn. 48, 253–285 (2002)

  26. Cooper, V.N., Haddad, H.M., Shahriar, H.: Android malware detection using Kullback-Leibler divergence. Adv. Distrib. Comp. Art. Int. J., Special Issue 3(2) (2014)

  27. Csiszar, I.: Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad. Sci. A-8, 85–108 (1963)

  28. DasGupta, A.: Some results on the curse of dimensionality and sample size recommendations. Calcutta Stat. Assoc. Bull. 50(3–4), 157–178 (2000)

  29. De Groot, M.H.: Uncertainty, information and sequential experiments. Ann. Math. Stat. 33, 404–419 (1962)

  30. Ghosh, A., Basu, A.: Robust Bayes estimation using the density power divergence. Ann. Inst. Stat. Math. 68, 413–437 (2016)

  31. Ghosh, A., Basu, A.: Robust estimation in generalized linear models: the density power divergence approach. TEST 25, 269–290 (2016)

  32. Ghosh, A., Harris, I.R., Maji, A., Basu, A., Pardo, L.: A generalized divergence for statistical inference. Bernoulli 23(4A), 2746–2783 (2017)

  33. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A.: Robust Statistics: The Approach Based on Influence Functions. Wiley, New York (1986)

  34. Karakida, R., Amari, S.-I.: Information geometry of Wasserstein divergence. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 119–126. Springer International (2017)

  35. Kißlinger, A.-L., Stummer, W.: Some decision procedures based on scaled Bregman distance surfaces. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2013. Lecture Notes in Computer Science, vol. 8085, pp. 479–486. Springer, Berlin (2013)

  36. Kißlinger, A.-L., Stummer, W.: New model search for nonlinear recursive models, regressions and autoregressions. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2015. Lecture Notes in Computer Science, vol. 9389, pp. 693–701. Springer International (2015)

  37. Kißlinger, A.-L., Stummer, W.: Robust statistical engineering by means of scaled Bregman distances. In: Agostinelli, C., Basu, A., Filzmoser, P., Mukherjee, D. (eds.) Recent Advances in Robust Statistics - Theory and Applications, pp. 81–113. Springer, India (2016)

  38. Kißlinger, A.-L., Stummer, W.: A new toolkit for robust distributional change detection. Appl. Stochastic Models Bus. Ind. 34, 682–699 (2018)

  39. Kuchibhotla, A.K., Basu, A.: A general setup for minimum disparity estimation. Stat. Prob. Lett. 96, 68–74 (2015)

  40. Liese, F., Miescke, K.J.: Statistical Decision Theory: Estimation, Testing, and Selection. Springer, New York (2008)

  41. Liese, F., Vajda, I.: Convex Statistical Distances. Teubner, Leipzig (1987)

  42. Liese, F., Vajda, I.: On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 52(10), 4394–4412 (2006)

  43. Lin, N., He, X.: Robust and efficient estimation under data grouping. Biometrika 93(1), 99–112 (2006)

  44. Liu, M., Vemuri, B.C., Amari, S.-I., Nielsen, F.: Total Bregman divergence and its applications to shape retrieval. In: Proceedings of 23rd IEEE CVPR, pp. 3463–3468 (2010)

  45. Liu, M., Vemuri, B.C., Amari, S.-I., Nielsen, F.: Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Intell. 34(12), 2407–2419 (2012)

  46. Lizier, J.T.: JIDT: an information-theoretic toolkit for studying the dynamics of complex systems. Front. Robot. AI 1(11). https://doi.org/10.3389/frobt.2014.00011 (2014)

  47. Menendez, M., Morales, D., Pardo, L., Vajda, I.: Two approaches to grouping of data and related disparity statistics. Comm. Stat. - Theory Methods 27(3), 609–633 (1998)

  48. Menendez, M., Morales, D., Pardo, L., Vajda, I.: Minimum divergence estimators based on grouped data. Ann. Inst. Stat. Math. 53(2), 277–288 (2001)

  49. Menendez, M., Morales, D., Pardo, L., Vajda, I.: Minimum disparity estimators for discrete and continuous models. Appl. Math. 46(6), 439–466 (2001)

  50. Millmann, R.S., Parker, G.D.: Geometry - A Metric Approach With Models, 2nd edn. Springer, New York (1991)

  51. Minka, T.: Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research Ltd., Cambridge, UK (2005)

  52. Morales, D., Pardo, L., Vajda, I.: Digitalization of observations permits efficient estimation in continuous models. In: Lopez-Diaz, M., et al. (eds.) Soft Methodology and Random Information Systems, pp. 315–322. Springer, Berlin (2004)

  53. Morales, D., Pardo, L., Vajda, I.: On efficient estimation in continuous models based on finitely quantized observations. Comm. Stat. - Theory Methods 35(9), 1629–1653 (2006)

  54. Murata, N., Takenouchi, T., Kanamori, T., Eguchi, S.: Information geometry of U-boost and Bregman divergence. Neural Comput. 16(7), 1437–1481 (2004)

  55. Nielsen, F., Barbaresco, F. (eds.): Geometric Science of Information GSI 2013. Lecture Notes in Computer Science, vol. 8085. Springer, Berlin (2013)

  56. Nielsen, F., Barbaresco, F. (eds.): Geometric Science of Information GSI 2015. Lecture Notes in Computer Science, vol. 9389. Springer International (2015)

  57. Nielsen, F., Barbaresco, F. (eds.): Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589. Springer International (2017)

  58. Nielsen, F., Bhatia, R. (eds.): Matrix Information Geometry. Springer, Berlin (2013)

  59. Nielsen, F., Nock, R.: Bregman divergences from comparative convexity. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 639–647. Springer International (2017)

  60. Nielsen, F., Sun, K., Marchand-Maillet, S.: On Hölder projective divergences. Entropy 19, 122 (2017)

  61. Nielsen, F., Sun, K., Marchand-Maillet,S.: K-means clustering with Hölder divergences. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 856–863. Springer International (2017)

  62. Nock, R., Menon, A.K., Ong, C.S.: A scaled Bregman theorem with applications. Advances in Neural Information Processing Systems 29 (NIPS 2016), pp. 19–27 (2016)

  63. Nock, R., Nielsen, F.: Bregman divergences and surrogates for learning. IEEE Trans. Pattern Anal. Mach. Intell. 31(11), 2048–2059 (2009)

  64. Nock, R., Nielsen, F., Amari, S.-I.: On conformal divergences and their population minimizers. IEEE Trans. Inf. Theory 62(1), 527–538 (2016)

  65. Österreicher, F., Vajda, I.: Statistical information and discrimination. IEEE Trans. Inf. Theory 39, 1036–1039 (1993)

  66. Pal, S., Wong, T.-K.L.: The geometry of relative arbitrage. Math. Financ. Econ. 10, 263–293 (2016)

  67. Pal, S., Wong, T.-K.L.: Exponentially concave functions and a new information geometry. Ann. Probab. 46(2), 1070–1113 (2018)

  68. Pardo, L.: Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC, Boca Raton (2006)

  69. Park, C., Basu, A.: Minimum disparity estimation: asymptotic normality and breakdown point results. Bull. Inf. Kybern. 36, 19–33 (2004)

  70. Patra, S., Maji, A., Basu, A., Pardo, L.: The power divergence and the density power divergence families: the mathematical connection. Sankhya 75-B Part 1, 16–28 (2013)

  71. Peyre, G., Cuturi, M.: Computational Optimal Transport (2018). arXiv:1803.00567v1

  72. Read, T.R.C., Cressie, N.A.C.: Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer, New York (1988)

  73. Reid, M.D., Williamson, R.C.: Information, divergence and risk for binary experiments. J. Mach. Learn. Res. 12, 731–817 (2011)

  74. Roensch, B., Stummer, W.: 3D insights to some divergences for robust statistics and machine learning. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 460–469. Springer International (2017)

  75. Rüschendorf, L.: On the minimum discrimination information system. Stat. Decis. Suppl. Issue 1, 263–283 (1984)

  76. Scott, D.W.: Multivariate Density Estimation - Theory, Practice and Visualization, 2nd edn. Wiley, Hoboken (2015)

  77. Scott, D.W., Wand, M.P.: Feasibility of multivariate density estimates. Biometrika 78(1), 197–205 (1991)

  78. Stummer, W.: On a statistical information measure of diffusion processes. Stat. Decis. 17, 359–376 (1999)

  79. Stummer, W.: On a statistical information measure for a generalized Samuelson-Black-Scholes model. Stat. Decis. 19, 289–314 (2001)

  80. Stummer, W.: Exponentials, Diffusions, Finance. Entropy and Information. Shaker, Aachen (2004)

  81. Stummer, W.: Some Bregman distances between financial diffusion processes. Proc. Appl. Math. Mech. 7(1), 1050503–1050504 (2007)

  82. Stummer, W., Kißlinger, A-L.: Some new flexibilizations of Bregman divergences and their asymptotics. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 514–522. Springer International (2017)

  83. Stummer, W., Vajda, I.: On divergences of finite measures and their applicability in statistics and information theory. Statistics 44, 169–187 (2010)

  84. Stummer, W., Vajda, I.: On Bregman distances and divergences of probability measures. IEEE Trans. Inf. Theory 58(3), 1277–1288 (2012)

  85. Sugiyama, M., Suzuki, T., Kanamori, T.: Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Ann. Inst. Stat. Math. 64, 1009–1044 (2012)

  86. Toma, A., Broniatowski, M.: Dual divergence estimators and tests: robustness results. J. Multiv. Anal. 102, 20–36 (2011)

  87. Tsuda, K., Rätsch, G., Warmuth, M.: Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res. 6, 995–1018 (2005)

  88. van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer, Berlin (1996)

  89. Vajda, I.: Theory of Statistical Inference and Information. Kluwer, Dordrecht (1989)

  90. Vajda, I.: Modifications of divergence criteria for applications in continuous families. Research Report No. 2230, Institute of Information Theory and Automation, Prague (2008)

  91. Vemuri, B.C., Liu, M., Amari, S.-I., Nielsen, F.: Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med. Imag. 30(2), 475–483 (2011)

  92. Victoria-Feser, M.-P., Ronchetti, E.: Robust estimation for grouped data. J. Am. Stat. Assoc. 92(437), 333–340 (1997)

  93. Weller-Fahy, D.J., Borghetti, B.J., Sodemann, A.A.: A survey of distance and similarity measures used within network intrusion anomaly detection. IEEE Commun. Surv. Tutor. 17(1), 70–91 (2015)

  94. Wu, L., Hoi, S.C.H., Jin, R., Zhu, J., Yu, N.: Learning Bregman distance functions for semi-supervised clustering. IEEE Trans. Knowl. Data Engin. 24(3), 478–491 (2012)

  95. Zhang, J., Naudts, J.: Information geometry under monotone embedding, part I: divergence functions. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 205–214. Springer International (2017)

  96. Zhang, J., Wang, X., Yao, L., Li, J., Shen, X.: Using Kullback-Leibler divergence to model opponents in poker. Computer Poker and Imperfect Information: Papers from the AAAI-14 Workshop (2014)

Acknowledgements

We are grateful to three anonymous referees for their very useful suggestions and comments. W. Stummer would like to thank the Sorbonne Universite Pierre et Marie Curie, Paris, for its partial financial support.

Author information

Correspondence to Wolfgang Stummer.

Appendix: Proofs

Proof of Theorem 4. Assertion (1) and the "if-part" of (2) follow immediately from Theorem 1, which uses less restrictive assumptions. In order to show the "only-if" part of (2) (and, alternatively, the "if-part" of (2)), one can use the straightforwardly provable fact that Assumption 2 implies

$$\begin{aligned}& \overline{\mathbbm {w}_{3} \cdot \psi _{\phi ,c}}(x,s,t) \ = \ 0 \qquad \text {if and only if} \qquad s \ = \ t \end{aligned}$$
(136)

for all \(s \in \mathscr {R}\big (\frac{P}{M_{1}}\big )\), all \(t \in \mathscr {R}\big (\frac{Q}{M_{2}}\big )\) and \(\lambda \)-a.a. \(x \in \mathscr {X}\). To proceed, assume that \(D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q) = 0\), which by the non-negativity of \(\overline{\mathbbm {w}_{3} \cdot \psi _{\phi ,c}}(\cdot ,\cdot ,\cdot )\) implies that \(\overline{\mathbbm {w}_{3} \cdot \psi _{\phi ,c}}\big (x,\frac{p(x)}{m_{1}(x)},\frac{q(x)}{m_{2}(x)}\big ) = 0\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\). From this and the "only-if" part of (136), we obtain the identity \(\frac{p(x)}{m_1(x)}=\frac{q(x)}{m_2(x)}\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\).    \(\square \)

Proof of Theorem 5. Consistent with Theorem 1 (and our adaptations), the "if-part" follows from (51). By our above investigations on the adaptations of Assumption 2 to the current context, it remains to investigate the "only-if" part of (2) for the following four cases (recall that \(\phi \) is strictly convex at \(t=1\)): (ia) \(\phi \) is differentiable at \(t=1\) (hence, c is obsolete and \(\phi _{+,c}^{\prime }(1)\) collapses to \(\phi ^{\prime }(1)\)) and the function \(\phi \) is affine linear on [1, s] for some \(s \in \mathscr {R}\big (\frac{P}{Q}\big )\backslash [a,1]\); (ib) \(\phi \) is differentiable at \(t=1\), and the function \(\phi \) is affine linear on [s, 1] for some \(s \in \mathscr {R}\big (\frac{P}{Q}\big )\backslash [1,b]\); (ii) \(\phi \) is not differentiable at \(t=1\), \(c=1\), and the function \(\phi \) is affine linear on [1, s] for some \(s \in \mathscr {R}\big (\frac{P}{Q}\big )\backslash [a,1]\); (iii) \(\phi \) is not differentiable at \(t=1\), \(c=0\), and the function \(\phi \) is affine linear on [s, 1] for some \(s \in \mathscr {R}\big (\frac{P}{Q}\big )\backslash [1,b]\). It is easy to see from the strict convexity at 1 that for (ii) one has \(\phi (0) + \phi _{+,1}^{\prime }(1) - \phi (1) >0\), whereas for (iii) one gets \(\phi ^{*}(0) -\phi _{+,0}^{\prime }(1) >0\); furthermore, for (ia) one has \(\phi (0) + \phi ^{\prime }(1) - \phi (1) >0\) and for (ib) \(\phi ^{*}(0) -\phi ^{\prime }(1) >0\). Let us first examine the situations (ia) respectively (ii) under the constraint \(D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q})= 0\), with obsolete c in case (ia) (due to differentiability) and with \(c=1\) in case (ii); from (51) we can then deduce

$$\begin{aligned}& \textstyle \textstyle 0 = D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle \geqslant \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \big [ \mathbbm {q}(x) \cdot \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) - \mathbbm {q}(x) \cdot \phi \big ( 1 \big ) - \phi _{+,c}^{\prime } \big ( 1 \big ) \cdot \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \big ] \, \nonumber \\[-0.1cm]&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \big ) \cdot \varvec{1}_{]\mathbbm {p}(x),\infty [}\big (\mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \big [ \phi (0) + \phi _{+,c}^{\prime }(1) - \phi (1) \big ] \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \cdot \varvec{1}_{]\mathbbm {p}(x),\infty [}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \, \geqslant 0 , \nonumber \end{aligned}$$

and hence \(\int _{{\mathscr {X}}} \varvec{1}_{]\mathbbm {p}(x),\infty [}\big (\mathbbm {q}(x)\big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \, = 0\). From this and (55) we obtain

$$\begin{aligned}& \textstyle 0 \, = \, \int _{{\mathscr {X}}} \, \big ( \mathbbm {p}(x) \, - \, \mathbbm {q}(x) \big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \, = \, \int _{{\mathscr {X}}} \, \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \, \cdot \,\varvec{1}_{]\mathbbm {q}(x),\infty [}\big (\mathbbm {p}(x)\big ) \, \cdot \, \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \qquad \ \ \nonumber \end{aligned}$$

and therefore \(\int _{{\mathscr {X}}} \varvec{1}_{]\mathbbm {q}(x),\infty [}\big (\mathbbm {p}(x)\big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \, = 0\). Since for \(\lambda \)-a.a. \(x\in \mathscr {X}\) we have \(\mathbbm {r}(x) >0\), we arrive at \(\mathbbm {p}(x) =\mathbbm {q}(x)\) for \(\lambda \)-a.a. \(x\in \mathscr {X}\). The remaining cases (ib) respectively (iii) can be treated analogously.        \(\square \)

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Broniatowski, M., Stummer, W. (2019). Some Universal Insights on Divergences for Statistics, Machine Learning and Artificial Intelligence. In: Nielsen, F. (eds) Geometric Structures of Information. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-030-02520-5_8

  • DOI: https://doi.org/10.1007/978-3-030-02520-5_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-02519-9

  • Online ISBN: 978-3-030-02520-5

  • eBook Packages: Engineering, Engineering (R0)
