Abstract
Dissimilarity quantifiers such as divergences (e.g. Kullback–Leibler information, relative entropy) and distances between probability distributions are widely used in statistics, machine learning, information theory and adjacent artificial intelligence (AI). Within these fields, however, some applications deal with divergences between other types of real-valued functions and vectors. For a broad readership, we present a correspondingly unifying framework which – by its nature as a “structure on structures” – also qualifies as a basis for similarity-based multistage AI and more humanlike (robustly generalizing) machine learning. Furthermore, we discuss particularities, subtleties and pitfalls that arise when one e.g. “moves away” from the probability context. Several subcases and examples are given, including a new approach to obtaining parameter estimators in continuous models based on noisy divergence minimization.
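To fix ideas for the divergence-minimization theme of the abstract, the following is a minimal generic sketch (our illustration with a hypothetical Gaussian sample; it is not the chapter's specific noisy-minimization construction): a scale parameter is estimated by minimizing a discretized Kullback–Leibler divergence between a histogram density and the model density.

```python
# Minimal, generic minimum-divergence estimation sketch (NOT the chapter's
# specific noisy-minimization approach): estimate the scale of a normal model
# by minimizing a discretized KL divergence between a histogram density and
# the model density. Sample, bin count and bounds are hypothetical choices.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=2.0, size=5000)   # hypothetical sample

# Empirical density via histogram cells (cf. the grouping theme of footnote 14)
counts, edges = np.histogram(data, bins=60, density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
widths = np.diff(edges)

def kl_to_model(sigma):
    """Discretized KL(empirical || N(0, sigma^2)) over the histogram cells."""
    q = np.maximum(norm.pdf(mids, loc=0.0, scale=sigma), 1e-300)  # avoid log(0)
    mask = counts > 0                  # only cells in the support overlap contribute
    return np.sum(counts[mask] * np.log(counts[mask] / q[mask]) * widths[mask])

res = minimize_scalar(kl_to_model, bounds=(0.1, 10.0), method="bounded")
print(f"minimum-KL estimate of sigma: {res.x:.3f}")   # close to the true 2.0
```

The same pattern applies with other divergence generators in place of the Kullback–Leibler one.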
Notes
- 1.
See e.g. Weller-Fahy et al. [93].
- 2.
Alternatively, one can think of d(p, q) as degree of proximity from p to q.
- 3.
Measurable.
- 4.
In a probabilistic approach rather than a chaos-theoretic approach.
- 5.
Where one of them may e.g. stem from training data.
- 6.
This means that there exists an \(N \in \mathscr {F}\) with \(\lambda [N]=0\) (where the empty set \(N = \emptyset \) is allowed) such that \(p(x) \in ]-\infty ,\infty [\) (say) holds for all \(x \in \mathscr {X}\backslash N\).
- 7.
As an example, let \(\mathscr {X}= \mathbb {R}\) and \(\lambda = \lambda _{L}\) be the Lebesgue measure (and hence, except for rare cases, the integral turns into a Riemann integral); since the considered function qualifies as a probability density, it is a possible candidate in Sect. 3.3.1.2 below (a numerical check of the density property, with a stand-in candidate, is sketched after these notes).
- 8.
Respectively, working with the canonical space representation and \(Y : = id\).
- 9.
As a side remark, let us mention here that in the special case of continuously differentiable strictly log-convex divergence generator \(\phi \), one can construct divergences which are tighter than (38) respectively (39), see Stummer and Kißlinger [82]; in a finite discrete space and for differentiable exponentially concave divergence generator \(\phi \), a similar tightening (called L-divergence) can be found in Pal and Wong [66, 67].
- 10.
The first, second and third integral in (41) can be interpreted as the divergence contribution of the function-(support-)overlap, of one part of the function-nonoverlap (e.g. describing “extreme outliers”), and of the other part of the function-nonoverlap (e.g. describing “extreme inliers”), respectively (see the schematic support split sketched after these notes).
- 11.
This can be interpreted analogously as in footnote 10.
- 12.
i.e. the properties (D1) and (D2) (respectively (D2); respectively (D1), (D2) and (D3)) are satisfied.
- 13.
- 14.
In several situations, such a conversion can appear in a natural way; e.g. an institution may generate/collect data of “continuous value” but, for reasons of confidentiality (information asymmetry), mask them into group frequencies for external data analysts (a minimal grouping sketch is given after these notes).
- 15.
In an encompassing way, part (a) reflects a measure-theoretic “plug-in” version of decomposable pseudo-divergences \(D: (\mathscr {P}^{meas,\lambda _1} \cup \mathscr {P}^{meas,\lambda _2}) \otimes \mathscr {P}^{meas,\lambda _2} \rightarrow \mathbb {R}\), where \(\mathscr {P}^{meas,\lambda _1}\) is a family of mutually equivalent nonnegative measures of the form \(\mathfrak {P}[\bullet ] := \mathfrak {P}^{\mathbbm {p} \cdot \lambda _{1}}[\bullet ] : = \int _{\bullet } \mathbbm {p}(x) \, \mathrm {d}\lambda _{1}(x)\), and \(\mathscr {P}^{meas,\lambda _2}\) is a family of nonnegative measures of the form \(\overline{\mathfrak {P}}[\bullet ]:= \overline{\mathfrak {P}}^{\mathbbm {q} \cdot \lambda _{2}}[\bullet ] : = \int _{\bullet } \mathbbm {q}(x) \, \mathrm {d}\lambda _{2}(x)\) such that no \(\mathfrak {P} \in \mathscr {P}^{meas,\lambda _1}\) is equivalent to any \(\overline{\mathfrak {P}} \in \mathscr {P}^{meas,\lambda _2}\), and (101) is replaced with \(D(\mathfrak {P},\mathfrak {Q})=\mathfrak {D}^{0}(\mathfrak {P})+\mathfrak {D}^{1}(\mathfrak {Q}) +\int _{\mathscr {X}} \rho _{\mathfrak {Q}}(x) \, \mathrm {d}\mathfrak {P}(x)\) for all \(\mathfrak {P} \in \mathscr {P}^{meas,\lambda _1}\cup \mathscr {P}^{meas,\lambda _2}\), \(\mathfrak {Q} \in \mathscr {P}^{meas,\lambda _2}\) (a standard decomposable example is spelled out after these notes); cf. Vajda [90], Broniatowski and Vajda [18], Broniatowski et al. [19]; part (b) is new.
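Complementing footnote 7, here is a minimal numerical check of the density property. Since the concrete density is not reproduced above, we use a hypothetical stand-in candidate, the standard Laplace density — an assumption for illustration only.

```python
# Footnote 7 refers to a concrete density on (R, lambda_L); as a HYPOTHETICAL
# stand-in we check the standard Laplace density f(x) = 0.5 * exp(-|x|).
import numpy as np
from scipy.integrate import quad

f = lambda x: 0.5 * np.exp(-np.abs(x))

total, err = quad(f, -np.inf, np.inf)        # (Lebesgue/Riemann) integral over R
grid = np.linspace(-50.0, 50.0, 10001)
print(f"integral = {total:.6f} (abs. err ~ {err:.1e}), "
      f"nonnegative on grid: {bool(np.all(f(grid) >= 0))}")
# integral = 1.000000 -> f qualifies as a probability density
```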
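Complementing footnotes 10 and 11, the following gives a schematic picture of the support split behind (41), which is not reproduced here; the functions p, q below are hypothetical grid values, not quantities from the chapter.

```python
# Footnotes 10-11: schematic split of the integration domain into the
# function-(support-)overlap and the two non-overlap parts; p and q are
# hypothetical nonnegative functions with only partially overlapping supports.
import numpy as np

x = np.linspace(0.0, 10.0, 1001)
p = np.where(x < 7.0, np.exp(-x), 0.0)            # support [0, 7)
q = np.where(x > 2.0, np.exp(-(x - 2.0)), 0.0)    # support (2, 10]

overlap = (p > 0) & (q > 0)    # first integral: support overlap
only_p  = (p > 0) & (q == 0)   # second: one non-overlap part ("extreme outliers")
only_q  = (p == 0) & (q > 0)   # third: other non-overlap part ("extreme inliers")
print(f"fractions of the grid: overlap {overlap.mean():.2f}, "
      f"only p {only_p.mean():.2f}, only q {only_q.mean():.2f}")
```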
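Complementing footnote 14, here is a minimal sketch of masking “continuous-value” data into group frequencies; the data, cell boundaries and sizes are hypothetical.

```python
# Footnote 14: an institution holds raw continuous-value data but releases
# only group frequencies to external analysts (hypothetical data and cells).
import numpy as np

rng = np.random.default_rng(1)
raw = rng.exponential(scale=3.0, size=1000)           # confidential raw data

edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0, np.inf])   # published cell boundaries
counts, _ = np.histogram(raw, bins=edges)

# Only the group frequencies are released:
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:4.1f}, {hi:4.1f}): {n}")
```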
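Complementing footnote 15, here is a standard worked example (well known, and distinct from the new part (b)): the Kullback–Leibler divergence is decomposable in the sense of (101), with \(\mathfrak {D}^{0}\), \(\mathfrak {D}^{1}\) and \(\rho _{\mathfrak {Q}}\) identifiable as follows.

```latex
% KL as a decomposable pseudo-divergence (standard example): with densities
% p = dP/d(lambda), q = dQ/d(lambda),
\begin{align*}
D_{KL}(\mathfrak{P},\mathfrak{Q})
  &= \int_{\mathscr{X}} \mathbbm{p}(x)\,
     \log\frac{\mathbbm{p}(x)}{\mathbbm{q}(x)}\,\mathrm{d}\lambda(x) \\
  &= \underbrace{\int_{\mathscr{X}} \mathbbm{p}(x)\log \mathbbm{p}(x)\,
     \mathrm{d}\lambda(x)}_{\mathfrak{D}^{0}(\mathfrak{P})}
   \;+\; \underbrace{0\vphantom{\int_{\mathscr{X}}}}_{\mathfrak{D}^{1}(\mathfrak{Q})}
   \;+\; \int_{\mathscr{X}}
     \underbrace{\bigl(-\log \mathbbm{q}(x)\bigr)}_{\rho_{\mathfrak{Q}}(x)}
     \,\mathrm{d}\mathfrak{P}(x).
\end{align*}
```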
References
Amari, S.-I.: Information Geometry and Its Applications. Springer, Japan (2016)
Amari, S.-I., Karakida, R., Oizumi, M.: Information geometry connecting Wasserstein distance and Kullback-Leibler divergence via the entropy-relaxed transportation problem. Info. Geo. (2018). https://doi.org/10.1007/s41884-018-0002-8
Amari, S.-I., Nagaoka, H.: Methods of Information Geometry. Oxford University Press, Oxford (2000)
Ali, S.M., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. B 28, 131–142 (1966)
Al Mohamad, D.: Towards a better understanding of the dual representation of phi divergences. Stat. Papers (2016). https://doi.org/10.1007/s00362-016-0812-5
Avlogiaris, G., Micheas, A., Zografos, K.: On local divergences between two probability measures. Metrika 79, 303–333 (2016)
Avlogiaris, G., Micheas, A., Zografos, K.: On testing local hypotheses via local divergence. Stat. Methodol. 31, 20–42 (2016)
Ay, N., Jost, J., Le, H.V., Schwachhöfer, L.: Information Geometry. Springer, Berlin (2017)
Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)
Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C.: Robust and efficient estimation by minimizing a density power divergence. Biometrika 85(3), 549–559 (1998)
Basu, A., Lindsay, B.G.: Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Ann. Inst. Stat. Math. 46(4), 683–705 (1994)
Basu, A., Mandal, A., Martin, N., Pardo, L.: Robust tests for the equality of two normal means based on the density power divergence. Metrika 78, 611–634 (2015)
Basu, A., Shioya, H., Park, C.: Statistical Inference: The Minimum Distance Approach. CRC Press, Boca Raton (2011)
Birkhoff, G.D.: A set of postulates for plane geometry, based on scale and protractor. Ann. Math. 33(2), 329–345 (1932)
Boissonnat, J.-D., Nielsen, F., Nock, R.: Bregman Voronoi diagrams. Discret. Comput. Geom. 44(2), 281–307 (2010)
Broniatowski, M., Keziou, A.: Minimization of \(\phi \)-divergences on sets of signed measures. Stud. Sci. Math. Hungar. 43, 403–442 (2006)
Broniatowski, M., Keziou, A.: Parametric estimation and tests through divergences and the duality technique. J. Multiv. Anal. 100(1), 16–36 (2009)
Broniatowski, M., Vajda, I.: Several applications of divergence criteria in continuous families. Kybernetika 48(4), 600–636 (2012)
Broniatowski, M., Toma, A., Vajda, I.: Decomposable pseudodistances in statistical estimation. J. Stat. Plan. Inf. 142, 2574–2585 (2012)
Buckland, M.K.: Information as thing. J. Am. Soc. Inf. Sci. 42(5), 351–360 (1991)
Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning and Games. Cambridge University Press, Cambridge (2006)
Chhogyal, K., Nayak, A., Sattar, A.: On the KL divergence of probability mixtures for belief contraction. In: Hölldobler, S., et al. (eds.) KI 2015: Advances in Artificial Intelligence. Lecture Notes in Artificial Intelligence, vol. 9324, pp. 249–255. Springer International Publishing (2015)
Cliff, O.M., Prokopenko, M., Fitch, R.: An information criterion for inferring coupling in distributed dynamical systems. Front. Robot. AI 3, 71 (2016). https://doi.org/10.3389/frobt.2016.00071
Cliff, O.M., Prokopenko, M., Fitch, R.: Minimising the Kullback-Leibler divergence for model selection in distributed nonlinear systems. Entropy 20(2), 51 (2018). https://doi.org/10.3390/e20020051
Collins, M., Schapire, R.E., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. Mach. Learn. 48, 253–285 (2002)
Cooper, V.N., Haddad, H.M., Shahriar, H.: Android malware detection using Kullback-Leibler divergence. Adv. Distrib. Comp. Art. Int. J., Special Issue 3(2) (2014)
Csiszár, I.: Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad. Sci. A-8, 85–108 (1963)
DasGupta, A.: Some results on the curse of dimensionality and sample size recommendations. Calcutta Stat. Assoc. Bull. 50(3–4), 157–178 (2000)
De Groot, M.H.: Uncertainty, information and sequential experiments. Ann. Math. Stat. 33, 404–419 (1962)
Ghosh, A., Basu, A.: Robust Bayes estimation using the density power divergence. Ann. Inst. Stat. Math. 68, 413–437 (2016)
Ghosh, A., Basu, A.: Robust estimation in generalized linear models: the density power divergence approach. TEST 25, 269–290 (2016)
Ghosh, A., Harris, I.R., Maji, A., Basu, A., Pardo, L.: A generalized divergence for statistical inference. Bernoulli 23(4A), 2746–2783 (2017)
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A.: Robust Statistics: The Approach Based on Influence Functions. Wiley, New York (1986)
Karakida, R., Amari, S.-I.: Information geometry of Wasserstein divergence. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 119–126. Springer International (2017)
Kißlinger, A.-L., Stummer, W.: Some decision procedures based on scaled Bregman distance surfaces. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2013. Lecture Notes in Computer Science, vol. 8085, pp. 479–486. Springer, Berlin (2013)
Kißlinger, A.-L., Stummer, W.: New model search for nonlinear recursive models, regressions and autoregressions. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2015. Lecture Notes in Computer Science, vol. 9389, pp. 693–701. Springer International (2015)
Kißlinger, A.-L., Stummer, W.: Robust statistical engineering by means of scaled Bregman distances. In: Agostinelli, C., Basu, A., Filzmoser, P., Mukherjee, D. (eds.) Recent Advances in Robust Statistics - Theory and Applications, pp. 81–113. Springer, India (2016)
Kißlinger, A.-L., Stummer, W.: A new toolkit for robust distributional change detection. Appl. Stochastic Models Bus. Ind. 34, 682–699 (2018)
Kuchibhotla, A.K., Basu, A.: A general setup for minimum disparity estimation. Stat. Prob. Lett. 96, 68–74 (2015)
Liese, F., Miescke, K.J.: Statistical Decision Theory: Estimation, Testing, and Selection. Springer, New York (2008)
Liese, F., Vajda, I.: Convex Statistical Distances. Teubner, Leipzig (1987)
Liese, F., Vajda, I.: On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 52(10), 4394–4412 (2006)
Lin, N., He, X.: Robust and efficient estimation under data grouping. Biometrika 93(1), 99–112 (2006)
Liu, M., Vemuri, B.C., Amari, S.-I., Nielsen, F.: Total Bregman divergence and its applications to shape retrieval. In: Proceedings of 23rd IEEE CVPR, pp. 3463–3468 (2010)
Liu, M., Vemuri, B.C., Amari, S.-I., Nielsen, F.: Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Intell. 34(12), 2407–2419 (2012)
Lizier, J.T.: JIDT: an information-theoretic toolkit for studying the dynamics of complex systems. Front. Robot. AI 1, 11 (2014). https://doi.org/10.3389/frobt.2014.00011
Menendez, M., Morales, D., Pardo, L., Vajda, I.: Two approaches to grouping of data and related disparity statistics. Comm. Stat. - Theory Methods 27(3), 609–633 (1998)
Menendez, M., Morales, D., Pardo, L., Vajda, I.: Minimum divergence estimators based on grouped data. Ann. Inst. Stat. Math. 53(2), 277–288 (2001)
Menendez, M., Morales, D., Pardo, L., Vajda, I.: Minimum disparity estimators for discrete and continuous models. Appl. Math. 46(6), 439–466 (2001)
Millmann, R.S., Parker, G.D.: Geometry - A Metric Approach With Models, 2nd edn. Springer, New York (1991)
Minka, T.: Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research Ltd., Cambridge, UK (2005)
Morales, D., Pardo, L., Vajda, I.: Digitalization of observations permits efficient estimation in continuous models. In: Lopez-Diaz, M., et al. (eds.) Soft Methodology and Random Information Systems, pp. 315–322. Springer, Berlin (2004)
Morales, D., Pardo, L., Vajda, I.: On efficient estimation in continuous models based on finitely quantized observations. Comm. Stat. - Theory Methods 35(9), 1629–1653 (2006)
Murata, N., Takenouchi, T., Kanamori, T., Eguchi, S.: Information geometry of U-boost and Bregman divergence. Neural Comput. 16(7), 1437–1481 (2004)
Nielsen, F., Barbaresco, F. (eds.): Geometric Science of Information GSI 2013. Lecture Notes in Computer Science, vol. 8085. Springer, Berlin (2013)
Nielsen, F., Barbaresco, F. (eds.): Geometric Science of Information GSI 2015. Lecture Notes in Computer Science, vol. 9389. Springer International (2015)
Nielsen, F., Barbaresco, F. (eds.): Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589. Springer International (2017)
Nielsen, F., Bhatia, R. (eds.): Matrix Information Geometry. Springer, Berlin (2013)
Nielsen, F., Nock, R.: Bregman divergences from comparative convexity. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 639–647. Springer International (2017)
Nielsen, F., Sun, K., Marchand-Maillet, S.: On Hölder projective divergences. Entropy 19, 122 (2017)
Nielsen, F., Sun, K., Marchand-Maillet, S.: K-means clustering with Hölder divergences. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 856–863. Springer International (2017)
Nock, R., Menon, A.K., Ong, C.S.: A scaled Bregman theorem with applications. Advances in Neural Information Processing Systems 29 (NIPS 2016), pp. 19–27 (2016)
Nock, R., Nielsen, F.: Bregman divergences and surrogates for learning. IEEE Trans. Pattern Anal. Mach. Intell. 31(11), 2048–2059 (2009)
Nock, R., Nielsen, F., Amari, S.-I.: On conformal divergences and their population minimizers. IEEE Trans. Inf. Theory 62(1), 527–538 (2016)
Österreicher, F., Vajda, I.: Statistical information and discrimination. IEEE Trans. Inf. Theory 39, 1036–1039 (1993)
Pal, S., Wong, T.-K.L.: The geometry of relative arbitrage. Math. Financ. Econ. 10, 263–293 (2016)
Pal, S., Wong, T.-K.L.: Exponentially concave functions and a new information geometry. Ann. Probab. 46(2), 1070–1113 (2018)
Pardo, L.: Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC, Boca Raton (2006)
Park, C., Basu, A.: Minimum disparity estimation: asymptotic normality and breakdown point results. Bull. Inf. Kybern. 36, 19–33 (2004)
Patra, S., Maji, A., Basu, A., Pardo, L.: The power divergence and the density power divergence families: the mathematical connection. Sankhya 75-B Part 1, 16–28 (2013)
Peyré, G., Cuturi, M.: Computational Optimal Transport (2018). arXiv:1803.00567v1
Read, T.R.C., Cressie, N.A.C.: Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer, New York (1988)
Reid, M.D., Williamson, R.C.: Information, divergence and risk for binary experiments. J. Mach. Learn. Res. 12, 731–817 (2011)
Roensch, B., Stummer, W.: 3D insights to some divergences for robust statistics and machine learning. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 460–469. Springer International (2017)
Rüschendorf, L.: On the minimum discrimination information system. Stat. Decis. Suppl. Issue 1, 263–283 (1984)
Scott, D.W.: Multivariate Density Estimation - Theory, Practice and Visualization, 2nd edn. Wiley, Hoboken (2015)
Scott, D.W., Wand, M.P.: Feasibility of multivariate density estimates. Biometrika 78(1), 197–205 (1991)
Stummer, W.: On a statistical information measure of diffusion processes. Stat. Decis. 17, 359–376 (1999)
Stummer, W.: On a statistical information measure for a generalized Samuelson-Black-Scholes model. Stat. Decis. 19, 289–314 (2001)
Stummer, W.: Exponentials, Diffusions, Finance. Entropy and Information. Shaker, Aachen (2004)
Stummer, W.: Some Bregman distances between financial diffusion processes. Proc. Appl. Math. Mech. 7(1), 1050503–1050504 (2007)
Stummer, W., Kißlinger, A.-L.: Some new flexibilizations of Bregman divergences and their asymptotics. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 514–522. Springer International (2017)
Stummer, W., Vajda, I.: On divergences of finite measures and their applicability in statistics and information theory. Statistics 44, 169–187 (2010)
Stummer, W., Vajda, I.: On Bregman distances and divergences of probability measures. IEEE Trans. Inf. Theory 58(3), 1277–1288 (2012)
Sugiyama, M., Suzuki, T., Kanamori, T.: Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Ann. Inst. Stat. Math. 64, 1009–1044 (2012)
Toma, A., Broniatowski, M.: Dual divergence estimators and tests: robustness results. J. Multiv. Anal. 102, 20–36 (2011)
Tsuda, K., Rätsch, G., Warmuth, M.: Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res. 6, 995–1018 (2005)
van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer, New York (1996)
Vajda, I.: Theory of Statistical Inference and Information. Kluwer, Dordrecht (1989)
Vajda, I.: Modifications of divergence criteria for applications in continuous families. Research Report No. 2230, Institute of Information Theory and Automation, Prague (2008)
Vemuri, B.C., Liu, M., Amari, S.-I., Nielsen, F.: Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med. Imag. 30(2), 475–483 (2011)
Victoria-Feser, M.-P., Ronchetti, E.: Robust estimation for grouped data. J. Am. Stat. Assoc. 92(437), 333–340 (1997)
Weller-Fahy, D.J., Borghetti, B.J., Sodemann, A.A.: A survey of distance and similarity measures used within network intrusion anomaly detection. IEEE Commun. Surv. Tutor. 17(1), 70–91 (2015)
Wu, L., Hoi, S.C.H., Jin, R., Zhu, J., Yu, N.: Learning Bregman distance functions for semi-supervised clustering. IEEE Trans. Knowl. Data Engin. 24(3), 478–491 (2012)
Zhang, J., Naudts, J.: Information geometry under monotone embedding, part I: divergence functions. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 205–214. Springer International (2017)
Zhang, J., Wang, X., Yao, L., Li, J., Shen, X.: Using Kullback-Leibler divergence to model opponents in poker. Computer Poker and Imperfect Information: Papers from the AAAI-14 Workshop (2014)
Acknowledgements
We are grateful to three anonymous referees for their very useful suggestions and comments. W. Stummer would like to thank the Sorbonne Université Pierre et Marie Curie, Paris, very much for its partial financial support.
Appendix: Proofs
Proof of Theorem 4. Assertion (1) and the “if-part” of (2) follow immediately from Theorem 1, which uses less restrictive assumptions. In order to show the “only-if” part of (2) (and the “if-part” of (2) in an alternative way), one can use the straightforwardly provable fact that Assumption 2 implies
$$\overline{\mathbbm {w}_{3} \cdot \psi _{\phi ,c}}\,(s,t) \, = \, 0 \qquad \text {if and only if} \qquad s \, = \, t \qquad \qquad (136)$$
for all \(s \in \mathscr {R}\big (\frac{P}{M_{1}}\big )\), all \(t \in \mathscr {R}\big (\frac{Q}{M_{2}}\big )\) and \(\lambda \)-a.a. \(x \in \mathscr {X}\). To proceed, assume that \(D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q) = 0\), which by the non-negativity of \(\overline{\mathbbm {w}_{3} \cdot \psi _{\phi ,c}}(\cdot ,\cdot )\) implies that \(\overline{\mathbbm {w}_{3} \cdot \psi _{\phi ,c}}\big (\frac{p(x)}{m_{1}(x)},\frac{q(x)}{m_{2}(x)}\big ) = 0\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\). From this and the “only-if” part of (136), we obtain the identity \(\frac{p(x)}{m_1(x)}=\frac{q(x)}{m_2(x)}\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\). \(\square \)
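For readers who want to see the key fact behind (136) in action, the following is a minimal numerical sanity check under simplifying assumptions (one fixed generator, no scaling functions) rather than the theorem's full generality.

```python
# Sanity check in the spirit of Theorem 4 (a sketch under simplifying
# assumptions, not the theorem's full generality): for the generator
# phi(t) = t*log(t) - t + 1, the pointwise Bregman-type integrand
# psi_phi(s,t) = phi(s) - phi(t) - phi'(t)*(s - t) vanishes iff s = t.
import numpy as np

phi  = lambda t: t * np.log(t) - t + 1.0
dphi = lambda t: np.log(t)                     # phi'(t)
psi  = lambda s, t: phi(s) - phi(t) - dphi(t) * (s - t)

s = np.linspace(0.2, 3.0, 200)
S, T = np.meshgrid(s, s)
vals = psi(S, T)
print("min psi            :", vals.min())                  # >= 0
print("max |psi(s,s)|     :", np.abs(psi(s, s)).max())     # ~ 0 on the diagonal
off_diag = vals[~np.isclose(S, T)]
print("psi > 0 off diagonal:", bool((off_diag > 0).all()))
```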
Proof of Theorem 5. Consistent with Theorem 1 (and our adaptations), the “if-part” follows from (51). By our above investigations on the adaptations of Assumption 2 to the current context, it remains to investigate the “only-if” part of (2) for the following four cases (recall that \(\phi \) is strictly convex at \(t=1\)): (ia) \(\phi \) is differentiable at \(t=1\) (hence, c is obsolete and \(\phi _{+,c}^{\prime }(1)\) collapses to \(\phi ^{\prime }(1)\)) and the function \(\phi \) is affine linear on [1, s] for some \(s \in \mathscr {R}\big (\frac{P}{Q}\big )\backslash [a,1]\); (ib) \(\phi \) is differentiable at \(t=1\), and the function \(\phi \) is affine linear on [s, 1] for some \(s \in \mathscr {R}\big (\frac{P}{Q}\big )\backslash [1,b]\); (ii) \(\phi \) is not differentiable at \(t=1\), \(c=1\), and the function \(\phi \) is affine linear on [1, s] for some \(s \in \mathscr {R}\big (\frac{P}{Q}\big )\backslash [a,1]\); (iii) \(\phi \) is not differentiable at \(t=1\), \(c=0\), and the function \(\phi \) is affine linear on [s, 1] for some \(s \in \mathscr {R}\big (\frac{P}{Q}\big )\backslash [1,b]\). It is easy to see from the strict convexity at 1 that for (ii) one has \(\phi (0) + \phi _{+,1}^{\prime }(1) - \phi (1) >0\), whereas for (iii) one gets \(\phi ^{*}(0) -\phi _{+,0}^{\prime }(1) >0\); furthermore, for (ia) there holds \(\phi (0) + \phi ^{\prime }(1) - \phi (1) >0\) and for (ib) \(\phi ^{*}(0) -\phi ^{\prime }(1) >0\). Let us first examine the situations (ia) respectively (ii) under the constraint \(D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q})= 0\) with \(c=1\) respectively (in case of differentiability) obsolete c, for which we can deduce from (51)
$$0 \, = \, D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) \; \ge \; \big ( \phi (0) + \phi _{+,c}^{\prime }(1) - \phi (1) \big ) \cdot \int _{{\mathscr {X}}} \varvec{1}_{]\mathbbm {p}(x),\infty [}\big (\mathbbm {q}(x)\big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \; \ge \; 0$$
and hence \(\int _{{\mathscr {X}}} \varvec{1}_{]\mathbbm {p}(x),\infty [}\big (\mathbbm {q}(x)\big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \, = 0\). From this and (55) we obtain
$$\big ( \phi ^{*}(0) - \phi _{+,c}^{\prime }(1) \big ) \cdot \int _{{\mathscr {X}}} \varvec{1}_{]\mathbbm {q}(x),\infty [}\big (\mathbbm {p}(x)\big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \, = \, 0$$
and therefore \(\int _{{\mathscr {X}}} \varvec{1}_{]\mathbbm {q}(x),\infty [}\big (\mathbbm {p}(x)\big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \, = 0\). Since for \(\lambda \)-a.a. \(x\in \mathscr {X}\) we have \(\mathbbm {r}(x) >0\), we arrive at \(\mathbbm {p}(x) =\mathbbm {q}(x)\) for \(\lambda \)-a.a. \(x\in \mathscr {X}\). The remaining cases (ib) respectively (iii) can be treated analogously. \(\square \)
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this chapter
Broniatowski, M., Stummer, W. (2019). Some Universal Insights on Divergences for Statistics, Machine Learning and Artificial Intelligence. In: Nielsen, F. (ed.) Geometric Structures of Information. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-030-02520-5_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02519-9
Online ISBN: 978-3-030-02520-5