Some Universal Insights on Divergences for Statistics, Machine Learning and Artificial Intelligence

  • Chapter
Geometric Structures of Information

Part of the book series: Signals and Communication Technology ((SCT))

Abstract

Dissimilarity quantifiers such as divergences (e.g. Kullback–Leibler information, relative entropy) and distances between probability distributions are widely used in statistics, machine learning, information theory and adjacent artificial intelligence (AI). In contrast, some applications within these fields deal with divergences between other kinds of real-valued functions and vectors. For a broad readership, we present a corresponding unifying framework which – by its nature as a "structure on structures" – also qualifies as a basis for similarity-based multistage AI and more humanlike (robustly generalizing) machine learning. Furthermore, we discuss some specificities, subtleties and pitfalls that arise when, for example, one "moves away" from the probability context. Several subcases and examples are given, including a new approach to obtaining parameter estimators in continuous models which is based on noisy divergence minimization.
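
To fix ideas, here is a minimal numerical sketch (our own illustration, not code from the chapter; the helper name kl_divergence and the clipping constant are assumptions) of how one such divergence, the Kullback–Leibler information, quantifies the dissimilarity between two discrete probability vectors:

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler information D(p||q) = sum_i p_i * log(p_i / q_i)
    for discrete probability vectors p, q; clipping by eps is only a
    numerical device to avoid log(0) in this illustration."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # > 0, since p differs from q
print(kl_divergence(p, p))  # = 0, a divergence vanishes exactly when p = q
print(kl_divergence(q, p))  # generally differs from kl_divergence(p, q)

The lack of symmetry visible in the last line is one reason to read such a quantity as a directed degree of dissimilarity "from p to q" (cf. footnote 2 below).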

Notes

  1.

    See e.g. Weller-Fahy et al. [93].

  2.

    Alternatively, one can think of d(p, q) as the degree of proximity from p to q.

  3.

    Measurable.

  4.

    In a probabilistic approach rather than a chaos-theoretic approach.

  5.

    Where one of them may e.g. stem from training data.

  6.

    This means that there exists an \(N \in \mathscr {F}\) with \(\lambda [N]=0\) (where the empty set \(N = \emptyset \) is allowed) such that for all \(x \in \mathscr {X}\backslash N\) one has (say) \(p(x) \in ]-\infty ,\infty [\).

  7.

    As an example, let \(\mathscr {X}= \mathbb {R}\), \(\lambda = \lambda _{L}\) be the Lebesgue measure (and hence, except for rare cases, the integral turns into a Riemann integral) and ; since this qualifies as a probability density and thus is a possible candidate for in Sect. 3.3.1.2 below.

  8.

    Respectively working with canonical space representation and \(Y : = id\).

  9.

    As a side remark, let us mention here that in the special case of continuously differentiable strictly log-convex divergence generator \(\phi \), one can construct divergences which are tighter than (38) respectively (39), see Stummer and Kißlinger [82]; in a finite discrete space and for differentiable exponentially concave divergence generator \(\phi \), a similar tightening (called L-divergence) can be found in Pal and Wong [66, 67].

  10.

    The first, second and third integral in (41) can be interpreted, respectively, as the divergence contribution of the function (support) overlap, of one part of the function non-overlap (e.g. describing "extreme outliers"), and of the other part of the function non-overlap (e.g. describing "extreme inliers"); a schematic of this three-part split is sketched at the end of this list.

  11.

    This can be interpreted analogously to footnote 10.

  12.

    That is, the properties (D1) and (D2) (respectively (D2); respectively (D1), (D2) and (D3)) are satisfied.

  13.

    E.g. applying the divergence (46) for \(\alpha \in \mathbb {R}\backslash \{0,1\}\), the sum-entry appears, which can be viewed as penalty for the cell x being empty of data observations (“intrinsic empty-cell-penalty”); for divergence (60), the penalty is .

  14.

    In several situations, such a conversion can appear in a natural way; e.g. an institution may generate/collect data of "continuous value" but, for reasons of confidentiality (information asymmetry), release them to external data analysts only as group frequencies; a minimal numerical sketch of estimation from such group frequencies is given at the end of this list.

  15.

    In an encompassing way, the part (a) reflects a measure-theoretic “plug-in” version of decomposable pseudo-divergences \(D: (\mathscr {P}^{meas,\lambda _1} \cup \mathscr {P}^{meas,\lambda _2}) \otimes \mathscr {P}^{meas,\lambda _1} \mapsto \mathbb {R}\), where \(\mathscr {P}^{meas,\lambda _1}\) is a family of mutually equivalent nonnegative measures of the form \(\mathfrak {P}[\bullet ] := \mathfrak {P}^{\mathbbm {1} \cdot \lambda _{1}}[\bullet ] : = \int _{\bullet } \mathbbm {p}(x) \, \mathrm {d}\lambda _{1}(x)\), \(\mathscr {P}^{meas,\lambda _2}\) is a family of nonnegative measures of the form \(\overline{\mathfrak {P}}[\bullet ]:= \overline{\mathfrak {P}}^{\mathbbm {1} \cdot \lambda _{2}}[\bullet ] : = \int _{\bullet } \mathbbm {q}(x) \, \mathrm {d}\lambda _{2}(x)\) such that any \(\mathfrak {P} \in \mathscr {P}^{meas,\lambda _1}\) is not equivalent to any \(\overline{\mathfrak {P}} \in \mathscr {P}^{meas,\lambda _2}\), and (101) is replaced with \(D(\mathfrak {P},\mathfrak {Q})=\mathfrak {D}^{0}(\mathfrak {P})+\mathfrak {D}^{1}(\mathfrak {Q}) +\int _{\mathscr {X}} \rho _{\mathfrak {Q}}(x) \, \mathrm {d}\mathfrak {P}(x) \quad \text {for all }\mathbbm {P} \in \mathfrak {P} \in \mathscr {P}^{meas,\lambda _1}\cup \mathscr {P}^{meas,\lambda _2}, \mathfrak {Q} \in \mathscr {P}^{meas,\lambda _2}\); cf. Vajda [90], Broniatowski and Vajda [18], Broniatowski et al. [19]; part (b) is new.
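
For orientation: schematically (our own assumed illustration, not a formula from the chapter; the concrete integrands, indicated by dots, are exactly those appearing in (41)), the three-part split of footnote 10 separates such a divergence according to where the two functions p and q are strictly positive,

$$\begin{aligned} D \;=\; \underbrace{\int_{\{p>0,\,q>0\}} (\cdots) \, \mathrm{d}\lambda}_{\text{(support-)overlap part}} \;+\; \underbrace{\int_{\{p>0,\,q=0\}} (\cdots) \, \mathrm{d}\lambda}_{\text{non-overlap part (outlier-type)}} \;+\; \underbrace{\int_{\{p=0,\,q>0\}} (\cdots) \, \mathrm{d}\lambda}_{\text{non-overlap part (inlier-type)}} \end{aligned}$$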

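Complementing footnote 14, here is a minimal numerical sketch of estimating a parameter from such masked group frequencies by minimizing a divergence between empirical and model cell probabilities. All concrete choices below (normal location model, equidistant cells, the Kullback–Leibler divergence, SciPy's bounded scalar minimizer) are our own illustrative assumptions, not prescriptions from the chapter:

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
raw = rng.normal(loc=1.5, scale=1.0, size=2000)      # "continuous-value" raw data

# The data holder only releases cell frequencies (information asymmetry).
edges = np.linspace(-3.0, 6.0, 19)                    # 18 equidistant cells
counts, _ = np.histogram(raw, bins=edges)
p_emp = counts / counts.sum()                         # empirical cell probabilities

def model_cell_probs(theta):
    """Cell probabilities of an N(theta, 1) model under the same grouping."""
    q = np.diff(norm.cdf(edges, loc=theta, scale=1.0))
    return q / q.sum()                                # renormalize mass lost in the tails

def grouped_kl(theta, eps=1e-12):
    """Kullback-Leibler divergence between empirical and model cell probabilities."""
    p = np.clip(p_emp, eps, None)
    q = np.clip(model_cell_probs(theta), eps, None)
    return float(np.sum(p * np.log(p / q)))

res = minimize_scalar(grouped_kl, bounds=(-2.0, 5.0), method="bounded")
print("minimum-divergence estimate of the location parameter:", res.x)

Empty cells are merely clipped away numerically here; by contrast, the divergences discussed in the chapter carry an intrinsic empty-cell penalty (cf. footnote 13).
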
References

  1. Amari, S.-I.: Information Geometry and Its Applications. Springer, Japan (2016)

  2. Amari, S.-I., Karakida, R., Oizumi, M.: Information geometry connecting Wasserstein distance and Kullback-Leibler divergence via the entropy-relaxed transportation problem. Info. Geo. (2018). https://doi.org/10.1007/s41884-018-0002-8

  3. Amari, S.-I., Nagaoka, H.: Methods of Information Geometry. Oxford University Press, Oxford (2000)

  4. Ali, M.S., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 28, 131–140 (1966)

  5. Al Mohamad, D.: Towards a better understanding of the dual representation of phi divergences. Stat. Papers (2016). https://doi.org/10.1007/s00362-016-0812-5

  6. Avlogiaris, G., Micheas, A., Zografos, K.: On local divergences between two probability measures. Metrika 79, 303–333 (2016)

  7. Avlogiaris, G., Micheas, A., Zografos, K.: On testing local hypotheses via local divergence. Stat. Methodol. 31, 20–42 (2016)

  8. Ay, N., Jost, J., Le, H.V., Schwachhöfer, L.: Information Geometry. Springer, Berlin (2017)

  9. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)

  10. Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C.: Robust and efficient estimation by minimizing a density power divergence. Biometrika 85(3), 549–559 (1998)

  11. Basu, A., Lindsay, B.G.: Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Ann. Inst. Stat. Math. 46(4), 683–705 (1994)

  12. Basu, A., Mandal, A., Martin, N., Pardo, L.: Robust tests for the equality of two normal means based on the density power divergence. Metrika 78, 611–634 (2015)

  13. Basu, A., Shioya, H., Park, C.: Statistical Inference: The Minimum Distance Approach. CRC Press, Boca Raton (2011)

  14. Birkhoff, G.D.: A set of postulates for plane geometry, based on scale and protractor. Ann. Math. 33(2), 329–345 (1932)

  15. Boissonnat, J.-D., Nielsen, F., Nock, R.: Bregman Voronoi diagrams. Discret. Comput. Geom. 44(2), 281–307 (2010)

  16. Broniatowski, M., Keziou, A.: Minimization of \(\phi \)-divergences on sets of signed measures. Stud. Sci. Math. Hungar. 43, 403–442 (2006)

  17. Broniatowski, M., Keziou, A.: Parametric estimation and tests through divergences and the duality technique. J. Multiv. Anal. 100(1), 16–36 (2009)

  18. Broniatowski, M., Vajda, I.: Several applications of divergence criteria in continuous families. Kybernetika 48(4), 600–636 (2012)

  19. Broniatowski, M., Toma, A., Vajda, I.: Decomposable pseudodistances in statistical estimation. J. Stat. Plan. Inf. 142, 2574–2585 (2012)

  20. Buckland, M.K.: Information as thing. J. Am. Soc. Inf. Sci. 42(5), 351–360 (1991)

  21. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning and Games. Cambridge University Press, Cambridge (2006)

  22. Chhogyal, K., Nayak, A., Sattar, A.: On the KL divergence of probability mixtures for belief contraction. In: Hölldobler,S., et al. (eds.) KI 2015: Advances in Artificial Intelligence. Lecture Notes in Artificial Intelligence, vol. 9324, pp. 249–255. Springer International Publishing (2015)

  23. Cliff, O.M., Prokopenko, M., Fitch, R.: An information criterion for inferring coupling in distributed dynamical systems. Front. Robot. AI 3(71). https://doi.org/10.3389/frobt.2016.00071 (2016)

  24. Cliff, O.M., Prokopenko, M., Fitch, R.: Minimising the Kullback-Leibler divergence for model selection in distributed nonlinear systems. Entropy 20(51). https://doi.org/10.3390/e20020051 (2018)

  25. Collins, M., Schapire, R.E., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. Mach. Learn. 48, 253–285 (2002)

  26. Cooper, V.N., Haddad, H.M., Shahriar, H.: Android malware detection using Kullback-Leibler divergence. Adv. Distrib. Comp. Art. Int. J., Special Issue 3(2) (2014)

  27. Csiszar, I.: Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad. Sci. A-8, 85–108 (1963)

  28. DasGupta, A.: Some results on the curse of dimensionality and sample size recommendations. Calcutta Stat. Assoc. Bull. 50(3–4), 157–178 (2000)

  29. De Groot, M.H.: Uncertainty, information and sequential experiments. Ann. Math. Stat. 33, 404–419 (1962)

  30. Ghosh, A., Basu, A.: Robust Bayes estimation using the density power divergence. Ann. Inst. Stat. Math. 68, 413–437 (2016)

  31. Ghosh, A., Basu, A.: Robust estimation in generalized linear models: the density power divergence approach. TEST 25, 269–290 (2016)

  32. Ghosh, A., Harris, I.R., Maji, A., Basu, A., Pardo, L.: A generalized divergence for statistical inference. Bernoulli 23(4A), 2746–2783 (2017)

  33. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A.: Robust Statistics: The Approach Based on Influence Functions. Wiley, New York (1986)

  34. Karakida, R., Amari, S.-I.: Information geometry of Wasserstein divergence. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 119–126. Springer International (2017)

  35. Kißlinger, A.-L., Stummer, W.: Some decision procedures based on scaled Bregman distance surfaces. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2013. Lecture Notes in Computer Science, vol. 8085, pp. 479–486. Springer, Berlin (2013)

  36. Kißlinger, A.-L., Stummer, W.: New model search for nonlinear recursive models, regressions and autoregressions. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2015. Lecture Notes in Computer Science, vol. 9389, pp. 693–701. Springer International (2015)

  37. Kißlinger, A.-L., Stummer, W.: Robust statistical engineering by means of scaled Bregman distances. In: Agostinelli, C., Basu, A., Filzmoser, P., Mukherjee, D. (eds.) Recent Advances in Robust Statistics - Theory and Applications, pp. 81–113. Springer, India (2016)

  38. Kißlinger, A.-L., Stummer, W.: A new toolkit for robust distributional change detection. Appl. Stochastic Models Bus. Ind. 34, 682–699 (2018)

  39. Kuchibhotla, A.K., Basu, A.: A general setup for minimum disparity estimation. Stat. Prob. Lett. 96, 68–74 (2015)

  40. Liese, F., Miescke, K.J.: Statistical Decision Theory: Estimation, Testing, and Selection. Springer, New York (2008)

  41. Liese, F., Vajda, I.: Convex Statistical Distances. Teubner, Leipzig (1987)

  42. Liese, F., Vajda, I.: On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 52(10), 4394–4412 (2006)

  43. Lin, N., He, X.: Robust and efficient estimation under data grouping. Biometrika 93(1), 99–112 (2006)

  44. Liu, M., Vemuri, B.C., Amari, S.-I., Nielsen, F.: Total Bregman divergence and its applications to shape retrieval. In: Proceedings of 23rd IEEE CVPR, pp. 3463–3468 (2010)

  45. Liu, M., Vemuri, B.C., Amari, S.-I., Nielsen, F.: Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Intell. 34(12), 2407–2419 (2012)

  46. Lizier, J.T.: JIDT: an information-theoretic toolkit for studying the dynamics of complex systems. Front. Robot. AI 1(11). https://doi.org/10.3389/frobt.2014.00011 (2014)

  47. Menendez, M., Morales, D., Pardo, L., Vajda, I.: Two approaches to grouping of data and related disparity statistics. Comm. Stat. - Theory Methods 27(3), 609–633 (1998)

  48. Menendez, M., Morales, D., Pardo, L., Vajda, I.: Minimum divergence estimators based on grouped data. Ann. Inst. Stat. Math. 53(2), 277–288 (2001)

  49. Menendez, M., Morales, D., Pardo, L., Vajda, I.: Minimum disparity estimators for discrete and continuous models. Appl. Math. 46(6), 439–466 (2001)

  50. Millmann, R.S., Parker, G.D.: Geometry - A Metric Approach With Models, 2nd edn. Springer, New York (1991)

  51. Minka, T.: Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research Ltd., Cambridge, UK (2005)

  52. Morales, D., Pardo, L., Vajda, I.: Digitalization of observations permits efficient estimation in continuous models. In: Lopez-Diaz, M., et al. (eds.) Soft Methodology and Random Information Systems, pp. 315–322. Springer, Berlin (2004)

  53. Morales, D., Pardo, L., Vajda, I.: On efficient estimation in continuous models based on finitely quantized observations. Comm. Stat. - Theory Methods 35(9), 1629–1653 (2006)

  54. Murata, N., Takenouchi, T., Kanamori, T., Eguchi, S.: Information geometry of U-boost and Bregman divergence. Neural Comput. 16(7), 1437–1481 (2004)

  55. Nielsen, F., Barbaresco, F. (eds.): Geometric Science of Information GSI 2013. Lecture Notes in Computer Science, vol. 8085. Springer, Berlin (2013)

  56. Nielsen, F., Barbaresco, F. (eds.): Geometric Science of Information GSI 2015. Lecture Notes in Computer Science, vol. 9389. Springer International (2015)

  57. Nielsen, F., Barbaresco, F. (eds.): Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589. Springer International (2017)

  58. Nielsen, F., Bhatia, R. (eds.): Matrix Information Geometry. Springer, Berlin (2013)

  59. Nielsen, F., Nock, R.: Bregman divergences from comparative convexity. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 639–647. Springer International (2017)

  60. Nielsen, F., Sun, K., Marchand-Maillet, S.: On Hölder projective divergences. Entropy 19, 122 (2017)

  61. Nielsen, F., Sun, K., Marchand-Maillet,S.: K-means clustering with Hölder divergences. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 856–863. Springer International (2017)

  62. Nock, R., Menon, A.K., Ong, C.S.: A scaled Bregman theorem with applications. Advances in Neural Information Processing Systems 29 (NIPS 2016), pp. 19–27 (2016)

  63. Nock, R., Nielsen, F.: Bregman divergences and surrogates for learning. IEEE Trans. Pattern Anal. Mach. Intell. 31(11), 2048–2059 (2009)

  64. Nock, R., Nielsen, F., Amari, S.-I.: On conformal divergences and their population minimizers. IEEE Trans. Inf. Theory 62(1), 527–538 (2016)

  65. Österreicher, F., Vajda, I.: Statistical information and discrimination. IEEE Trans. Inf. Theory 39, 1036–1039 (1993)

  66. Pal, S., Wong, T.-K.L.: The geometry of relative arbitrage. Math. Financ. Econ. 10, 263–293 (2016)

  67. Pal, S., Wong, T.-K.L.: Exponentially concave functions and a new information geometry. Ann. Probab. 46(2), 1070–1113 (2018)

  68. Pardo, L.: Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC, Boca Raton (2006)

  69. Park, C., Basu, A.: Minimum disparity estimation: asymptotic normality and breakdown point results. Bull. Inf. Kybern. 36, 19–33 (2004)

  70. Patra, S., Maji, A., Basu, A., Pardo, L.: The power divergence and the density power divergence families: the mathematical connection. Sankhya 75-B Part 1, 16–28 (2013)

  71. Peyre, G., Cuturi, M.: Computational Optimal Transport (2018). arXiv:1803.00567v1

  72. Read, T.R.C., Cressie, N.A.C.: Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer, New York (1988)

  73. Reid, M.D., Williamson, R.C.: Information, divergence and risk for binary experiments. J. Mach. Learn. Res. 12, 731–817 (2011)

  74. Roensch, B., Stummer, W.: 3D insights to some divergences for robust statistics and machine learning. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 460–469. Springer International (2017)

  75. Rüschendorf, L.: On the minimum discrimination information system. Stat. Decis. Suppl. Issue 1, 263–283 (1984)

  76. Scott, D.W.: Multivariate Density Estimation - Theory, Practice and Visualization, 2nd edn. Wiley, Hoboken (2015)

  77. Scott, D.W., Wand, M.P.: Feasibility of multivariate density estimates. Biometrika 78(1), 197–205 (1991)

  78. Stummer, W.: On a statistical information measure of diffusion processes. Stat. Decis. 17, 359–376 (1999)

  79. Stummer, W.: On a statistical information measure for a generalized Samuelson-Black-Scholes model. Stat. Decis. 19, 289–314 (2001)

  80. Stummer, W.: Exponentials, Diffusions, Finance. Entropy and Information. Shaker, Aachen (2004)

  81. Stummer, W.: Some Bregman distances between financial diffusion processes. Proc. Appl. Math. Mech. 7(1), 1050503–1050504 (2007)

  82. Stummer, W., Kißlinger, A-L.: Some new flexibilizations of Bregman divergences and their asymptotics. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 514–522. Springer International (2017)

  83. Stummer, W., Vajda, I.: On divergences of finite measures and their applicability in statistics and information theory. Statistics 44, 169–187 (2010)

  84. Stummer, W., Vajda, I.: On Bregman distances and divergences of probability measures. IEEE Trans. Inf. Theory 58(3), 1277–1288 (2012)

  85. Sugiyama, M., Suzuki, T., Kanamori, T.: Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Ann. Inst. Stat. Math. 64, 1009–1044 (2012)

  86. Toma, A., Broniatowski, M.: Dual divergence estimators and tests: robustness results. J. Multiv. Anal. 102, 20–36 (2011)

  87. Tsuda, K., Rätsch, G., Warmuth, M.: Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res. 6, 995–1018 (2005)

  88. van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer, Berlin (1996)

  89. Vajda, I.: Theory of Statistical Inference and Information. Kluwer, Dordrecht (1989)

  90. Vajda, I.: Modifications of divergence criteria for applications in continuous families. Research Report No. 2230, Institute of Information Theory and Automation, Prague (2008)

  91. Vemuri, B.C., Liu, M., Amari, S.-I., Nielsen, F.: Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med. Imag. 30(2), 475–483 (2011)

  92. Victoria-Feser, M.-P., Ronchetti, E.: Robust estimation for grouped data. J. Am. Stat. Assoc. 92(437), 333–340 (1997)

  93. Weller-Fahy, D.J., Borghetti, B.J., Sodemann, A.A.: A survey of distance and similarity measures used within network intrusion anomaly detection. IEEE Commun. Surv. Tutor. 17(1), 70–91 (2015)

  94. Wu, L., Hoi, S.C.H., Jin, R., Zhu, J., Yu, N.: Learning Bregman distance functions for semi-supervised clustering. IEEE Trans. Knowl. Data Engin. 24(3), 478–491 (2012)

  95. Zhang, J., Naudts, J.: Information geometry under monotone embedding, part I: divergence functions. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information GSI 2017. Lecture Notes in Computer Science, vol. 10589, pp. 205–214. Springer International (2017)

  96. Zhang, J., Wang, X., Yao, L., Li, J., Shen, X.: Using Kullback-Leibler divergence to model opponents in poker. Computer Poker and Imperfect Information: Papers from the AAAI-14 Workshop (2014)

Acknowledgements

We are grateful to three anonymous referees for their very useful suggestions and comments. W. Stummer would like to thank the Sorbonne Universite Pierre et Marie Curie, Paris, for its partial financial support.

Author information

Correspondence to Wolfgang Stummer.

Appendix: Proofs

Proof of Theorem 4. Assertion (1) and the "if-part" of (2) follow immediately from Theorem 1, which uses less restrictive assumptions. In order to show the "only-if" part of (2) (and, alternatively, the "if-part" of (2)), one can use the straightforwardly provable fact that Assumption 2 implies

$$\begin{aligned}& \overline{\mathbbm {w}_{3} \cdot \psi _{\phi ,c}}(x,s,t) \ = \ 0 \qquad \text {if and only if} \qquad s \ = \ t \end{aligned}$$
(136)

for all \(s \in \mathscr {R}\big (\frac{P}{M_{1}}\big )\), all \(t \in \mathscr {R}\big (\frac{Q}{M_{2}}\big )\) and \(\lambda \)-a.a. \(x \in \mathscr {X}\). To proceed, assume that \(D^{c}_{\phi ,M_{1},M_{2},\mathbbm {M}_{3},\lambda }(P,Q) = 0\), which by the non-negativity of \(\overline{\mathbbm {w}_{3} \cdot \psi _{\phi ,c}}(\cdot ,\cdot ,\cdot )\) implies that \(\overline{\mathbbm {w}_{3} \cdot \psi _{\phi ,c}}\big (x,\frac{p(x)}{m_{1}(x)},\frac{q(x)}{m_{2}(x)}\big ) = 0\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\). From this and the "only-if" part of (136), we obtain the identity \(\frac{p(x)}{m_1(x)}=\frac{q(x)}{m_2(x)}\) for \(\lambda \)-a.a. \(x \in \mathscr {X}\).    \(\square \)

Proof of Theorem 5. Consistent with Theorem 1 (and our adaptations), the "if-part" follows from (51). By our above investigations on the adaptations of Assumption 2 to the current context, it remains to investigate the "only-if" part of (2) for the following four cases (recall that \(\phi \) is strictly convex at \(t=1\)): (ia) \(\phi \) is differentiable at \(t=1\) (hence, c is obsolete and \(\phi _{+,c}^{\prime }(1)\) collapses to \(\phi ^{\prime }(1)\)) and the function \(\phi \) is affine linear on [1, s] for some \(s \in \mathscr {R}\big (\frac{P}{Q}\big )\backslash [a,1]\); (ib) \(\phi \) is differentiable at \(t=1\), and the function \(\phi \) is affine linear on [s, 1] for some \(s \in \mathscr {R}\big (\frac{P}{Q}\big )\backslash [1,b]\); (ii) \(\phi \) is not differentiable at \(t=1\), \(c=1\), and the function \(\phi \) is affine linear on [1, s] for some \(s \in \mathscr {R}\big (\frac{P}{Q}\big )\backslash [a,1]\); (iii) \(\phi \) is not differentiable at \(t=1\), \(c=0\), and the function \(\phi \) is affine linear on [s, 1] for some \(s \in \mathscr {R}\big (\frac{P}{Q}\big )\backslash [1,b]\). It is easy to see from the strict convexity at 1 that for (ii) one has \(\phi (0) + \phi _{+,1}^{\prime }(1) - \phi (1) >0\), whereas for (iii) one gets \(\phi ^{*}(0) -\phi _{+,0}^{\prime }(1) >0\); furthermore, for (ia) one has \(\phi (0) + \phi ^{\prime }(1) - \phi (1) >0\) and for (ib) \(\phi ^{*}(0) -\phi ^{\prime }(1) >0\). Let us first examine the situations (ia) respectively (ii) under the constraint \(D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q})= 0\), with obsolete c in case (ia) (due to differentiability) and with \(c=1\) in case (ii); from (51) we can then deduce

$$\begin{aligned}& \textstyle \textstyle 0 = D^{c}_{\phi ,\mathbbm {Q},\mathbbm {Q},\mathbbm {R}\cdot \mathbbm {Q},\lambda }(\mathbbm {P},\mathbbm {Q}) \nonumber \\& \textstyle \geqslant \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \big [ \mathbbm {q}(x) \cdot \phi \big ( { \frac{\mathbbm {p}(x)}{\mathbbm {q}(x)}}\big ) - \mathbbm {q}(x) \cdot \phi \big ( 1 \big ) - \phi _{+,c}^{\prime } \big ( 1 \big ) \cdot \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \big ] \, \nonumber \\[-0.1cm]&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \cdot \varvec{1}_{]0,\infty [}\big (\mathbbm {p}(x) \big ) \cdot \varvec{1}_{]\mathbbm {p}(x),\infty [}\big (\mathbbm {q}(x) \big ) \, \mathrm {d}\lambda (x) \nonumber \\& \textstyle + \big [ \phi (0) + \phi _{+,c}^{\prime }(1) - \phi (1) \big ] \cdot \int _{{\mathscr {X}}} \mathbbm {r}(x) \cdot \mathbbm {q}(x) \cdot \varvec{1}_{\{0\}}\big (\mathbbm {p}(x)\big ) \cdot \varvec{1}_{]\mathbbm {p}(x),\infty [}\big (\mathbbm {q}(x)\big ) \, \mathrm {d}\lambda (x) \, \geqslant 0 , \nonumber \end{aligned}$$

and hence \(\int _{{\mathscr {X}}} \varvec{1}_{]\mathbbm {p}(x),\infty [}\big (\mathbbm {q}(x)\big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \, = 0\). From this and (55) we obtain

$$\begin{aligned}& \textstyle 0 \, = \, \int _{{\mathscr {X}}} \, \big ( \mathbbm {p}(x) \, - \, \mathbbm {q}(x) \big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \, = \, \int _{{\mathscr {X}}} \, \big ( \mathbbm {p}(x) - \mathbbm {q}(x) \big ) \, \cdot \,\varvec{1}_{]\mathbbm {q}(x),\infty [}\big (\mathbbm {p}(x)\big ) \, \cdot \, \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \qquad \ \ \nonumber \end{aligned}$$

and therefore \(\int _{{\mathscr {X}}} \varvec{1}_{]\mathbbm {q}(x),\infty [}\big (\mathbbm {p}(x)\big ) \cdot \mathbbm {r}(x) \, \mathrm {d}\lambda (x) \, = 0\). Since for \(\lambda \)-a.a. \(x\in \mathscr {X}\) we have \(\mathbbm {r}(x) >0\), we arrive at \(\mathbbm {p}(x) =\mathbbm {q}(x)\) for \(\lambda \)-a.a. \(x\in \mathscr {X}\). The remaining cases (ib) respectively (iii) can be treated analogously.        \(\square \)

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Broniatowski, M., Stummer, W. (2019). Some Universal Insights on Divergences for Statistics, Machine Learning and Artificial Intelligence. In: Nielsen, F. (eds) Geometric Structures of Information. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-030-02520-5_8

  • DOI: https://doi.org/10.1007/978-3-030-02520-5_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-02519-9

  • Online ISBN: 978-3-030-02520-5

  • eBook Packages: Engineering, Engineering (R0)
