Information geometry in optimization, machine learning and statistical inference

  • Shun-ichi Amari
Research Article


The present article gives an introduction to information geometry and surveys its applications in the areas of machine learning, optimization and statistical inference. Information geometry is explained intuitively by using divergence functions introduced in a manifold of probability distributions and in other general manifolds. A divergence function gives a Riemannian structure together with a pair of dual flatness criteria. Many manifolds are dually flat. When a manifold is dually flat, a generalized Pythagorean theorem and a related projection theorem hold. They provide useful means for solving various approximation and optimization problems. We apply them to alternating minimization problems, Ying-Yang machines and the belief propagation algorithm in machine learning.


information geometry, machine learning, optimization, statistical inference, divergence, graphical model, Ying-Yang machine





Copyright information

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  1. RIKEN Brain Science Institute, Saitama, Japan
