Information geometry in optimization, machine learning and statistical inference

  • Research Article
Frontiers of Electrical and Electronic Engineering in China

Abstract

This article gives an introduction to information geometry and surveys its applications in machine learning, optimization, and statistical inference. Information geometry is explained intuitively through divergence functions defined on a manifold of probability distributions and on other, more general manifolds. A divergence function induces a Riemannian metric together with a pair of dually coupled affine structures, and many of the manifolds that arise in applications are dually flat. When a manifold is dually flat, a generalized Pythagorean theorem and a related projection theorem hold; they provide useful tools for a variety of approximation and optimization problems. We apply them to alternating minimization problems, Ying-Yang machines, and the belief propagation algorithm in machine learning.
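
To make the objects named in the abstract concrete, the display below gives the canonical divergence of a dually flat manifold (a Bregman-type divergence) and the generalized Pythagorean theorem in the generic form they take in the information-geometry literature; this is a sketch of the standard statements, not a reproduction of the paper's own equations.

$$
D(P \,\|\, Q) = \psi(\theta_P) + \varphi(\eta_Q) - \theta_P \cdot \eta_Q ,
$$

where $\theta$ and $\eta$ are the two dual affine coordinate systems and $\psi$, $\varphi$ are convex potentials linked by the Legendre transform $\varphi(\eta) = \max_\theta \{\theta \cdot \eta - \psi(\theta)\}$. If the geodesic from $P$ to $Q$ (taken with respect to one of the two flat connections) meets the geodesic from $Q$ to $R$ (taken with respect to the dual connection) orthogonally at $Q$, then

$$
D(P \,\|\, R) = D(P \,\|\, Q) + D(Q \,\|\, R).
$$

For the Kullback-Leibler divergence this reduces to Csiszár's Pythagorean identity for the I-projection onto a linear (m-flat) family, which the short numerical check below verifies on three-outcome distributions; the names kl and p_star and the chosen constraint are ours, for illustration only.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# An m-flat (linear) family E = {p : p[0] = 0.5} over three outcomes,
# and a point q lying outside it.
q = np.array([0.2, 0.5, 0.3])

# I-projection of q onto E, i.e., argmin_{p in E} kl(p, q).  For this single
# linear constraint it fixes p[0] = 0.5 and rescales q's remaining mass.
p_star = np.concatenate(([0.5], 0.5 * q[1:] / q[1:].sum()))

# The Pythagorean identity holds for every p in E:
p = np.array([0.5, 0.1, 0.4])
print(kl(p, q))                       # ~0.41227
print(kl(p, p_star) + kl(p_star, q))  # same value up to rounding
```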

Author information

Corresponding author

Correspondence to Shun-ichi Amari.

Additional information

Shun-ichi Amari was born in Tokyo, Japan, on January 3, 1936. He graduated from the Graduate School of the University of Tokyo in 1963, majoring in mathematical engineering, and received the degree of Doctor of Engineering. He worked as an Associate Professor at Kyushu University and the University of Tokyo, then as a Full Professor at the University of Tokyo, where he is now Professor Emeritus. He moved to the RIKEN Brain Science Institute, where he served as Director for five years and is now Senior Advisor. He has been engaged in research in wide areas of mathematical science and engineering, such as topological network theory, differential geometry of continuum mechanics, pattern recognition, and information sciences. In particular, he has devoted himself to the mathematical foundations of neural networks, including statistical neurodynamics, the dynamical theory of neural fields, associative memory, self-organization, and general learning theory. Another main subject of his research is information geometry, which he initiated; it applies modern differential geometry to statistical inference, information theory, control theory, stochastic reasoning, and neural networks, providing a powerful new method for the information sciences and probability theory. Dr. Amari is a past President of the International Neural Network Society and of the Institute of Electronics, Information and Communication Engineers, Japan. He received the Emanuel R. Piore Award and the Neural Networks Pioneer Award from the IEEE, the Japan Academy Award, the C&C Award, and the Caianiello Memorial Award. He was the founding co-editor-in-chief of Neural Networks, among many other journals.

About this article

Cite this article

Amari, S. Information geometry in optimization, machine learning and statistical inference. Front. Electr. Electron. Eng. China 5, 241–260 (2010). https://doi.org/10.1007/s11460-010-0101-3
