Abstract
This article gives an introduction to information geometry and surveys its applications in the areas of machine learning, optimization and statistical inference. Information geometry is explained intuitively through divergence functions introduced on a manifold of probability distributions and on other general manifolds. A divergence function induces a Riemannian structure together with a pair of dually coupled affine connections, and many manifolds of practical interest turn out to be dually flat. When a manifold is dually flat, a generalized Pythagorean theorem and a related projection theorem hold; these provide useful tools for a variety of approximation and optimization problems. We apply them to alternating minimization problems, Ying-Yang machines and the belief propagation algorithm in machine learning.
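As a concrete illustration of the generalized Pythagorean theorem, consider the Kullback-Leibler divergence (a special case of the divergence functions above). For a joint distribution p(x, y), the product of its marginals is the m-projection of p onto the submanifold of independent distributions, and for any other independent distribution q the divergence decomposes additively. The sketch below (a minimal numerical check, not from the article itself; the distributions are arbitrary examples) verifies this decomposition:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Joint distribution p(x, y) on a 2x2 space, flattened row-major.
p = [0.3, 0.2, 0.1, 0.4]

# Marginals of p.
p1 = [p[0] + p[1], p[2] + p[3]]   # p(x)
p2 = [p[0] + p[2], p[1] + p[3]]   # p(y)

# r = m-projection of p onto the independent distributions:
# the product of its own marginals.
r = [p1[i] * p2[j] for i in range(2) for j in range(2)]

# q = an arbitrary independent distribution q1(x) * q2(y).
q1, q2 = [0.6, 0.4], [0.5, 0.5]
q = [q1[i] * q2[j] for i in range(2) for j in range(2)]

# Generalized Pythagorean theorem: D(p||q) = D(p||r) + D(r||q),
# since the geodesic from p to r is "orthogonal" to that from r to q.
lhs = kl(p, q)
rhs = kl(p, r) + kl(r, q)
print(f"D(p||q) = {lhs:.6f},  D(p||r) + D(r||q) = {rhs:.6f}")
assert abs(lhs - rhs) < 1e-12
```

The same decomposition underlies the projection theorem used in the alternating minimization applications mentioned above: each projection step minimizes one divergence term while the other is held fixed.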
Shun-ichi Amari was born in Tokyo, Japan, on January 3, 1936. He graduated from the Graduate School of the University of Tokyo in 1963, majoring in mathematical engineering, and received the degree of Doctor of Engineering. He worked as an Associate Professor at Kyushu University and the University of Tokyo, then as a Full Professor at the University of Tokyo, where he is now Professor Emeritus. He moved to the RIKEN Brain Science Institute, served as its Director for five years, and is now Senior Advisor. He has been engaged in research across wide areas of mathematical science and engineering, such as topological network theory, differential geometry of continuum mechanics, pattern recognition, and information sciences. In particular, he has devoted himself to the mathematical foundations of neural networks, including statistical neurodynamics, dynamical theory of neural fields, associative memory, self-organization, and general learning theory. Another main subject of his research is information geometry, which he initiated; it applies modern differential geometry to statistical inference, information theory, control theory, stochastic reasoning, and neural networks, providing a powerful new method for the information sciences and probability theory. Dr. Amari is a past President of the International Neural Network Society and of the Institute of Electronics, Information and Communication Engineers, Japan. He received the Emanuel R. Piore Award and the Neural Networks Pioneer Award from the IEEE, the Japan Academy Award, the C&C Award and the Caianiello Memorial Award. He was the founding co-editor-in-chief of Neural Networks, among many other journals.
Cite this article
Amari, S. Information geometry in optimization, machine learning and statistical inference. Front. Electr. Electron. Eng. China 5, 241–260 (2010). https://doi.org/10.1007/s11460-010-0101-3