Conditioning, Mutual Information, and Information Gain

  • Günther Palm


In this chapter we discuss the extension of three concepts of classical information theory, namely conditional information, transinformation (also called mutual information), and information gain (also called Kullback–Leibler divergence), from descriptions to (reasonably large classes of) covers. In doing so, these concepts are also carried over from discrete to continuous random variables.
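As an informal illustration of the discrete case only (this sketch is not taken from the chapter itself, and the function names are my own), mutual information between two discrete random variables can be computed as the Kullback–Leibler divergence between their joint distribution and the product of their marginals:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) of two discrete distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X;Y) = D(P(x,y) || P(x)P(y)) for a joint probability table, in nats."""
    px = [sum(row) for row in joint]                  # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]            # marginal of Y (columns)
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# If X and Y are independent, the joint factorizes and I(X;Y) = 0.
independent = [[0.25, 0.25], [0.25, 0.25]]
# If two binary variables are perfectly correlated, I(X;Y) = log 2 nats (1 bit).
correlated = [[0.5, 0.0], [0.0, 0.5]]
```

The extension to covers and to continuous random variables developed in the chapter is precisely what replaces these finite sums.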


Keywords: mutual information, information gain, discrete random variable, continuous random variable, additive symmetry



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Günther Palm
  1. Neural Information Processing, University of Ulm, Ulm, Germany
