Conditioning, Mutual Information, and Information Gain

Novelty, Information and Surprise

Abstract

In this chapter we discuss the extension of three concepts of classical information theory, namely conditional information, transinformation (also called mutual information), and information gain (also called Kullback–Leibler distance), from descriptions to (reasonably large classes of) covers. This extension also carries these concepts over from discrete to continuous random variables.
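
For orientation, the classical discrete versions of these three quantities, which the chapter generalizes, read as follows (standard textbook definitions; the symbols used here need not coincide with the book's own notation). For discrete random variables X and Y with joint distribution p(x, y), and for two probability distributions p and q on the same finite set,

\[
H(X \mid Y) \;=\; -\sum_{x,y} p(x,y)\,\log p(x \mid y),
\qquad
I(X;Y) \;=\; \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)},
\]
\[
D(p \,\Vert\, q) \;=\; \sum_{x} p(x)\,\log\frac{p(x)}{q(x)},
\]

where \(I(X;Y)\) is the transinformation (mutual information) and \(D(p\,\Vert\,q)\) the information gain (Kullback–Leibler distance), with logarithms usually taken to base 2.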

Notes

  1. See Chow (1996), for example.

  2. Cf. Sect. 3.3, Proposition 10.6, and Chap. 17. \(\alpha \vee \beta\) should be the smallest repertoire that is larger than α and β. This of course depends on the ordering \(\leq\) of repertoires or of subclasses of repertoires (see Chaps. 15 and 16). For repertoires with the ordering defined in Def. 10.3, \(\alpha \vee \beta\) is simply \(\alpha \cup \beta\), whereas for templates it turns out to be \(\alpha \cdot \beta\) (cf. Chap. 16).

  3. See Bauer (1972), for example.

  4. By the Radon–Nikodym theorem (e.g., Bauer 1972).

  5. \(p(A\vert X)\) is a random variable that depends on the value of X, i.e., \(p(A\vert X) = f(X)\), where \(f(x) = p(A\vert [X = x])\), which can be properly defined for almost every \(x \in R(X)\).
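
     As an elementary illustration of this footnote (a standard fact, not specific to the book's construction): averaging the random variable \(p(A\vert X)\) over the distribution of X recovers the unconditional probability,
     \[
     \mathbb{E}\bigl[p(A\vert X)\bigr] \;=\; \sum_{x \in R(X)} p([X=x])\; p(A\vert [X=x]) \;=\; p(A),
     \]
     i.e., the law of total probability; for continuous X the sum is replaced by an integral with respect to the distribution of X.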

References

  1. Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, 16(3), 299–307.

  2. Amari, S. (1982). Differential geometry of curved exponential families—curvature and information loss. Annals of Statistics, 10, 357–385.

  3. Amari, S. (1985). Differential-geometrical methods in statistics. New York: Springer.

  4. Amari, S., & Nagaoka, H. (2000). Methods of information geometry. USA: AMS and Oxford University Press.

  5. Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems (Vol. 8) (pp. 757–763). Cambridge: MIT Press.

  6. Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network: Computation in Neural Systems, 3, 213–251.

  7. Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311.

  8. Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. Neural Networks, 5, 537–550.

  9. Bauer, H. (1972). Probability theory and elements of measure theory. New York: Holt, Rinehart and Winston.

  10. Brown, G. (2009). A new perspective for information theoretic feature selection. In Proceedings of the 12th international conference on artificial intelligence and statistics (AI-STATS 2009).

  11. Chow, S. L. (1996). Statistical significance: Rationale, validity and utility. London: Sage Publications.

  12. Coulter, W. K., Hillar, C. J., & Sommer, F. T. (2009). Adaptive compressed sensing—a new class of self-organizing coding models for neuroscience. arXiv:0906.1202v1.

  13. Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. London: Wiley.

  14. Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. MA: MIT Press.

  15. Deco, G., & Obradovic, D. (1996). An Information-theoretic approach to neural computing. New York: Springer.

  16. Erdogmus, D., Principe, J. C., & Hild II, K. E. (2003). On-line entropy manipulation: Stochastic information gradient. IEEE Signal Processing Letters, 10(8), 242–245.

  17. Grosse, I., Herzel, H., Buldyrev, S., & Stanley, H. (2000). Species independence of mutual information in coding and noncoding DNA. Physical Review E, 61(5), 5624–5629.

  18. Herzel, H., Ebeling, W., & Schmitt, A. (1994). Entropies of biosequences: The role of repeats. Physical Review E, 50(6), 5061–5071.

  19. Hinton, G., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B: Biological Sciences, 352(1358), 1177–1190.

  20. Hyvärinen, A. (2002). An alternative approach to infomax and independent component analysis. Neurocomputing, 44–46, 1089–1097.

  21. Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4), 620–630.

  22. Jaynes, E. T. (1982). On the rationale of maximum entropy methods. Proceedings IEEE, 70, 939–952.

  23. Kamimura, R. (2002). Information theoretic neural computation. New York: World Scientific.

  24. Kolmogorov, A. N. (1956). On the Shannon theory of information transmission in the case of continuous signals. IRE Transactions on Information Theory, IT-2, 102–108.

  25. Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.

  26. Linsker, R. (1989b). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1(3), 402–411.

  27. Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation, 4, 691–702.

  28. Linsker, R. (1997). A local learning rule that enables information maximization for arbitrary input distributions. Neural Computation, 9, 1661–1665.

  29. MacKay, D. J. C. (2005). Information theory, inference, and learning algorithms. UK: Cambridge University Press.

  30. Mac Dónaill, D. (2009). Molecular informatics: Hydrogen-bonding, error-coding, and genetic replication. In 43rd Annual Conference on Information Sciences and Systems (CISS 2009). Baltimore, MD.

  31. Mongillo, G., & Denève, S. (2008). On-line learning with hidden Markov models. Neural Computation, 20, 1706–1716.

  32. Ozertem, U., Erdogmus, D., & Jenssen, R. (2006). Spectral feature projections that maximize Shannon mutual information with class labels. Pattern Recognition, 39(7), 1241–1252.

  33. Pearlmutter, B. A., & Hinton, G. E. (1987). G-maximization: An unsupervised learning procedure for discovering regularities. In J. S. Denker (Ed.), AIP conference proceedings 151 on neural networks for computing (pp. 333–338). Woodbury: American Institute of Physics Inc.

  34. Principe, J. C., Fischer III, J., & Xu, D. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering (pp. 265–319). New York: Wiley.

  35. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.

  36. Schmitt, A. O., & Herzel, H. (1997). Estimating the entropy of DNA sequences. Journal of Theoretical Biology, 188(3), 369–377.

  37. Slonim, N., Atwal, G., Tkačik, G., & Bialek, W. (2005). Estimating mutual information and multi-information in large networks. arXiv:cs/0502017v1.

  38. Taylor, S. F., Tishby, N., & Bialek, W. (2007). Information and fitness. arXiv:0712.4382v1.

  39. Tkačik, G., & Bialek, W. (2007). Cell biology: Networks, regulation, pathways. In R. A. Meyers (Ed.), Encyclopedia of complexity and systems science (pp. 719–741). Berlin: Springer. arXiv:0712.4385 [q-bio.MN]

  40. Torkkola, K., & Campbell, W. M. (2000). Mutual information in learning feature transformations. In ICML ’00: Proceedings of the Seventeenth International Conference on Machine Learning (pp. 1015–1022). San Francisco: Morgan Kaufmann.

  41. Weiss, O., Jiménez-Montano, M., & Herzel, H. (2000). Information content of protein sequences. Journal of Theoretical Biology, 206, 379–386.

  42. Zemel, R. S., & Hinton, G. E. (1995). Learning population codes by minimizing description length. Neural Computation, 7, 549–564.


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Palm, G. (2012). Conditioning, Mutual Information, and Information Gain. In: Novelty, Information and Surprise. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29075-6_11
