Abstract
In this chapter we discuss the extension of three concepts of classical information theory, namely conditional information, transinformation (also called mutual information), and information gain (also called Kullback–Leibler distance), from descriptions to (reasonably large classes of) covers. In doing so, we also carry these concepts over from discrete to continuous random variables.
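For orientation, the classical discrete versions of the latter two quantities can be stated computationally; the chapter's contribution is to extend them to covers and thereby to continuous random variables. The following minimal sketch uses the textbook definitions (cf. Shannon 1948; Kullback & Leibler 1951; Cover & Thomas 1991); the function names and the example joint distribution are illustrative choices of ours, not taken from the chapter.

```python
import numpy as np

def information_gain(p, q):
    """Kullback-Leibler distance D(p||q) = sum_i p_i * log2(p_i / q_i)
    for discrete distributions p, q on the same finite set, assuming
    q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing by convention
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def transinformation(pxy):
    """Mutual information I(X;Y): the information gain of the joint
    distribution relative to the product of its marginals."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)  # marginal distribution of X
    py = pxy.sum(axis=0, keepdims=True)  # marginal distribution of Y
    return information_gain(pxy.ravel(), (px * py).ravel())

# Example: a dependent binary pair (a noisy channel), so I(X;Y) > 0.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(transinformation(joint))  # ~0.278 bits
```

Note that transinformation appears here as a special case of information gain, namely the gain of the joint distribution over the product of its marginals; relationships of this kind are what the chapter carries over from descriptions to covers.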
Notes
- 1. See Chow (1996) for example.
- 2. Cf. Sect. 3.3, Proposition 10.6, and Chap. 17. \(\alpha \vee \beta\) should be the smallest repertoire that is larger than both α and β. This of course depends on the ordering \(\leq\) of repertoires or of subclasses of repertoires (see Chaps. 15 and 16). For repertoires, \(\alpha \vee \beta\) under the ordering defined in Def. 10.3 is simply \(\alpha \cup \beta\), whereas for templates it turns out to be \(\alpha \cdot \beta\) (cf. Chap. 16).
- 3. See Bauer (1972) for example.
- 4. By the Radon–Nikodym theorem (e.g., Bauer 1972).
- 5. \(p(A\vert X)\) is a random variable that depends on the value of \(X\), i.e., \(p(A\vert X) = f(X)\), where \(f(x) = p(A\vert [X = x])\), which can be properly defined for almost every \(x \in R(X)\); the displayed equation after these notes spells this out.
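Spelled out in the standard measure-theoretic way (a textbook formulation in the spirit of Bauer 1972, not a quotation from this chapter), \(f\) is, up to \(P_X\)-almost-everywhere equality, the unique measurable function with

\[ P(A \cap \{X \in B\}) = \int_B f(x)\, P_X(dx) \quad \text{for every measurable } B \subseteq R(X), \]

where \(P_X\) denotes the distribution of \(X\); its existence follows from the Radon–Nikodym theorem of note 4, because \(B \mapsto P(A \cap \{X \in B\})\) is absolutely continuous with respect to \(P_X\).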
References
Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, 16(3), 299–307.
Amari, S. (1982). Differential geometry of curved exponential families—curvature and information loss. Annals of Statistics, 10, 357–385.
Amari, S. (1985). Differential-geometrical methods in statistics. New York: Springer.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. Providence: American Mathematical Society and Oxford University Press.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems (Vol. 8) (pp. 757–763). Cambridge: MIT Press.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network: Computation in Neural Systems, 3, 213–251.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4), 537–550.
Bauer, H. (1972). Probability theory and elements of measure theory. New York: Holt, Rinehart and Winston.
Brown, G. (2009). A new perspective for information theoretic feature selection. In Proceedings of the 12th international conference on artificial intelligence and statistics (AI-STATS 2009).
Chow, S. L. (1996). Statistical significance: Rationale, validity and utility. London: Sage Publications.
Coulter, W. K., Hillar, C. J., & Sommer, F. T. (2009). Adaptive compressed sensing—a new class of self-organizing coding models for neuroscience. arXiv:0906.1202v1.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. London: Wiley.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge: MIT Press.
Deco, G., & Obradovic, D. (1996). An Information-theoretic approach to neural computing. New York: Springer.
Erdogmus, D., Principe, J. C., & Hild II, K. E. (2003). On-line entropy manipulation: Stochastic information gradient. IEEE Signal Processing Letters, 10(8), 242–245.
Grosse, I., Herzel, H., Buldyrev, S., & Stanley, H. (2000). Species independence of mutual information in coding and noncoding DNA. Physical Review E, 61(5), 5624–5629.
Herzel, H., Ebeling, W., & Schmitt, A. (1994). Entropies of biosequences: The role of repeats. Physical Review E, 50(6), 5061–5071.
Hinton, G., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B: Biological Sciences, 352(1358), 1177–1190.
Hyvärinen, A. (2002). An alternative approach to infomax and independent component analysis. Neurocomputing, 44–46, 1089–1097.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4), 620–630.
Jaynes, E. T. (1982). On the rationale of maximum entropy methods. Proceedings IEEE, 70, 939–952.
Kamimura, R. (2002). Information theoretic neural computation. New York: World Scientific.
Kolmogorov, A. N. (1956). On the Shannon theory of information transmission in the case of continuous signals. IRE Transactions on Information Theory, IT-2, 102–108.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
Linsker, R. (1989b). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1(3), 402–411.
Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation, 4, 691–702.
Linsker, R. (1997). A local learning rule that enables information maximization for arbitrary input distributions. Neural Computation, 9, 1661–1665.
MacKay, D. J. C. (2005). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press.
Mac Dónaill, D. (2009). Molecular informatics: Hydrogen-bonding, error-coding, and genetic replication. In 43rd Annual Conference on Information Sciences and Systems (CISS 2009), Baltimore, MD.
Mongillo, G., & Denève, S. (2008). On-line learning with hidden Markov models. Neural Computation, 20, 1706–1716.
Ozertem, U., Erdogmus, D., & Jenssen, R. (2006). Spectral feature projections that maximize Shannon mutual information with class labels. Pattern Recognition, 39(7), 1241–1252.
Pearlmutter, B. A., & Hinton, G. E. (1987). G-maximization: An unsupervised learning procedure for discovering regularities. In J. S. Denker (Ed.), AIP conference proceedings 151 on neural networks for computing (pp. 333–338). Woodbury: American Institute of Physics Inc.
Principe, J. C., Fisher III, J. W., & Xu, D. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering (pp. 265–319). New York: Wiley.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Schmitt, A. O., & Herzel, H. (1997). Estimating the entropy of DNA sequences. Journal of Theoretical Biology, 188(3), 369–377.
Slonim, N., Atwal, G., Tkačik, G., & Bialek, W. (2005). Estimating mutual information and multi-information in large networks. arXiv:cs/0502017v1.
Taylor, S. F., Tishby, N., & Bialek, W. (2007). Information and fitness. arXiv:0712.4382v1.
Tkačik, G., & Bialek, W. (2007). Cell biology: Networks, regulation, pathways. In R. A. Meyers (Ed.), Encyclopedia of complexity and systems science (pp. 719–741). Berlin: Springer. arXiv:0712.4385 [q-bio.MN].
Torkkola, K., & Campbell, W. M. (2000). Mutual information in learning feature transformations. In ICML ’00: Proceedings of the Seventeenth International Conference on Machine Learning (pp. 1015–1022). San Francisco: Morgan Kaufmann.
Weiss, O., Jiménez-Montano, M., & Herzel, H. (2000). Information content of protein sequences. Journal of Theoretical Biology, 206, 379–386.
Zemel, R. S., & Hinton, G. E. (1995). Learning population codes by minimizing description length. Neural Computation, 7, 549–564.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Palm, G. (2012). Conditioning, Mutual Information, and Information Gain. In: Novelty, Information and Surprise. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29075-6_11
DOI: https://doi.org/10.1007/978-3-642-29075-6_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29074-9
Online ISBN: 978-3-642-29075-6
eBook Packages: Mathematics and Statistics (R0)