Abstract
In this chapter we discuss the extension of three concepts of classical information theory, namely conditional information, transinformation (also called mutual information), and information gain (also called Kullback–Leibler distance), from descriptions to (reasonably large classes of) covers. In doing so, we also carry these concepts over from discrete to continuous random variables.
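For orientation, the classical discrete versions of the latter two quantities can be stated computationally; the chapter's contribution is to extend them to covers and thereby to continuous random variables. The following minimal sketch uses the textbook definitions (cf. Shannon 1948; Kullback & Leibler 1951; Cover & Thomas 1991); the function names and the example joint distribution are illustrative choices of ours, not taken from the chapter.

```python
import numpy as np

def information_gain(p, q):
    """Kullback-Leibler distance D(p||q) = sum_i p_i * log2(p_i / q_i)
    for discrete distributions p, q on the same finite set, assuming
    q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing by convention
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def transinformation(pxy):
    """Mutual information I(X;Y): the information gain of the joint
    distribution relative to the product of its marginals."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)  # marginal distribution of X
    py = pxy.sum(axis=0, keepdims=True)  # marginal distribution of Y
    return information_gain(pxy.ravel(), (px * py).ravel())

# Example: a dependent binary pair (a noisy channel), so I(X;Y) > 0.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(transinformation(joint))  # ~0.278 bits
```

Note that transinformation appears here as a special case of information gain, namely the gain of the joint distribution over the product of its marginals; relationships of this kind are what the chapter carries over from descriptions to covers.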
Notes
- 1. See Chow (1996) for example.
- 2. Cf. Sect. 3.3, Proposition 10.6, and Chap. 17. \(\alpha \vee \beta\) should be the smallest repertoire that is larger than both α and β. This of course depends on the ordering \(\leq\) of repertoires or of subclasses of repertoires (see Chaps. 15 and 16). For repertoires, \(\alpha \vee \beta\) under the ordering defined in Def. 10.3 is simply \(\alpha \cup \beta\), whereas for templates it turns out to be \(\alpha \cdot \beta\) (cf. Chap. 16).
- 3. See Bauer (1972) for example.
- 4. By the Radon–Nikodym theorem (e.g., Bauer 1972).
- 5. \(p(A\vert X)\) is a random variable that depends on the value of \(X\), i.e., \(p(A\vert X) = f(X)\), where \(f(x) = p(A\vert [X = x])\), which can be properly defined for almost every \(x \in R(X)\); the displayed equation after these notes spells this out.
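Spelled out in the standard measure-theoretic way (a textbook formulation in the spirit of Bauer 1972, not a quotation from this chapter), \(f\) is, up to \(P_X\)-almost-everywhere equality, the unique measurable function with

\[ P(A \cap \{X \in B\}) = \int_B f(x)\, P_X(dx) \quad \text{for every measurable } B \subseteq R(X), \]

where \(P_X\) denotes the distribution of \(X\); its existence follows from the Radon–Nikodym theorem of note 4, because \(B \mapsto P(A \cap \{X \in B\})\) is absolutely continuous with respect to \(P_X\).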
References
Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, 16(3), 299–307.
Amari, S. (1982). Differential geometry of curved exponential families—curvature and information loss. Annals of Statistics, 10, 357–385.
Amari, S. (1985). Differential-geometrical methods in statistics. New York: Springer.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. Providence: American Mathematical Society and Oxford University Press.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems (Vol. 8) (pp. 757–763). Cambridge: MIT Press.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network: Computation in Neural Systems, 3, 213–251.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4), 537–550.
Bauer, H. (1972). Probability theory and elements of measure theory. New York: Holt, Rinehart and Winston.
Brown, G. (2009). A new perspective for information theoretic feature selection. In Proceedings of the 12th international conference on artificial intelligence and statistics (AI-STATS 2009).
Chow, S. L. (1996). Statistical significance: Rationale, validity and utility. London: Sage Publications.
Coulter, W. K., Hillar, C. J., & Sommer, F. T. (2009). Adaptive compressed sensing—a new class of self-organizing coding models for neuroscience. arXiv:0906.1202v1.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. London: Wiley.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge: MIT Press.
Deco, G., & Obradovic, D. (1996). An Information-theoretic approach to neural computing. New York: Springer.
Erdogmus, D., Principe, J. C., & Hild II, K. E. (2003). On-line entropy manipulation: Stochastic information gradient. IEEE Signal Processing Letters, 10(8), 242–245.
Grosse, I., Herzel, H., Buldyrev, S., & Stanley, H. (2000). Species independence of mutual information in coding and noncoding DNA. Physical Review E, 61(5), 5624–5629.
Herzel, H., Ebeling, W., & Schmitt, A. (1994). Entropies of biosequences: The role of repeats. Physical Review E, 50(6), 5061–5071.
Hinton, G., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B: Biological Sciences, 352(1358), 1177–1190.
Hyvärinen, A. (2002). An alternative approach to infomax and independent component analysis. Neurocomputing, 44–46, 1089–1097.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4), 620–630.
Jaynes, E. T. (1982). On the rationale of maximum entropy methods. Proceedings IEEE, 70, 939–952.
Kamimura, R. (2002). Information theoretic neural computation. New York: World Scientific.
Kolmogorov, A. N. (1956). On the Shannon theory of information transmission in the case of continuous signals. IRE Transactions on Information Theory, IT-2, 102–108.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
Linsker, R. (1989b). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1(3), 402–411.
Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation, 4, 691–702.
Linsker, R. (1997). A local learning rule that enables information maximization for arbitrary input distributions. Neural Computation, 9, 1661–1665.
MacKay, D. J. C. (2005). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press.
Mac Dónaill, D. (2009). Molecular informatics: Hydrogen-bonding, error-coding, and genetic replication. In 43rd Annual Conference on Information Sciences and Systems (CISS 2009), Baltimore, MD.
Mongillo, G., & Denève, S. (2008). On-line learning with hidden Markov models. Neural Computation, 20, 1706–1716.
Ozertem, U., Erdogmus, D., & Jenssen, R. (2006). Spectral feature projections that maximize Shannon mutual information with class labels. Pattern Recognition, 39(7), 1241–1252.
Pearlmutter, B. A., & Hinton, G. E. (1987). G-maximization: An unsupervised learning procedure for discovering regularities. In J. S. Denker (Ed.), AIP conference proceedings 151 on neural networks for computing (pp. 333–338). Woodbury: American Institute of Physics Inc.
Principe, J. C., Fisher III, J. W., & Xu, D. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering (pp. 265–319). New York: Wiley.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Schmitt, A. O., & Herzel, H. (1997). Estimating the entropy of DNA sequences. Journal of Theoretical Biology, 188(3), 369–377.
Slonim, N., Atwal, G., Tkačik, G., & Bialek, W. (2005). Estimating mutual information and multi-information in large networks. arXiv:cs/0502017v1.
Taylor, S. F., Tishby, N., & Bialek, W. (2007). Information and fitness. arXiv:0712.4382v1.
Tkačik, G., & Bialek, W. (2007). Cell biology: Networks, regulation, pathways. In R. A. Meyers (Ed.), Encyclopedia of complexity and systems science (pp. 719–741). Berlin: Springer. arXiv:0712.4385 [q-bio.MN].
Torkkola, K., & Campbell, W. M. (2000). Mutual information in learning feature transformations. In ICML ’00: Proceedings of the Seventeenth International Conference on Machine Learning (pp. 1015–1022). San Francisco: Morgan Kaufmann.
Weiss, O., Jiménez-Montano, M., & Herzel, H. (2000). Information content of protein sequences. Journal of Theoretical Biology, 206, 379–386.
Zemel, R. S., & Hinton, G. E. (1995). Learning population codes by minimizing description length. Neural Computation, 7, 549–564.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Palm, G. (2012). Conditioning, Mutual Information, and Information Gain. In: Novelty, Information and Surprise. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29075-6_11
DOI: https://doi.org/10.1007/978-3-642-29075-6_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29074-9
Online ISBN: 978-3-642-29075-6
eBook Packages: Mathematics and Statistics (R0)