Abstract
Shannon’s seminal work on information theory provided the conceptual framework for communication through noisy channels (Shannon, 1948). By quantifying the information content of coded messages, it laid the foundation for all modern systems that transmit information through any medium.
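To make that quantification concrete, the short Python sketch below (an illustration added here, not part of the chapter; the function name and example strings are our own) estimates the Shannon entropy H(X) = -sum_x p(x) log2 p(x) of a message from its empirical symbol frequencies.

```python
from collections import Counter
from math import log2

def empirical_entropy(message: str) -> float:
    """Shannon entropy in bits per symbol, from empirical symbol frequencies.

    Illustrative helper (not from the chapter): estimates H(X) by the
    plug-in formula H(X) = -sum_x p(x) * log2 p(x) over observed symbols.
    """
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Four equiprobable symbols carry log2(4) = 2 bits each;
# a skewed distribution carries less.
print(empirical_entropy("abcd" * 100))  # -> 2.0
print(empirical_entropy("aaab" * 100))  # -> ~0.811
```

Mutual information, the quantity driving many of the feature-selection methods cited below, is built from the same entropy terms, e.g. I(X;Y) = H(X) + H(Y) - H(X,Y).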
References
M.E. Aladjem. Nonparametric discriminant analysis via recursive optimization of Patrick-Fisher distance. IEEE Transactions on Systems, Man, and Cybernetics, 28(2):292–299, April 1998.
C. Aliferis, I. Tsamardinos, and A. Statnikov. HITON, a novel Markov blanket algorithm for optimal variable selection. In Proceedings of the 2003 American Medical Informatics Association (AMIA) Annual Symposium, pages 21–25, Washington, DC, USA, November 8–12, 2003.
A. Antos, L. Devroye, and L. Györfi. Lower bounds for Bayes error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(7):643–645, July 1999.
A. Banerjee, I. Dhillon, J. Ghosh, and S. Merugu. An information theoretic analysis of maximum likelihood mixture estimation for exponential families. In Proc. International Conference on Machine Learning (ICML), pages 57–64, Banff, Canada, July 2004.
R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537–550, July 1994.
S. Becker. Mutual information maximization: Models of cortical self-organization. Network: Computation in Neural Systems, 7(1), February 1996.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.
G. Chechik, A. Globerson, N. Tishby, and Y. Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6:165–188, 2005.
P.A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice Hall, London, 1982.
D. Erdogmus, K.E. Hild, and J.C. Principe. Online entropy manipulation: Stochastic information gradient. IEEE Signal Processing Letters, 10:242–245, 2003.
D. Erdogmus, J.C. Principe, and K.E. Hild. Beyond second order statistics for learning: A pairwise interaction model for entropy estimation. Natural Computing, 1:85–108, 2002.
R.M. Fano. Transmission of Information: A Statistical Theory of Communications. Wiley, New York, 1961.
M. Feder and N. Merhav. Relations between entropy and error probability. IEEE Trans. on Information Theory, 40:259–266, 1994.
M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. IEEE Trans. on Information Theory, 38:1258–1270, 1992.
F. Fleuret. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5:1531–1555, 2004.
G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289–1305, March 2003.
L. Frey, D. Fisher, I. Tsamardinos, C. Aliferis, and A. Statnikov. Identifying Markov blankets with decision tree induction. In Proceedings of the IEEE International Conference on Data Mining, Melbourne, FL, USA, November 19–22, 2003.
A. Globerson and N. Tishby. Sufficient dimensionality reduction. Journal of Machine Learning Research, 3:1307–1331, 2003.
X. Guorong, C. Peiqi, and W. Minhui. Bhattacharyya distance feature selection. In Proceedings of the 13th International Conference on Pattern Recognition, volume 2, pages 195–199. IEEE, 25–29 Aug. 1996.
T.S. Han and S. Verdú. Generalizing the Fano inequality. IEEE Trans. on Information Theory, 40(4):1147–1157, July 1994.
M.E. Hellman and J. Raviv. Probability of error, equivocation and the Chernoff bound. IEEE Transactions on Information Theory, 16:368–372, 1970.
A.O. Hero, B. Ma, O. Michel, and J. Gorman. Alpha-divergence for classification, indexing and retrieval. Technical Report CSPL-328, Communications and Signal Processing Laboratory, University of Michigan, Ann Arbor, May 2001.
J.N. Kapur. Measures of information and their applications. Wiley, New Delhi, India, 1994.
S. Kaski and J. Sinkkonen. Principle of learning metrics for data analysis. Journal of VLSI Signal Processing, special issue on Machine Learning for Signal Processing, 37:177–188, 2004.
R. Kohavi and G.H. John. Wrappers for feature subset selection. Artificial Intelligence, 97:273–324, 1997.
D. Koller and M. Sahami. Toward optimal feature selection. In Proceedings of ICML-96, 13th International Conference on Machine Learning, pages 284–292, Bari, Italy, 1996.
A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. e-print arXiv:cond-mat/0305641, 2003.
L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191–1253, 2003.
J. Peltonen and S. Kaski. Discriminative components of data. IEEE Transactions on Neural Networks, 2005.
J. Peltonen, A. Klami, and S. Kaski. Improved learning of Riemannian metrics for exploratory analysis. Neural Networks, 17:1087–1100, 2004.
J.C. Principe, J.W. Fisher III, and D. Xu. Information theoretic learning. In S. Haykin, editor, Unsupervised Adaptive Filtering. Wiley, New York, NY, 2000.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
A. Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 547–561. University of California Press, 1961.
G. Saon and M. Padmanabhan. Minimum Bayes error feature selection for continuous speech recognition. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (Proc. NIPS’00), pages 800–806. MIT Press, 2001.
C.E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 623–656, July and October 1948.
N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.
K. Torkkola. Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3:1415–1438, March 2003.
I. Tsamardinos, C. Aliferis, and A. Statnikov. Algorithms for large scale Markov blanket discovery. In Proceedings of the 16th International FLAIRS Conference, St. Augustine, Florida, USA, 2003.
I. Tsamardinos and C.F. Aliferis. Towards principled feature selection: Relevancy, filters and wrappers. In Proceedings of the Workshop on Artificial Intelligence and Statistics, 2003.
E. Tuv. Feature selection and ensemble learning. In I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature Extraction, Foundations and Applications. Springer, New York, 2005.
N. Vasconcelos. Feature selection by maximum marginal diversity: optimality and implications for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 762–772, Madison, WI, USA, 2003.
D.R. Wolf and E.I. George. Maximally informative statistics. In J.M. Bernardo, editor, Bayesian Methods in the Sciences. Real Academia de Ciencias, Madrid, Spain, 1999.
D.H. Wolpert and D.R. Wolf. Estimating functions of distributions from a finite set of samples. Phys. Rev. E, 52(6):6841–6854, 1995.
E.P. Xing, M.I. Jordan, and R.M. Karp. Feature selection for high-dimensional genomic microarray data. In Proc. 18th International Conf. on Machine Learning, pages 601–608. Morgan Kaufmann, San Francisco, CA, 2001.
Y. Yang and J.O. Pedersen. A comparative study on feature selection in text categorization. In Proc. 14th International Conference on Machine Learning, pages 412–420. Morgan Kaufmann, 1997.
L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th International Conference on Machine Learning (ICML’03), Washington, DC, 2003.
M. Zaffalon and M. Hutter. Robust feature selection by mutual information distributions. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pages 577–584, San Francisco, 2002. Morgan Kaufmann.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Torkkola, K. (2008). Information-Theoretic Methods. In: Guyon, I., Nikravesh, M., Gunn, S., Zadeh, L.A. (eds) Feature Extraction. Studies in Fuzziness and Soft Computing, vol 207. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-35488-8_7
DOI: https://doi.org/10.1007/978-3-540-35488-8_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35487-1
Online ISBN: 978-3-540-35488-8