Information-Theoretic Methods

Chapter in Feature Extraction

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 207)

Abstract

Shannon’s seminal work on information theory provided the conceptual framework for communication through noisy channels (Shannon, 1948). By quantifying the information content of coded messages, it established the basis for all current systems that transmit information through any medium.
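As a minimal illustration of the quantity Shannon introduced, the sketch below computes the entropy H(X) = -sum_x p(x) log2 p(x) of an empirical distribution, in bits. It is an illustrative example only, not code from the chapter; the helper name entropy_bits and the coin-toss data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy_bits(samples):
    """Empirical Shannon entropy H(X) = -sum_x p(x) * log2(p(x)), in bits.

    `samples` is any iterable of discrete symbols; probabilities are
    estimated by relative frequency. (Hypothetical helper for illustration.)
    """
    counts = Counter(samples)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A fair coin carries one bit of information per toss; a biased coin carries less.
print(entropy_bits("HT" * 50))            # ~1.0 bit
print(entropy_bits("H" * 90 + "T" * 10))  # ~0.47 bits
```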


References

  • M.E. Aladjem. Nonparametric discriminant analysis via recursive optimization of Patrick-Fisher distance. IEEE Transactions on Systems, Man, and Cybernetics, 28(2):292–299, April 1998.
  • C. Aliferis, I. Tsamardinos, and A. Statnikov. HITON, a novel Markov blanket algorithm for optimal variable selection. In Proceedings of the 2003 American Medical Informatics Association (AMIA) Annual Symposium, pages 21–25, Washington, DC, USA, November 8–12 2003.
  • A. Antos, L. Devroye, and L. Gyorfi. Lower bounds for Bayes error estimation. IEEE Transactions on PAMI, 21(7):643–645, July 1999.
  • A. Banerjee, I. Dhillon, J. Ghosh, and S. Merugu. An information theoretic analysis of maximum likelihood mixture estimation for exponential families. In Proc. International Conference on Machine Learning (ICML), pages 57–64, Banff, Canada, July 2004.
  • R. Battiti. Using mutual information for selecting features in supervised neural net learning. Neural Networks, 5(4):537–550, July 1994.
  • S. Becker. Mutual information maximization: Models of cortical self-organization. Network: Computation in Neural Systems, 7(1), February 1996.
  • L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
  • L. Breiman, J.F. Friedman, R.A. Olshen, and P.J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.
  • G. Chechik, A. Globerson, N. Tishby, and Y. Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6:168–188, 2005.
  • P.A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice Hall, London, 1982.
  • D. Erdogmus, K.E. Hild, and J.C. Principe. Online entropy manipulation: Stochastic information gradient. IEEE Signal Processing Letters, 10:242–245, 2003.
  • D. Erdogmus, J.C. Principe, and K.E. Hild. Beyond second order statistics for learning: A pairwise interaction model for entropy estimation. Natural Computing, 1:85–108, 2002.
  • R.M. Fano. Transmission of Information: A Statistical Theory of Communications. Wiley, New York, 1961.
  • M. Feder and N. Merhav. Relations between entropy and error probability. IEEE Trans. on Information Theory, 40:259–266, 1994.
  • M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. IEEE Trans. on Information Theory, 38:1258–1270, 1992.
  • F. Fleuret. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5:1531–1555, 2004.
  • G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289–1305, March 2003.
  • L. Frey, D. Fisher, I. Tsamardinos, C. Aliferis, and A. Statnikov. Identifying Markov blankets with decision tree induction. In Proc. of IEEE Conference on Data Mining, Melbourne, FL, USA, Nov. 19–22 2003.
  • A. Globerson and N. Tishby. Sufficient dimensionality reduction. Journal of Machine Learning Research, 3:1307–1331, 2003.
  • X. Guorong, C. Peiqi, and W. Minhui. Bhattacharyya distance feature selection. In Proceedings of the 13th International Conference on Pattern Recognition, volume 2, pages 195–199. IEEE, 25–29 Aug. 1996.
  • T.S. Han and S. Verdú. Generalizing the Fano inequality. IEEE Trans. on Information Theory, 40(4):1147–1157, July 1994.
  • M.E. Hellman and J. Raviv. Probability of error, equivocation and the Chernoff bound. IEEE Transactions on Information Theory, 16:368–372, 1970.
  • A.O. Hero, B. Ma, O. Michel, and J. Gorman. Alpha-divergence for classification, indexing and retrieval. Technical Report CSPL-328, University of Michigan Ann Arbor, Communications and Signal Processing Laboratory, May 2001.
  • J.N. Kapur. Measures of Information and Their Applications. Wiley, New Delhi, India, 1994.
  • S. Kaski and J. Sinkkonen. Principle of learning metrics for data analysis. Journal of VLSI Signal Processing, special issue on Machine Learning for Signal Processing, 37:177–188, 2004.
  • R. Kohavi and G.H. John. Wrappers for feature subset selection. Artificial Intelligence, 97:273–324, 1997.
  • D. Koller and M. Sahami. Toward optimal feature selection. In Proceedings of ICML-96, 13th International Conference on Machine Learning, pages 284–292, Bari, Italy, 1996.
  • A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. e-print arXiv.org/cond-mat/0305641, 2003.
  • L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191–1253, 2003.
  • J. Peltonen and S. Kaski. Discriminative components of data. IEEE Transactions on Neural Networks, 2005.
  • J. Peltonen, A. Klami, and S. Kaski. Improved learning of Riemannian learning metrics for exploratory analysis. Neural Networks, 17:1087–1100, 2004.
  • J.C. Principe, J.W. Fisher III, and D. Xu. Information theoretic learning. In Simon Haykin, editor, Unsupervised Adaptive Filtering. Wiley, New York, NY, 2000.
  • J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
  • A. Renyi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 547–561. University of California Press, 1961.
  • G. Saon and M. Padmanabhan. Minimum Bayes error feature selection for continuous speech recognition. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (Proc. NIPS’00), pages 800–806. MIT Press, 2001.
  • C. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 623–656, July, October 1948.
  • N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.
  • K. Torkkola. Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3:1415–1438, March 2003.
  • I. Tsamardinos, C. Aliferis, and A. Statnikov. Algorithms for large scale Markov blanket discovery. In The 16th International FLAIRS Conference, St. Augustine, Florida, USA, 2003.
  • I. Tsamardinos and C.F. Aliferis. Towards principled feature selection: Relevancy, filters and wrappers. In Proceedings of the Workshop on Artificial Intelligence and Statistics, 2003.
  • E. Tuv. Feature selection and ensemble learning. In I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature Extraction, Foundations and Applications. Springer, New York, 2005.
  • N. Vasconcelos. Feature selection by maximum marginal diversity: optimality and implications for visual recognition. In Proc. IEEE Conf. on CVPR, pages 762–772, Madison, WI, USA, 2003.
  • D.R. Wolf and E.I. George. Maximally informative statistics. In José M. Bernardo, editor, Bayesian Methods in the Sciences. Real Academia de Ciencias, Madrid, Spain, 1999.
  • D.H. Wolpert and D.R. Wolf. Estimating functions of distributions from a finite set of samples. Phys. Rev. E, 52(6):6841–6854, 1995.
  • E.P. Xing, M.I. Jordan, and R.M. Karp. Feature selection for high-dimensional genomic microarray data. In Proc. 18th International Conf. on Machine Learning, pages 601–608. Morgan Kaufmann, San Francisco, CA, 2001.
  • Y. Yang and J.O. Pedersen. A comparative study on feature selection in text categorization. In Proc. 14th International Conference on Machine Learning, pages 412–420. Morgan Kaufmann, 1997.
  • L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In ICML’03, Washington, D.C., 2003.
  • M. Zaffalon and M. Hutter. Robust feature selection by mutual information distributions. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pages 577–584, San Francisco, 2002. Morgan Kaufmann.


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

Cite this chapter

Torkkola, K. (2008). Information-Theoretic Methods. In: Guyon, I., Nikravesh, M., Gunn, S., Zadeh, L.A. (eds) Feature Extraction. Studies in Fuzziness and Soft Computing, vol 207. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-35488-8_7

  • DOI: https://doi.org/10.1007/978-3-540-35488-8_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-35487-1

  • Online ISBN: 978-3-540-35488-8
