Abstract
The notion of relevance is used in many technical fields. In machine learning and data mining, for example, relevance is frequently used as a measure in feature subset selection (FSS). In previous studies the interpretation of relevance has varied, and its connection to FSS has been loose. In this paper a rigorous mathematical formalism is proposed for relevance, one that is quantitative and normalized. To apply the formalism to FSS, a characterization of FSS is proposed: preservation of learning information and minimization of joint entropy. Based on this characterization, a tight connection between relevance and FSS is established: FSS amounts to maximizing the relevance of the features to the decision attribute and the relevance of the decision attribute to the features. This connection is then used to design an FSS algorithm that is linear in the number of instances and quadratic in the number of features. The algorithm is evaluated on 23 public datasets, improving prediction accuracy on 16 of them and losing accuracy on only 1. This provides evidence that both the formalism and its connection to FSS are sound.
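As a rough illustration of the connection the abstract describes, the sketch below scores discrete features by a normalized information measure and grows a feature subset greedily, preferring lower joint entropy among ties. This is a minimal sketch under assumptions: relevance is taken here as mutual information divided by the entropy of the decision attribute, which may differ from the paper's exact formalism, and the function names (entropy, relevance, select_features) and greedy search are illustrative, not the authors' algorithm.

```python
# Sketch of an information-theoretic feature-subset selector in the spirit
# of the abstract: keep features that preserve the information the data
# carries about the decision attribute while keeping joint entropy low.
# ASSUMPTION: relevance here is I(X;Y)/H(Y); the paper's exact normalized
# relevance measure may be defined differently.
from collections import Counter
from math import log2

def entropy(columns):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log2(c / n) for c in Counter(joint).values())

def relevance(xs, y):
    """Normalized relevance of feature columns `xs` to decision column `y`:
    I(X;Y) / H(Y), the fraction of Y's uncertainty that X removes."""
    h_y = entropy([y])
    if h_y == 0:
        return 1.0  # a constant decision attribute is trivially predicted
    return (h_y + entropy(xs) - entropy(xs + [y])) / h_y

def select_features(X, y):
    """Greedy forward selection: grow the subset while relevance to y
    improves, breaking ties in favour of lower joint entropy. Each entropy
    call scans the instances once, and candidates are evaluated over pairs
    of features, roughly matching the linear-in-instances,
    quadratic-in-features cost the abstract reports."""
    remaining = list(range(len(X)))
    chosen, best = [], 0.0
    while remaining:
        scored = [(relevance([X[j] for j in chosen + [i]], y),
                   -entropy([X[j] for j in chosen + [i]]), i)
                  for i in remaining]
        r, _, i = max(scored)
        if r <= best:
            break  # no candidate adds information about y
        best = r
        chosen.append(i)
        remaining.remove(i)
    return chosen

# Toy usage: feature 0 determines y, feature 1 is noise.
X = [[0, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 1]]
y = [0, 0, 1, 1, 0, 1]
print(select_features(X, y))  # -> [0]
```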
Cite this article
Bell, D.A., Wang, H. A Formalism for Relevance and Its Application in Feature Subset Selection. Machine Learning 41, 175–195 (2000). https://doi.org/10.1023/A:1007612503587