Abstract
This paper describes several known and some new methods for feature subset selection on large text data. An experimental comparison on real-world data collected from Web users shows that the characteristics of the problem domain and of the machine learning algorithm should be considered when a feature scoring measure is selected. Our problem domain consists of hyperlinks given in the form of small documents represented as word vectors. In our learning experiments, a naive Bayesian classifier was used on the text data. The best performance was achieved by feature selection methods based on the feature scoring measure called Odds ratio, which is known from information retrieval.
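As an illustration of the kind of feature scoring the abstract refers to, the following is a minimal sketch of odds-ratio scoring for a two-class problem, using the commonly cited form OddsRatio(w) = log [ P(w|pos) (1 - P(w|neg)) / ((1 - P(w|pos)) P(w|neg)) ]. It is not taken from the paper: the function names `odds_ratio_scores` and `select_top_k`, the estimation of probabilities from document frequencies, and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter


def odds_ratio_scores(docs, labels):
    """Score each word by the log odds ratio between the two classes.

    docs   -- list of token lists (one list per small document)
    labels -- list of bools, True for the positive class
    Probabilities are estimated from document frequencies (an assumption).
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one document of each class")

    # Count, per class, the number of documents that contain each word.
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos_df if y else neg_df).update(set(tokens))

    scores = {}
    for w in set(pos_df) | set(neg_df):
        # Add-one smoothing (an assumption) keeps both estimates strictly
        # inside (0, 1), so the ratio and its logarithm are always defined.
        p_pos = (pos_df[w] + 1) / (n_pos + 2)
        p_neg = (neg_df[w] + 1) / (n_neg + 2)
        scores[w] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring words, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


# Toy data, invented purely for illustration.
docs = [
    ["machine", "learning"],
    ["learning", "text"],
    ["football", "score"],
    ["football", "text"],
]
labels = [True, True, False, False]
scores = odds_ratio_scores(docs, labels)
selected = select_top_k(scores, 2)
```

Under this scoring, words that occur mostly in positive documents receive large positive scores, words that are equally common in both classes score near zero, and words typical of the negative class score below zero. Keeping only the top-ranked words therefore favors features characteristic of the positive class.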
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mladenić, D. (1998). Feature subset selection in text-learning. In: Nédellec, C., Rouveirol, C. (eds) Machine Learning: ECML-98. ECML 1998. Lecture Notes in Computer Science, vol 1398. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026677
DOI: https://doi.org/10.1007/BFb0026677
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64417-0
Online ISBN: 978-3-540-69781-7
eBook Packages: Springer Book Archive