Feature subset selection in text-learning

Mladenić, Dunja

doi:10.1007/BFb0026677

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1398)

Included in the following conference series:

European Conference on Machine Learning

Abstract

This paper describes several known and some new methods for feature subset selection on large text data. An experimental comparison on real-world data collected from Web users shows that the characteristics of the problem domain and of the machine learning algorithm should be considered when a feature scoring measure is selected. Our problem domain consists of hyperlinks given in the form of small documents represented as word vectors. In our learning experiments, a naive Bayesian classifier was used on the text data. The best performance was achieved by the feature selection methods based on the feature scoring measure called Odds ratio, which is known from information retrieval.
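
To illustrate feature scoring with Odds ratio, the following is a minimal sketch and not the paper's implementation. It assumes the commonly used log form, log[P(w|pos)(1 - P(w|neg)) / ((1 - P(w|pos)) P(w|neg))], with probabilities estimated from document frequencies. The function names `odds_ratio_scores` and `select_top_k`, the add-one smoothing, and the toy documents are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def odds_ratio_scores(docs, labels, smoothing=1.0):
    """Score each word by its log odds ratio between positive and negative documents.

    docs: list of token lists; labels: list of truthy (positive) / falsy (negative) values.
    The smoothing term is an assumption that keeps every probability strictly
    between 0 and 1, so the logarithm is always defined.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos

    # Document frequency of each word within each class.
    pos_df, neg_df = Counter(), Counter()
    for words, y in zip(docs, labels):
        (pos_df if y else neg_df).update(set(words))

    scores = {}
    for w in set(pos_df) | set(neg_df):
        p_pos = (pos_df[w] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg_df[w] + smoothing) / (n_neg + 2 * smoothing)
        scores[w] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


# Toy data: two positive and two negative "small documents".
docs = [["machine", "learning"], ["learning", "text"],
        ["sports", "news"], ["news", "text"]]
labels = [1, 1, 0, 0]

scores = odds_ratio_scores(docs, labels)
print(select_top_k(scores, 2))
```

On this toy data, words that occur only in positive documents receive the highest scores, while a word that is equally frequent in both classes scores zero. The selected subset could then be used as the vocabulary of a classifier such as naive Bayes.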

Author information

Authors and Affiliations

Department for Intelligent Systems, J.Stefan Institute, Jamova 39, 1100, Ljubljana, Slovenia
Dunja Mladenić

Editor information

Claire Nédellec, Céline Rouveirol

About this paper

Cite this paper

Mladenić, D. (1998). Feature subset selection in text-learning. In: Nédellec, C., Rouveirol, C. (eds) Machine Learning: ECML-98. ECML 1998. Lecture Notes in Computer Science, vol 1398. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026677

DOI: https://doi.org/10.1007/BFb0026677
Published: 16 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64417-0
Online ISBN: 978-3-540-69781-7
eBook Packages: Springer Book Archive
