Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization

Galavotti, Luigi; Sebastiani, Fabrizio; Simi, Maria

doi:10.1007/3-540-45268-0_6

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1923)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

We tackle two different problems of text categorization (TC), namely feature selection and classifier induction. Feature selection (FS) refers to the activity of selecting, from the set of r distinct features (i.e. words) occurring in the collection, the subset of r′ ≪ r features that are most useful for compactly representing the meaning of the documents. We propose a novel FS technique, based on a simplified variant of the χ² statistic. Classifier induction refers instead to the problem of automatically building a text classifier by learning from a set of documents pre-classified under the categories of interest. We propose a novel variant of the well-known k-NN method, based on the exploitation of negative evidence. We report the results of systematic experiments with these two methods on the standard Reuters-21578 benchmark.
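As an illustration of the feature selection step described above, here is a minimal Python sketch that scores each term against each category and keeps the r′ best-scoring terms. The abstract does not give the formula of the simplified statistic, so `simplified_score` is an assumption: it keeps only the unsquared numerator term of χ² and drops the normalising factors. The function names, the toy documents, and the choice of taking the maximum score over categories are all hypothetical.

```python
def joint_probs(docs, labels, term, category):
    """Estimate P(t,c), P(t,not c), P(not t,c), P(not t,not c) from document counts."""
    n = len(docs)
    tc = sum(1 for d, l in zip(docs, labels) if term in d and category in l)
    t_nc = sum(1 for d, l in zip(docs, labels) if term in d and category not in l)
    nt_c = sum(1 for d, l in zip(docs, labels) if term not in d and category in l)
    nt_nc = n - tc - t_nc - nt_c
    return tc / n, t_nc / n, nt_c / n, nt_nc / n


def chi_square(docs, labels, term, category):
    """Standard chi-square statistic for term/category dependence."""
    n = len(docs)
    p_tc, p_tnc, p_ntc, p_ntnc = joint_probs(docs, labels, term, category)
    p_t = p_tc + p_tnc
    p_c = p_tc + p_ntc
    denom = p_t * (1 - p_t) * p_c * (1 - p_c)
    if denom == 0:
        return 0.0
    return n * (p_tc * p_ntnc - p_tnc * p_ntc) ** 2 / denom


def simplified_score(docs, labels, term, category):
    """Assumed simplification: unsquared numerator of chi-square, no normalisation."""
    p_tc, p_tnc, p_ntc, p_ntnc = joint_probs(docs, labels, term, category)
    return p_tc * p_ntnc - p_tnc * p_ntc


def select_features(docs, labels, categories, r_prime, score=simplified_score):
    """Keep the r_prime terms with the highest score over any category."""
    vocabulary = set().union(*docs)
    ranked = sorted(
        vocabulary,
        key=lambda t: (-max(score(docs, labels, t, c) for c in categories), t),
    )
    return ranked[:r_prime]


# Toy collection: each document is a set of words, each label a set of categories.
docs = [{"wheat", "farm"}, {"wheat", "export"}, {"oil", "export"}, {"oil", "price"}]
labels = [{"grain"}, {"grain"}, {"crude"}, {"crude"}]
print(select_features(docs, labels, ["grain", "crude"], r_prime=2))
```

On this toy collection the word "export" occurs equally often inside and outside each category, so it receives a score of zero and is discarded, while "wheat" and "oil" are kept. Unlike χ², the unsquared score is signed, so a term that is negatively correlated with a category is ranked low for that category rather than high.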

We assume here that a document d_j can belong to zero, one, or many of the categories in C; this assumption holds for the Reuters-21578 benchmark we use for our experiments. All the techniques we discuss here can be straightforwardly adapted to the other case, in which each document belongs to exactly one category.
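Under the multi-label assumption above, each category can be decided independently of the others. The following Python sketch shows one way a k-NN classifier could exploit negative evidence in that setting: among the k training documents most similar to the test document, those belonging to the category add their similarity to the score, and those not belonging to it subtract theirs. This is an assumed reading of "negative evidence", not the authors' exact formulation; the cosine similarity, the function names, and the zero decision threshold are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_score(query, training, category, k, negative_evidence=True):
    """Score a category from the k nearest training documents.

    Neighbours in the category contribute +similarity; with negative evidence
    enabled, neighbours outside the category contribute -similarity.
    """
    neighbours = sorted(
        ((cosine(query, vec), cats) for vec, cats in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    score = 0.0
    for sim, cats in neighbours:
        if category in cats:
            score += sim
        elif negative_evidence:
            score -= sim
    return score


def classify(query, training, categories, k, threshold=0.0, negative_evidence=True):
    """Return every category whose score exceeds the threshold (zero, one or many)."""
    return {
        c
        for c in categories
        if knn_score(query, training, c, k, negative_evidence) > threshold
    }


# Toy training set: (term-weight vector, set of categories) pairs.
training = [
    ({"wheat": 1.0}, {"grain"}),
    ({"wheat": 1.0, "export": 1.0}, {"grain", "trade"}),
    ({"oil": 1.0}, {"crude"}),
]
query = {"wheat": 1.0}
categories = ["grain", "trade", "crude"]
print(classify(query, training, categories, k=3))
print(classify(query, training, categories, k=3, negative_evidence=False))
```

In this toy case the closest neighbour is not labelled "trade", so with negative evidence the category is rejected, whereas the purely positive variant accepts it on the strength of the second neighbour alone.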

Author information

Authors and Affiliations

AUTON S.R.L., Via Jacopo Nardi, 2 - 50132, Firenze, Italy
Luigi Galavotti
Consiglio Nazionale delle Ricerche, Istituto di Elaborazione dell’Informazione, 56100, Pisa, Italy
Fabrizio Sebastiani
Dipartimento di Informatica, Università di Pisa, 56125, Pisa, Italy
Maria Simi

Editor information

Editors and Affiliations

National Library of Portugal, Campo Grande, 83, 1749-081, Lisboa, Portugal
José Borbinha
GMD Library, Schloss Birlinghoven, 53754, Sankt Augustin, Germany
Thomas Baker

About this paper

Cite this paper

Galavotti, L., Sebastiani, F., Simi, M. (2000). Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization. In: Borbinha, J., Baker, T. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2000. Lecture Notes in Computer Science, vol 1923. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45268-0_6

DOI: https://doi.org/10.1007/3-540-45268-0_6
Published: 17 November 2000
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41023-2
Online ISBN: 978-3-540-45268-3
eBook Packages: Springer Book Archive
