Exploiting Extremely Rare Features in Text Categorization

  • Péter Schönhofen
  • András A. Benczúr
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4212)


One of the first steps in document classification, clustering, and many other information retrieval tasks is to discard words that occur only a few times in the corpus, on the assumption that they contribute little to the bag-of-words representation. However, as we will show, rare n-grams and similar features indicate surprisingly well whether two documents belong to the same category, and can therefore aid classification. In our experiments over four corpora, we found that, with the size of the training set held constant, 5-25% of the test set can be classified essentially for free based on rare features, with no loss of accuracy and in fact an improvement of 0.6-1.6%.
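The idea that a shared rare feature links two documents of the same category can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the feature extraction or the decision rule, so the word bigrams, the `max_df` threshold, the voting scheme, and all function names below are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def word_ngrams(text, n=2):
    """Return the set of word n-grams in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def rare_feature_pairs(docs, n=2, max_df=2):
    """Map each document pair to the rare n-grams the two documents share.

    Assumption: an n-gram is "rare" if it occurs in at least 2 and at most
    max_df documents of the whole collection.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in word_ngrams(text, n):
            postings[gram].add(doc_id)

    pairs = defaultdict(set)
    for gram, ids in postings.items():
        if 2 <= len(ids) <= max_df:
            for a, b in combinations(sorted(ids), 2):
                pairs[(a, b)].add(gram)
    return dict(pairs)


def label_by_rare_features(docs, train_labels, n=2, max_df=2):
    """Label unlabeled documents by voting over shared rare features.

    Each training document votes for its own label, weighted by the number
    of rare n-grams it shares with the unlabeled document (hypothetical rule).
    """
    votes = defaultdict(lambda: defaultdict(int))
    for (a, b), grams in rare_feature_pairs(docs, n, max_df).items():
        for test_id, train_id in ((a, b), (b, a)):
            if test_id not in train_labels and train_id in train_labels:
                votes[test_id][train_labels[train_id]] += len(grams)
    return {doc_id: max(c, key=c.get) for doc_id, c in votes.items()}


# Toy collection: two labeled training documents, two unlabeled test documents.
docs = {
    "train1": "crude oil prices rose sharply in rotterdam spot trading",
    "train2": "the central bank cut interest rates again",
    "test1": "rotterdam spot trading was quiet today",
    "test2": "nothing in common here",
}
train_labels = {"train1": "energy", "train2": "finance"}

result = label_by_rare_features(docs, train_labels)
```

In this toy collection, `test1` shares two rare bigrams with `train1` and so receives the label `energy`, while `test2` shares no rare feature with any training document and would be left for a conventional classifier.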


Keywords: Text Categorization; Inverse Document Frequency; Rare Word; Rare Feature; Document Pair



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Péter Schönhofen (1)
  • András A. Benczúr (1)
  1. Informatics Laboratory, Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest
