Automatic Extraction of Domain-Specific Stopwords from Labeled Documents

  • Masoud Makrehchi
  • Mohamed S. Kamel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4956)


Automatic extraction of domain-specific stopword list from a large labeled corpus is discussed. Most researches remove the stopwords using a standard stopword list, and high and low document frequencies. In this paper, a new approach for stopword extraction based on the notion of backward filter level performance and sparsity measure of training data, is proposed. First, we discuss the motivation for updating existing lists or building new ones. Second, based on the proposed backward filter-level performance, we examine the effectiveness of high document frequency filtering for stopword reduction. Finally, a new method for building general and domain-specific stopwords is proposed. The method assumes that a set of candidate stopwords must have minimum information content and prediction capacity, which can be estimated by a classifier performance. The proposed approach is extensively compared with other methods including inverse document frequency and information gain. According to the comparative study, the proposed approach offers more promising results, which guarantee minimum information loss by filtering out most stopwords.


Text Mining Text Categorization Automatic Extraction Document Frequency Feature Ranking 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Chen, A., Gey, F.C.: Building an Arabic stemmer for information retrieval. In: TREC (2002)Google Scholar
  2. 2.
    Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S.: Learning to extract symbolic knowledge from the world wide web. In: Proceedings of the 15th National Conference on Artificial Intelligence (AAAI 1998), pp. 509–516 (1998)Google Scholar
  3. 3.
    Crow, D., De Santo, J.: A hybrid approach to concept extraction and recognition-based matching in the domain of human resources. In: ICTAI, pp. 535–539 (2004)Google Scholar
  4. 4.
    Forman, G.: An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3, 1289–1305 (2003)zbMATHCrossRefGoogle Scholar
  5. 5.
    Forman, G.: A pitfall and solution in multi-class feature selection for text classification. In: Proceedings of ICML 2004, Twenty-first international conference on Machine learning, pp. 297–304 (2004)Google Scholar
  6. 6.
    Hayes, J.H., Dekhtyar, A., Sundaram, S.: Text mining for software engineering: how analyst feedback impacts final results. In: MSR 2005: Proceedings of the 2005 international workshop on Mining software repositories, pp. 1–5. ACM Press, New York (2005)CrossRefGoogle Scholar
  7. 7.
    Joachims, T.: A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Fisher, D.H. (ed.) Proceedings of ICML 1997, 14th International Conference on Machine Learning, Nashville, US, pp. 143–151. Morgan Kaufmann Publishers, San Francisco (1997)Google Scholar
  8. 8.
    Kawahara, M., Kawano, H.: Mining association algorithm with threshold based on roc analysis. In: Proceedings of the 34th Annual Hawaii International Conference on System Sciences (HICSS-34), vol. 3, pp. 3010–3017. IEEE Computer Society, Los Alamitos (2001)Google Scholar
  9. 9.
    Koo, S.O., Lim, S.Y., Lee, S.-J.: Building an ontology based on hub words for information retrieval. In: Web Intelligence, pp. 466–469 (2003)Google Scholar
  10. 10.
    Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research 5, 361–397 (2004)Google Scholar
  11. 11.
    Liu, T., Liu, S., Chen, Z., Ma, W.-Y.: An evaluation on feature selection for text clustering. In: Proceedings of ICML 2003, pp. 488–495 (2003)Google Scholar
  12. 12.
    Lo, R.T., He, B., Ounis, I.: Automatically building a stopword list for an information retrieval system. The Journal on Digital Information Management: special issue on the 5th Dutch-Belgian Information Retrieval Workshop (DIR 2005) 3(1), 3–8 (2005)Google Scholar
  13. 13.
    Maletic, J.I., Valluri, N.: Automatic software clustering via latent semantic analysis. In: Proceedings 14th IEEE International Conference on Automated Software Engineering (ASE 1999), Cocoa Beach Florida, October 1999, pp. 251–254 (1999)Google Scholar
  14. 14.
    McCallum, A.K., Rosenfeld, R., Mitchell, T.M., Ng, A.Y.: Improving text classification by shrinkage in a hierarchy of classes. In: Shavlik, J.W. (ed.) Proceedings of ICML 1998, 15th International Conference on Machine Learning, Madison, US, pp. 359–367. Morgan Kaufmann Publishers, San Francisco (1998)Google Scholar
  15. 15.
    Petras, V., Perelman, N., Gey, F.C.: UC berkeley at clef-2003 - Russian language experiments and domain-specific retrieval. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds.) CLEF 2003. LNCS, vol. 3237, pp. 401–411. Springer, Heidelberg (2004)Google Scholar
  16. 16.
    Rijsbergen, C.J., Harper, D.J., Porter, M.F.: The selection of good search terms. In: Information Processing and Management, pp. 77–91 (1981)Google Scholar
  17. 17.
    Rogati, M., Yang, Y.: High-performing feature selection for text classification. In: Proceedings of the eleventh international conference on Information and knowledge management, pp. 659–661 (2002)Google Scholar
  18. 18.
    Savoy, J.: A stemming procedure and stopword list for general French corpora. Journal of the American Society for Information Science, 944–952 (1999)Google Scholar
  19. 19.
    Seki, K., Mostafa, J.: An application of text categorization methods to gene ontology annotation. In: SIGIR, pp. 138–145 (2005)Google Scholar
  20. 20.
    Sinka, M.P., Corne, D.W.: Evolving better stoplists for document clustering and web intelligence. Design and application of hybrid intelligent systems, 1015–1023 (2003)Google Scholar
  21. 21.
    Sinka, M.P., Corne, D.W.: Towards modernised and web-specific stoplists for web document analysis. In: Proceedings of the IEEE/WIC International Conference on Web Intelligence, pp. 396–402. IEEE Computer Society, Los Alamitos (2003)CrossRefGoogle Scholar
  22. 22.
    Van Rijsbergen, C.J.: Information Retrieval, 2nd edn., Dept. of Computer Science, University of Glasgow (1979)Google Scholar
  23. 23.
    Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Fisher, D.H. (ed.) Proceedings of ICML 1997, 14th International Conference on Machine Learning, Nashville, US, pp. 412–420. Morgan Kaufmann Publishers, San Francisco (1997)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Masoud Makrehchi
    • 1
  • Mohamed S. Kamel
    • 1
  1. 1.Pattern Analysis and Machine Intelligence Lab, Department of Electrical and Computer EngineeringUniversity of WaterlooWaterlooCanada

Personalised recommendations