An Extended Document Frequency Metric for Feature Selection in Text Categorization

  • Conference paper
Information Retrieval Technology (AIRS 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4993)

Abstract

Feature selection plays an important role in text categorization. Many sophisticated feature selection methods have been proposed, such as Information Gain (IG), Mutual Information (MI), and the χ² statistic (CHI). Compared with these methods, however, a very simple technique called Document Frequency thresholding (DF) has been shown to be one of the best on both Chinese and English text data. One problem is that DF is usually regarded as an empirical approach and does not take the Term Frequency (TF) factor into account. In this paper, we put forward an extended DF method called TFDF, which incorporates the TF factor. Experimental results on the Reuters-21578 and OHSUMED corpora show that TFDF performs much better than the original DF method.
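
As a rough illustration of the baseline discussed in the abstract, the following is a minimal sketch of plain DF thresholding, assuming documents are already tokenized. The function names, the `min_df` parameter, and the toy documents are hypothetical and are not taken from the paper. The TFDF formula itself is not given in the abstract, so it is not reproduced here.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        # set() makes each document contribute at most 1 per term,
        # so repeated occurrences inside a document are ignored.
        df.update(set(tokens))
    return df


def select_by_df(docs, min_df=2):
    """Keep terms whose document frequency is at least min_df (hypothetical threshold)."""
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if count >= min_df)


# Toy corpus (made up for illustration only).
docs = [
    ["oil", "price", "oil", "rise"],
    ["oil", "export", "price"],
    ["wheat", "export", "wheat", "wheat"],
]

print(select_by_df(docs, min_df=2))
```

In this toy corpus, "wheat" occurs three times but in only one document, so its document frequency is 1 and it is dropped. This is the kind of within-document frequency information that plain DF ignores and that the abstract says TFDF is meant to bring in.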


Editor information

Hang Li, Ting Liu, Wei-Ying Ma, Tetsuya Sakai, Kam-Fai Wong, Guodong Zhou

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Xu, Y., Wang, B., Li, J., Jing, H. (2008). An Extended Document Frequency Metric for Feature Selection in Text Categorization. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_8

  • DOI: https://doi.org/10.1007/978-3-540-68636-1_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68633-0

  • Online ISBN: 978-3-540-68636-1

  • eBook Packages: Computer Science, Computer Science (R0)
