Efficient Text Classification Using Term Projection

  • Conference paper
Information Retrieval Technology (AIRS 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5839)

Abstract

In this paper, we propose an efficient text classification method based on term projection. First, we use a modified χ² statistic to project terms onto predefined categories, which is more efficient than other clustering methods. We then use the resulting term clusters as features to represent documents, and perform classification either in a rule-based manner or with an SVM. Experimental results show that our modified χ² feature selection method outperforms the traditional χ² statistic, especially at lower dimensionalities, and that our method is more efficient than Latent Semantic Analysis (LSA) on a homogeneous dataset. We can also reduce the feature dimensionality by three orders of magnitude, saving training and testing cost while maintaining comparable accuracy. Moreover, with a small training set we obtain an improvement of approximately 4.3% on a heterogeneous dataset compared with the traditional method, which indicates that our method has better generalization capability.
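The pipeline described in the abstract (project terms onto categories, use the resulting clusters as features, then classify) can be illustrated with a minimal sketch. The abstract does not specify how the χ² statistic is modified, so the code below uses the standard 2×2 χ² statistic as a stand-in and keeps only positively associated term/category pairs; that restriction, the argmax assignment rule, and the names `project_terms`, `represent`, and `classify_by_rule` are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Standard 2x2 chi-square score for one term/category pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def project_terms(docs, labels):
    """Map each term to the category with its highest chi-square score.

    Assumption: only positive associations (a*d > b*c) are considered,
    and terms with no positive association are left unprojected.
    """
    categories = sorted(set(labels))
    n_docs = len(docs)
    cat_size = Counter(labels)
    df = Counter()                 # term -> document frequency
    df_cat = defaultdict(Counter)  # term -> category -> document frequency
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_cat[term][label] += 1

    projection = {}
    for term in df:
        best_cat, best_score = None, 0.0
        for cat in categories:
            a = df_cat[term][cat]
            b = df[term] - a
            c = cat_size[cat] - a
            d = n_docs - a - b - c
            if a * d <= b * c:
                continue  # term is not positively associated with cat
            score = chi_square(a, b, c, d)
            if score > best_score:
                best_cat, best_score = cat, score
        if best_cat is not None:
            projection[term] = best_cat
    return projection


def represent(tokens, projection, categories):
    """Represent a document as one count per category cluster."""
    counts = Counter(projection[t] for t in tokens if t in projection)
    return [counts[cat] for cat in categories]


def classify_by_rule(tokens, projection, categories):
    """Rule-based decision: pick the cluster with the largest count.

    Ties (including an all-zero vector) fall back to the first category.
    """
    vec = represent(tokens, projection, categories)
    return categories[vec.index(max(vec))]
```

With this representation a document has one feature per category rather than one per vocabulary term, which is the kind of dimensionality reduction the abstract refers to. The vectors returned by `represent` could equally be passed to an SVM instead of the rule-based decision.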

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zheng, Y., Liu, Z., Teng, S., Sun, M. (2009). Efficient Text Classification Using Term Projection. In: Lee, G.G., et al. Information Retrieval Technology. AIRS 2009. Lecture Notes in Computer Science, vol 5839. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04769-5_20

  • DOI: https://doi.org/10.1007/978-3-642-04769-5_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04768-8

  • Online ISBN: 978-3-642-04769-5

  • eBook Packages: Computer Science, Computer Science (R0)
