A Comparison of Term Weighting Schemes for Text Classification and Sentiment Analysis with a Supervised Variant of tf.idf

  • Conference paper
Data Management Technologies and Applications (DATA 2015)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 584))

Abstract

In text analysis tasks such as text classification and sentiment analysis, a careful choice of term weighting scheme can have an important impact on effectiveness. Classic unsupervised schemes are based solely on the distribution of terms across documents, while newer supervised schemes leverage the known membership of training documents in categories; the latter are often specifically tailored to either topic or sentiment classification. We propose here a supervised variant of the well-known tf.idf scheme, in which the idf factor is computed without considering documents within the category under analysis, so that terms appearing frequently only within that category are not penalized. The importance of these terms is further boosted in a second variant inspired by relevance frequency. We performed extensive experiments to compare these novel schemes with known ones, observing top performance in text categorization by topic and satisfactory results in sentiment classification.
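The idea behind the first variant can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' exact formulation: the function names, the toy corpus, the use of raw term counts as tf, the natural logarithm, and the smoothing (`+ 1` in the numerator, `max(1, ...)` in the denominator) are all choices made here for the example. The second, relevance-frequency-inspired variant is not covered.

```python
import math
from collections import Counter


def idf_excluding_category(docs, labels, category):
    """Return {term: idf} computed only from documents outside `category`.

    docs   -- list of token lists (hypothetical pre-tokenized input)
    labels -- list of category labels, parallel to docs
    """
    # Keep only documents that do NOT belong to the category under analysis.
    outside = [set(tokens) for tokens, label in zip(docs, labels)
               if label != category]
    n_outside = len(outside)

    # Document frequency of each term among the outside documents.
    df_outside = Counter(term for doc in outside for term in doc)

    vocabulary = {term for tokens in docs for term in tokens}
    # Assumed smoothing: +1 in the numerator, max(1, df) in the denominator,
    # so a term absent outside the category gets the largest idf.
    return {term: math.log((n_outside + 1) / max(1, df_outside[term]))
            for term in vocabulary}


def tf_idf_vector(tokens, idf):
    """Weight each term of one document by raw term count times idf."""
    counts = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}


# Toy corpus, invented for illustration only.
docs = [["goal", "match", "goal"], ["goal", "team"],
        ["stock", "market"], ["market", "match"]]
labels = ["sport", "sport", "finance", "finance"]

idf_sport = idf_excluding_category(docs, labels, "sport")
weights = tf_idf_vector(docs[0], idf_sport)
```

In this toy corpus, "goal" occurs in every "sport" document and in no other document, so it receives the largest idf for the "sport" category. A classic idf computed over all four documents would instead lower its weight because it appears in half of the collection.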

Notes

  1. http://www.daviddlewis.com/resources/testcollections/reuters21578/.

  2. http://people.csail.mit.edu/jrennie/20Newsgroups/.

  3. http://www.cs.cornell.edu/people/pabo/movie-review-data/.

  4. https://www.cs.jhu.edu/~mdredze/datasets/sentiment/.

Author information

Corresponding author

Correspondence to Roberto Pasolini.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Domeniconi, G., Moro, G., Pasolini, R., Sartori, C. (2016). A Comparison of Term Weighting Schemes for Text Classification and Sentiment Analysis with a Supervised Variant of tf.idf. In: Helfert, M., Holzinger, A., Belo, O., Francalanci, C. (eds) Data Management Technologies and Applications. DATA 2015. Communications in Computer and Information Science, vol 584. Springer, Cham. https://doi.org/10.1007/978-3-319-30162-4_4

  • DOI: https://doi.org/10.1007/978-3-319-30162-4_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30161-7

  • Online ISBN: 978-3-319-30162-4

  • eBook Packages: Computer Science (R0)
