A Term Weighting Scheme Approach for Vietnamese Text Classification

  • Vu Thanh Nguyen
  • Nguyen Tri Hai
  • Nguyen Hoang Nghia
  • Tuan Dinh Le
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9446)

Abstract

The term weighting scheme, which converts documents into vectors in the term space, is a vital step in automatic text categorization. Previous studies have shown that the choice of term weighting scheme largely determines classification performance. Term weighting has been studied extensively for English text classification, but few studies have addressed Vietnamese text classification. In this paper, we propose a term weighting scheme called normalize(tf.rf_max), which is based on tf.rf, one of the most effective term weighting schemes to date. We conducted experiments comparing the proposed normalize(tf.rf_max) scheme with tf.rf and tf.idf on a Vietnamese text classification benchmark. The results show that the proposed scheme achieves about 3% to 5% higher accuracy than the other term weighting schemes.
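For orientation, the two baseline schemes named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It uses the definitions commonly given in the literature: tf.idf as raw term frequency times log(N / df), and tf.rf with rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category training documents containing the term. The function names and the toy counts are hypothetical. The proposed normalize(tf.rf_max) scheme is not shown because its definition does not appear in this text.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, doc_freq, n_docs):
    """tf.idf weights for one document: tf * log(N / df).

    doc_freq maps each term to the number of documents containing it
    (assumed to cover every term in doc_tokens).
    """
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}


def rf(a, c):
    """Relevance frequency: log2(2 + a / max(1, c)).

    a: positive-category documents containing the term
    c: negative-category documents containing the term
    """
    return math.log2(2 + a / max(1, c))


def tf_rf(doc_tokens, pos_df, neg_df):
    """tf.rf weights for one document with respect to one category."""
    tf = Counter(doc_tokens)
    return {t: tf[t] * rf(pos_df.get(t, 0), neg_df.get(t, 0)) for t in tf}


# Toy example with made-up counts (hypothetical, for illustration only).
doc = ["bóng_đá", "bóng_đá", "kinh_tế"]
print(tf_idf(doc, {"bóng_đá": 1, "kinh_tế": 4}, n_docs=4))
print(tf_rf(doc, pos_df={"bóng_đá": 6}, neg_df={"bóng_đá": 1, "kinh_tế": 3}))
```

Unlike idf, rf depends on the category labels: a term concentrated in the positive category receives a larger weight, while a term absent from it falls back to the minimum factor of 1.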

Keywords

Term weighting scheme, Vietnamese text classification, tf.idf, tf.rf

Notes

Acknowledgment

This research is funded by Vietnam National University, Ho Chi Minh City (VNU-HCM) under grant number C2014-26-04.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Vu Thanh Nguyen (1)
  • Nguyen Tri Hai (1)
  • Nguyen Hoang Nghia (1)
  • Tuan Dinh Le (2)
  1. University of Information Technology, VNU-HCM, Ho Chi Minh, Vietnam
  2. Long An University of Economics and Industry, Tan An City, Vietnam
