Inter-Category Distribution Enhanced Feature Extraction for Efficient Text Classification

  • Yuming Wang
  • Jun Huang
  • Yun Liu
  • Lai Tu
  • Ling Liu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10968)

Abstract

Text data is one of the dominant data types in Big Data driven services and applications. The performance of text classification largely depends on the quality of feature extraction over the text corpus. For supervised learning over text documents, the TF-IDF (Term Frequency-Inverse Document Frequency) weighting factor is one of the most frequently used features in text classification. In this paper, we address two known limitations of the TF-IDF based feature extraction method. First, the conventional TF-IDF weighting factor does not consider the synonymous relationship between feature terms. Second, for a big corpus with a large number of text documents and a large number of feature terms, the computational complexity of text classification increases with the dimensionality of the feature space. We address these problems by introducing an optimization technique based on the Inter-Category Distributions (ICD) of terms and the Inter-Category Distributions of documents. We call this new weighting factor TF-IDF-ICD, namely TF-IDF with Inter-Category Distributions. To further enhance the effectiveness of our TF-IDF-ICD method, we describe a TF-IDF-ICD threshold based Dimensionality Reduction (DR) optimization. We test the text classifier with a corpus of 10,000 articles. The evaluation results show that the proposed TF-IDF-ICD based text classification method outperforms the conventional TF-IDF based classification solution by \(7.84\%\) while using only about \(43.19\%\) of the training time required by the conventional TF-IDF based text classification methods.
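
For orientation, the conventional TF-IDF baseline and a threshold based dimensionality reduction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the ICD factor, so the sketch uses plain TF-IDF weights only. The function names `tf_idf` and `reduce_dimensions`, the toy corpus, and the threshold value 0.2 are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses the textbook form tf * log(N / df); this is an assumption,
    since the abstract does not state which TF-IDF variant is used.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def reduce_dimensions(weights, threshold):
    """Keep only terms whose weight reaches the threshold in some document."""
    kept = {t for w in weights for t, v in w.items() if v >= threshold}
    reduced = [{t: v for t, v in w.items() if t in kept} for w in weights]
    return reduced, kept


# Toy corpus (hypothetical): two finance documents and one sports document.
docs = [
    ["stock", "market", "rises"],
    ["stock", "market", "falls"],
    ["team", "wins", "match"],
]
weights = tf_idf(docs)
reduced, vocabulary = reduce_dimensions(weights, threshold=0.2)
```

In this toy corpus, terms shared by two documents ("stock", "market") receive a low weight and fall below the threshold, so the feature space shrinks from seven terms to five. A weighting factor that also accounts for how terms are distributed across categories, as TF-IDF-ICD is described to do, would replace the plain weight in `tf_idf`.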

Keywords

  • TF-IDF
  • Feature extraction
  • Text classification
  • Inter-Category Distribution (ICD)
  • Dimensionality reduction

Notes

Acknowledgement

The authors from Huazhong University of Science and Technology, Wuhan, China, are supported by the Chinese university Social sciences Data Center (CSDC) construction projects (2017–2018) from the Ministry of Education, China. The first author, Dr. Yuming Wang, is currently a visiting scholar at the School of Computer Science, Georgia Institute of Technology, funded by China Scholarship Council (CSC) for the visiting period of one year from December 2017 to December 2018. Prof. Ling Liu’s research is partially supported by the USA National Science Foundation CISE grant 1564097 and an IBM faculty award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Yuming Wang (1, 2)
  • Jun Huang (1)
  • Yun Liu (1)
  • Lai Tu (1)
  • Ling Liu (2)
  1. School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China
  2. School of Computer Science, College of Computing, Georgia Institute of Technology, Atlanta, USA