Effectively Classifying Short Texts via Improved Lexical Category and Semantic Features

  • Huifang Ma
  • Runan Zhou
  • Fang Liu
  • Xiaoyong Lu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9771)


Classifying short texts is challenging because their feature representations are severely sparse and high-dimensional. In this paper, we propose a novel approach that classifies short texts using both lexical and semantic features. First, a term dictionary is constructed by selecting, as lexical features, the words that are most representative of each category. Then the optimal topic distribution is extracted from a background knowledge repository via Latent Dirichlet Allocation. A new feature representation for each short text is constructed from these two sources. Experimental results show that our method significantly improves the quality of short text classification.
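The pipeline outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scoring rule for representative words (in-category share of a word's occurrences), every function name, and the hand-written topic-word table standing in for a Latent Dirichlet Allocation model trained on a background corpus are all hypothetical. The topic step is a simple normalised lookup, not true LDA inference.

```python
from collections import Counter, defaultdict


def build_term_dictionary(docs, labels, top_k=2):
    """Select top_k words per category (assumed score: the share of a
    word's occurrences that fall inside the category; ties are broken
    by in-category count, then alphabetically)."""
    per_cat = defaultdict(Counter)
    total = Counter()
    for tokens, label in zip(docs, labels):
        per_cat[label].update(tokens)
        total.update(tokens)
    dictionary = []
    for label in sorted(per_cat):
        counts = per_cat[label]
        ranked = sorted(
            counts, key=lambda w: (-counts[w] / total[w], -counts[w], w)
        )
        for word in ranked[:top_k]:
            if word not in dictionary:
                dictionary.append(word)
    return dictionary


def lexical_features(tokens, dictionary):
    """Count how often each dictionary term occurs in the short text."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in dictionary]


def topic_features(tokens, topic_word):
    """Approximate a topic distribution by summing, per topic, the
    probabilities of the text's words and normalising. topic_word is a
    hypothetical list of {word: probability} dicts, one per topic."""
    weights = [sum(topic.get(t, 0.0) for t in tokens) for topic in topic_word]
    norm = sum(weights)
    if norm == 0:
        # No known words: fall back to a uniform distribution.
        return [1.0 / len(topic_word)] * len(topic_word)
    return [w / norm for w in weights]


def combined_features(tokens, dictionary, topic_word):
    """Concatenate lexical and topic features into one vector."""
    return lexical_features(tokens, dictionary) + topic_features(tokens, topic_word)


# Toy labelled short texts and a toy two-topic table (both assumptions).
train_docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["stock", "market", "price"],
    ["price", "bank", "stock"],
]
train_labels = ["sports", "sports", "finance", "finance"]
toy_topics = [{"goal": 0.5, "team": 0.5}, {"stock": 0.5, "price": 0.5}]

term_dictionary = build_term_dictionary(train_docs, train_labels, top_k=2)
vector = combined_features(["team", "goal", "win"], term_dictionary, toy_topics)
```

The resulting vector could then be passed to any standard classifier. Concatenation is only one way to combine the two feature groups and is chosen here for simplicity.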


Short text classification; Latent Dirichlet allocation; Lexical features; Semantic features; Optimal topic distribution



This work is supported by the National Natural Science Foundation of China (No. 61363058), the Youth Science and Technology Support Program of Gansu Province (145RJZA232, 145RJYA259), the 2016 Undergraduate Innovation Capacity Enhancement Program, and the 2016 annual public record open space Fund Project (1505JTCA007).



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. College of Computer Science and Engineering, Northwest Normal University, Lanzhou, China
