Soft Computing, Volume 19, Issue 1, pp 29–38

Open-categorical text classification based on multi-LDA models

  • Ruiji Fu
  • Bing Qin
  • Ting Liu


We present a new and realistic problem, open-categorical text classification, which requires classifying documents without knowing the categorization system beforehand. To solve this problem, we propose a novel approach that constructs the categorization system and classifies documents based on multiple latent Dirichlet allocation (LDA) models. We cluster topics and extract topical keywords to help category annotation. Subsequently, the LDA models are applied to predict the categories of documents comprehensively. Our result, a macro-averaged F1 measure of 84.02%, outperforms state-of-the-art supervised and semi-supervised text classification methods.
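The macro-averaged F1 measure reported above is the unweighted mean of the per-category F1 scores, so small categories count as much as large ones. The following is a minimal sketch of that standard definition for single-label classification; the function name `macro_f1` and the toy labels are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Macro-averaged F1: per-category F1, then an unweighted mean.

    Hypothetical helper for illustration; assumes one label per document.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct assignment to category g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true category but was missed

    categories = set(gold) | set(predicted)
    scores = []
    for c in categories:
        p_den = tp[c] + fp[c]
        r_den = tp[c] + fn[c]
        precision = tp[c] / p_den if p_den else 0.0
        recall = tp[c] / r_den if r_den else 0.0
        pr_sum = precision + recall
        scores.append(2 * precision * recall / pr_sum if pr_sum else 0.0)

    # Every category contributes equally, regardless of its size.
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with two categories (labels are made up).
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

In this toy example, category "a" has F1 of 2/3 and category "b" has F1 of 4/5, so the macro average is 11/15.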


Topic model, Text classification, Categorization system construction



This work is supported by the National Natural Science Foundation of China (NSFC) via Grants 61133012 and 61273321, and by the National 863 Leading Technology Research Project via Grant 2012AA011102. Special thanks to Jianfei Guo and Xiaocheng Feng for their help with the experiments.



Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. Harbin Institute of Technology, Harbin, People’s Republic of China
