Open-categorical text classification based on multi-LDA models

  • Focus
Soft Computing

Abstract

We present a new and realistic problem, open-categorical text classification, in which documents must be classified without a categorization system known beforehand. To solve this problem, we propose a novel approach that constructs the categorization system and classifies documents using multiple latent Dirichlet allocation (multi-LDA) models. We cluster topics and extract topical keywords to support category annotation. Subsequently, the LDA models are applied together to predict the categories of documents. Our result, a macro-averaged F1 measure of 84.02%, outperforms state-of-the-art supervised and semi-supervised text classification methods.
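For readers unfamiliar with the reported metric, the following is a minimal sketch of a macro-averaged F1 computation for single-label classification. It assumes the common definition (the unweighted mean of per-category F1 scores over every category that appears in the gold or predicted labels); the abstract does not state which variant was used. The function name `macro_f1` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Unweighted mean of per-category F1 scores (assumed definition)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct prediction for category g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true category but was missed

    categories = set(gold) | set(predicted)
    scores = []
    for c in categories:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy labels, not data from the paper.
gold = ["tea", "tea", "wine", "wine", "games"]
predicted = ["tea", "wine", "wine", "wine", "games"]
print(round(macro_f1(gold, predicted), 4))
```

Because every category contributes equally to the mean, small categories weigh as much as large ones, which matters when the categorization system has many fine-grained categories.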

Notes

  1. http://gibbslda.sourceforge.net/.

  2. http://www.wechat.com/en/.

  3. http://www.ltp-cloud.com/.

  4. The p value is the probability of obtaining a test statistic result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. We use the significance testing method proposed by Zhang et al. (2004).

    Table 4 The performance of document classification
  5. These categories are constructed using our proposed semi-automatic approach based on multi-LDA models. In total, we obtain 83 categories.
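As an illustration of the p value described in Note 4, the sketch below estimates one with paired bootstrap resampling over per-document correctness. This is one common resampling approach; whether it matches the exact procedure of Zhang et al. (2004) is an assumption, and the function name, parameters, and inputs are hypothetical.

```python
import random


def paired_bootstrap_p_value(correct_a, correct_b, n_resamples=2000, seed=0):
    """Fraction of resampled test sets on which system A does not beat system B.

    correct_a and correct_b hold 1 (correct) or 0 (wrong) for each document,
    aligned so that index i refers to the same document in both lists.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")

    rng = random.Random(seed)      # fixed seed keeps the sketch reproducible
    n = len(correct_a)
    not_better = 0
    for _ in range(n_resamples):
        # Draw n documents with replacement, using the same draw for both systems.
        sample = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in sample)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples
```

A small returned value suggests that the advantage of system A is unlikely to be an artifact of the particular test documents.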

References

  • Blei DM, Griffiths TL, Jordan MI, Tenenbaum JB (2003a) Hierarchical topic models and the nested Chinese restaurant process. In: NIPS, vol 16

  • Blei DM, Ng AY, Jordan MI (2003b) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  • Blei DM, McAuliffe JD (2007) Supervised topic models. NIPS 7:121–128

  • Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Proceedings of COLT, pp 92–100

  • Brown PF, Desouza PV, Mercer RL, Della Pietra VJ, Lai JC (1992) Class-based n-gram models of natural language. Comput Linguist 18(4):467–479

  • Carletta J (1996) Assessing agreement on classification tasks: the kappa statistic. Comput Linguist 22(2):249–254

  • Carlson A, Betteridge J, Wang RC, Hruschka Jr ER, Mitchell TM (2010) Coupled semi-supervised learning for information extraction. In: Proceedings of the third ACM international conference on Web search and data mining, pp 101–110

  • Che W, Li Z, Liu T (2010) Ltp: a Chinese language technology platform. In: Coling 2010: demonstrations, pp 13–16

  • Cheng SJ, Huang QC, Liu JF, Tang XL (2013) A novel inductive semi-supervised SVM with graph-based self-training. In: Intelligent science and intelligent data engineering. Springer, Berlin Heidelberg, pp 82–89

  • Collins M, Singer Y (1999) Unsupervised models for named entity classification. In: Proceedings of EMNLP, pp 100–110

  • Danesh A, Moshiri B, Fatemi O (2007) Improve text classification accuracy based on classifier fusion methods. In: 10th international conference on information fusion, pp 1–6

  • Donghui C, Zhijing L (2010) A new text categorization method based on HMM and SVM. In: 2nd international conference on computer engineering and technology (ICCET), vol 7, pp 383–386

  • Fu JH, Lee SL (2012) A multi-class SVM classification system based on learning methods from indistinguishable Chinese official documents. Expert Syst Appl 39(3):3127–3134

  • Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, ACM, pp 50–57

  • Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of ECML-98, 10th European conference on machine learning (Chemnitz, DE), pp 137–142

  • Johnson DE, Oles FJ, Zhang T, Goetz T (2002) A decision-tree-based symbolic rule induction system for text categorization. IBM Syst J 41(3):428–437

  • Kim S-B, Rim H-C, Yook DS, Lim H-S (2002) Effective methods for improving naive Bayes text classifiers. LNAI 2417:414–423

  • Li CH, Park SC (2009) An efficient document classification model using an improved back propagation neural network and singular value decomposition. Expert Syst Appl 36(2):3208–3215

  • Lin Y (2002) Support vector machines and the Bayes rule in classification. Data Min Knowl Discov 6:259–275

  • Mao X-L, Ming Z-Y, Chua T-S, Li S, Yan H, Li X (2012) SSHLDA: a semi-supervised hierarchical topic model. In: Proceedings of EMNLP-CoNLL, pp 800–809

  • McClosky D, Charniak E, Johnson M (2006) Effective self-training for parsing. In: Proceedings of NAACL, pp 152–159

  • Ng HT, Goh WB, Low KL (1997) Feature selection, perceptron learning, and a usability case study for text categorization. In: Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, Philadelphia PA, pp 67–73

  • Petinot Y, McKeown K, Thadani K (2011) A hierarchical model of web summaries. Proc ACL HLT Short Pap Vol 2:670–675

  • Pham DT, Dimov SS, Nguyen CD (2005) Selection of K in K-means clustering. Proc Inst Mech Eng Part C J Mech Eng Sci 219(1):103–109

  • Qin Y-P, Wang X-K (2009) Study on multi-label text classification based on SVM. In: Sixth international conference on fuzzy systems and knowledge discovery, pp 300–304

  • Salton G, Wong A, Yang C-S (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620

  • Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1–47

  • Trappey AJC, Hsu F-C, Trappey CV, Lin C-I (2006) Development of a patent document classification and search platform using a back-propagation network. Expert Syst Appl 31(4):755–765

  • Turian J, Ratinov L, Bengio Y (2010) Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the ACL, pp 384–394

  • Ueffing N (2006) Self-training for machine translation. In: Proceedings of NIPS workshop on machine learning for multilingual information access

  • Vateekul P, Kubat M (2009) Fast induction of multiple decision trees in text categorization from large scale, imbalanced, and multi-label data. In: IEEE international conference on data mining workshops, pp 320–325

  • Yang Y (1999) An evaluation of statistical approaches to text categorization. Inf Retr 1(1–2):69–90

  • Zhang Y, Vogel S, Waibel A (2004) Interpreting BLEU/NIST scores: how much improvement do we need to have a better system. In: Proceedings of the 2004 international conference on language resources and evaluation, pp 2051–2054

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) via Grants 61133012 and 61273321 and by the National 863 Leading Technology Research Project via Grant 2012AA011102. Special thanks to Jianfei Guo and Xiaocheng Feng for their help with the experiments.

Author information

Corresponding author

Correspondence to Ting Liu.

Additional information

Communicated by L. Xie.

Appendix: the categorization system of WeChat subscription accounts

See Note 5.

  • finance and economics

    1. banking institutions
    2. business
    3. financing
    4. insurance
    5. marketing
    6. realty
    7. start-ups

  • shopping

    8. automobile
    9. commodity
    10. decoration
    11. discount shopping
    12. dresses
    13. electronic products
    14. luxuries
    15. online shopping
    16. purchasing agents
    17. sports equipment
    18. wholesale
    19. health care
    20. maternal and infant
    21. nourishing of life
    22. dating

  • communication platform

    23. friends making
    24. job hunting

  • education

    25. art schools
    26. business administration
    27. driving schools
    28. foreign language training
    29. training for study abroad
    30. tutoring

  • military affairs

    31. military affairs

  • science and technology

    32. IT
    33. mobile internet applications

  • media

    34. news media
    35. print media
    36. TV and radio
    37. we-media
    38. cosmetic surgery
    39. hairdressing
    40. skin protection

  • food and drink

    41. green food
    42. restaurants
    43. tea
    44. western-style pastry
    45. wine

  • services for life

    46. air tickets booking
    47. campus
    48. car rental
    49. community
    50. design
    51. emotion
    52. environmental protection
    53. express delivery
    54. homemaking
    55. hot lines
    56. hotel booking
    57. law works
    58. life assistants
    59. lotteries
    60. public good
    61. recharging
    62. tourism
    63. weddings

  • culture

    64. art
    65. culture
    66. originality
    67. popularization of science
    68. reading

  • entertainment

    69. adult entertainment
    70. caricatures
    71. entertainment stars
    72. entertainment venues
    73. fashion
    74. games
    75. image show
    76. jokes
    77. movies
    78. music
    79. pets

  • sports

    80. sports clubs
    81. sports news

  • others

    82. brand
    83. government

About this article

Cite this article

Fu, R., Qin, B. & Liu, T. Open-categorical text classification based on multi-LDA models. Soft Comput 19, 29–38 (2015). https://doi.org/10.1007/s00500-014-1374-x

  • DOI: https://doi.org/10.1007/s00500-014-1374-x
