Multi-label dataless text classification with topic modeling

Abstract

Manually labeling documents is tedious and expensive, but it is essential for training a traditional text classifier. In recent years, a few dataless text classification techniques have been proposed to address this problem. However, existing work focuses mainly on single-label classification, in which each document is restricted to a single category. In this paper, we propose a novel Seed-guided Multi-label Topic Model, named SMTM. With a few seed words relevant to each category, SMTM conducts multi-label classification for a collection of documents without any labeled documents. In SMTM, each category is associated with a single category-topic which covers the meaning of the category. To accommodate multi-label documents, we explicitly model the category sparsity in SMTM by using a spike-and-slab prior together with a weak smoothing prior. As a result, SMTM automatically selects the relevant categories for each document without any threshold tuning. To incorporate the supervision of the seed words, we propose a seed-guided biased GPU (i.e., generalized Pólya urn) sampling procedure to guide the topic inference of SMTM. Experiments on two public datasets show that SMTM achieves better classification accuracy than state-of-the-art alternatives and even outperforms supervised solutions in some scenarios.
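To make the generalized Pólya urn idea concrete, the following is a minimal sketch of a generic GPU-style count update, not the SMTM sampling procedure itself. In a simple Pólya urn, a drawn ball is returned with one extra copy of the same color; in the generalized scheme, drawing a word also adds fractional counts for related words. The function name `gpu_update`, the `related` weight table, and the `seed_boost` parameter are hypothetical names introduced only for this illustration, and the way seed words bias the update here is an assumption rather than the scheme defined in the paper.

```python
from collections import defaultdict


def gpu_update(topic_word_counts, topic, word, related, seed_words=None, seed_boost=0.0):
    """Apply one generalized Polya urn style update to a topic's word counts.

    `related` maps a word to {related_word: promotion_weight}.
    `seed_boost` is a hypothetical extra weight for related words that are
    seed words of `topic`; it stands in for whatever bias a model defines.
    """
    counts = topic_word_counts[topic]
    counts[word] += 1.0  # the sampled word itself gains one count
    for other, weight in related.get(word, {}).items():
        bonus = weight  # related words gain a fractional count
        if seed_words and other in seed_words.get(topic, ()):
            bonus += seed_boost  # assumed bias toward the topic's seed words
        counts[other] += bonus
    return topic_word_counts


# Toy data: all words and weights below are illustrative, not from the paper.
counts = defaultdict(lambda: defaultdict(float))
related = {"senate": {"government": 0.3, "election": 0.3}}
seeds = {"Politics": {"politics", "government"}}

gpu_update(counts, "Politics", "senate", related, seeds, seed_boost=0.2)
print(dict(counts["Politics"]))
```

In this toy run, the seed word "government" receives a larger promotion than the non-seed word "election", which is the intuition behind letting seed words steer topic inference.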

Notes

  1. Category and category-topic are considered equivalent and exchangeable in this work when the context has no ambiguity.

  2. https://github.com/WHUIR/SMTM

  3. http://disi.unitn.it/moschitti/corpora.htm

  4. http://nlp.uned.es/social-tagging/delicioust140/

  5. https://nlp.stanford.edu/software/tmt/tmt-0.4/

  6. NLTK is used to split the documents into sentences.

  7. https://github.com/hsoleimani/MLTM

  8. https://code.google.com/archive/p/word2vec/

Acknowledgements

This research was supported by the National Natural Science Foundation of China (Nos. 61872278, 61502344), the Natural Science Foundation of Hubei Province (No. 2017CFB502), and the Natural Scientific Research Program of Wuhan University (No. 2042017kf0225). Chenliang Li is the corresponding author.

Author information

Corresponding author

Correspondence to Chenliang Li.

Appendix

A. Seed words for evaluation

We manually labeled some seed words for Delicious and Ohsumed based on a standard LDA model. The seed words for Delicious are listed as follows:

Category | Seed words
Politics | Politics, government, political, democracy, senate
Design | Design, css, gallery, designers, designer, graphic
Programming | Programming, php, javascript, python, ruby
java | java, eclipse, tomcat, applet
Reference | Reference
internet | internet, traffic
Computer | Computer, mac, drive, desktop, screen, hardware
Education | Education, students, learning, school, teachers
web | web, html, ajax
Language | Language, languages, French
Science | Science, scientific, brain, scientists, researchers
Writing | Writing, fiction, tales
Culture | Culture, art, music
History | History, collections, historical, ancient
Philosophy | Philosophy, ethics
Books | Books, book, chapter, reading, authors, readers
English | English
Religion | Religion, Christian, church, religious, fathers, testament, Jesus
Grammar | Grammar, idioms, verbs, verb, sentence, clause, punctuation
Style | Style

The seed words for Ohsumed are listed as follows:

Category | Seed words
Bacterial Infections and Mycoses | Bacterial, infections, mycoses, sepsis
Virus Diseases | Virus, viral, measles, herpes, influenza
Parasitic Diseases | Parasite, parasites, malaria, falciparum, leishmaniasis
Neoplasms | Neoplasms, neoplasm, cancer, carcinoma, tumor
Musculoskeletal Diseases | Musculoskeletal, spine, osteomyelitis
Digestive System Diseases | Digestive, gastric, hepatitis, bowel, biliary
Stomatognathic Diseases | Stomatitis, teeth, parotid, periodontal
Respiratory Tract Diseases | Respiratory, lung, pneumonia, bronchial
Otorhinolaryngologic Diseases | Otolaryngologist, ear, hearing, otitis
Nervous System Diseases | Nervous, nerve, neurologic, dementia, neurological
Eye Diseases | Eye, eyes, cataract
Urologic and Male Genital Diseases | Urologic, urological, genital, bladder, prostate, prostatic
Female Genital Diseases and Pregnancy Complications | Genital, pregnancy, endometrial, endometriosis
Cardiovascular Diseases | Cardiovascular, ventricular, heart, cardiac, hypertension
Hemic and Lymphatic Diseases | Lymphadenopathy, anemia, sickle, thrombocytopenia
Neonatal Diseases and Abnormalities | Neonatal, neonates, abnormalities, congenital, anomalies
Skin and Connective Tissue Diseases | Skin, connective, tissue, rheumatoid, psoriasis, dermal
Nutritional and Metabolic Diseases | Nutritional, nutrition, metabolic, glucose, insulin, diabetes, diabetic
Endocrine Diseases | Endocrine, thyroid, parathyroid
Immunologic Diseases | Immunologic, immunodeficiency, leukemia
Disorders of Environmental Origin | Disorders, injuries, trauma, fracture
Animal Diseases | Animal, animals
Pathological Conditions, Signs and Symptoms | Pathological, postoperative
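
As an illustration of how such seed-word lists can be represented and used, the sketch below stores a few of the Delicious categories as a dictionary and assigns every category whose seed words appear in a document. This naive keyword-matching baseline is not SMTM; it only shows the form of the supervision. The names `SEED_WORDS` and `seed_match_labels`, the `min_hits` parameter, the lowercasing of seed words, and the example document are all assumptions made for this sketch.

```python
import re

# A subset of the Delicious seed words listed above, lowercased for matching.
SEED_WORDS = {
    "Politics": ["politics", "government", "political", "democracy", "senate"],
    "Programming": ["programming", "php", "javascript", "python", "ruby"],
    "Science": ["science", "scientific", "brain", "scientists", "researchers"],
}


def seed_match_labels(text, seed_words, min_hits=1):
    """Return every category with at least `min_hits` distinct seed words in `text`.

    A document can receive zero, one, or several labels, which mirrors the
    multi-label setting, but no topic inference is performed here.
    """
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    labels = []
    for category, seeds in seed_words.items():
        hits = sum(1 for seed in seeds if seed in tokens)
        if hits >= min_hits:
            labels.append(category)
    return labels


# Hypothetical example document.
doc = "The senate debated a government grant for python programming courses."
print(seed_match_labels(doc, SEED_WORDS))
```

Such a baseline misses documents that discuss a category without using its seed words, which is the gap that a seed-guided topic model is intended to close.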

About this article

Cite this article

Zha, D., Li, C. Multi-label dataless text classification with topic modeling. Knowl Inf Syst 61, 137–160 (2019). https://doi.org/10.1007/s10115-018-1280-0

Keywords

  • Dataless text classification
  • Topic model
  • Multi-label text classification
  • Spike and slab prior