
Weakly supervised prototype topic model with discriminative seed words: modifying the category prior by self-exploring supervised signals


Abstract

Dataless text classification, a recent paradigm of weakly supervised learning, refers to the task of learning from unlabeled documents and a few predefined representative words per category, known as seed words. Recent generative dataless methods construct document-specific category priors using seed word occurrences only; however, such priors often carry very limited, and even noisy, supervised signals. To remedy this problem, we propose a novel formulation of the category prior. First, for each document, we estimate its label membership degree not only by counting seed word occurrences but also through a novel prototype scheme that captures pseudo-nearest neighboring categories. Second, for each label, we incorporate its frequency prior over the corpus, which is also discriminative knowledge for classification. By incorporating the proposed category prior into a previous generative dataless method, we obtain a novel generative dataless method, namely the Weakly Supervised Prototype Topic Model (Wsptm). Experimental results on real-world datasets demonstrate that Wsptm outperforms existing baseline methods.
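To make the two ingredients of the proposed category prior more tangible, the following is a minimal, illustrative Python sketch that combines seed word occurrences, a pseudo-nearest-neighbor (prototype) vote, and a corpus-level label frequency prior into a normalized document-specific prior. The function name, weighting, and smoothing here are assumptions made for illustration only, not the paper's exact formulation.

```python
import numpy as np

def category_prior(doc_tokens, seed_words, neighbor_labels, label_freq, smooth=1e-2):
    """Illustrative sketch of a document-specific category prior.

    doc_tokens      : list of word tokens in the document
    seed_words      : dict mapping label index (0..L-1) to its set of seed words
    neighbor_labels : label indices of the document's pseudo-nearest neighbors
                      (the prototype scheme); may be empty
    label_freq      : corpus-level label frequency prior, shape (L,)
    """
    L = len(seed_words)
    # (1) membership degree from seed word occurrences
    seed_counts = np.array([sum(tok in seed_words[l] for tok in doc_tokens)
                            for l in range(L)], dtype=float)
    # (2) membership degree from pseudo-nearest neighboring categories
    proto_counts = np.zeros(L)
    for l in neighbor_labels:
        proto_counts[l] += 1.0
    membership = seed_counts + proto_counts + smooth
    # (3) modulate by the label frequency prior and normalize
    prior = membership * np.asarray(label_freq, dtype=float)
    return prior / prior.sum()
```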


Data availability statement

Enquiries about data availability should be directed to the authors.

Notes

  1. Following Li et al. (2018b), we retain the top-5 neighbors of each document in \(\Pi(d)\).

  2. Details of the datasets and seed word sets are given in Sect. 5.1.

  3. Note that the statistics of seed word occurrences (when \(P=0\)) with \(S^L\) (i.e., label descriptions) are exactly those already shown in Table 1.

  4. Here, word co-occurrence means two words co-occur in the same document.

  5. That is, once a word token \(w_{dn}\) is observed, it is counted as occurring \(\pi(w_{dn})\) times.

  6. http://kdd.ics.uci.edu/database/reuters21578/reuters21578.html

  7. http://qwone.com/~jason/20Newsgroups/

  8. https://countwordsfree.com/stopwords

  9. https://github.com/ly233/Seed-Guided-Topic-Model

  10. https://github.com/yumeng5/WeSTClass

  11. http://scikit-learn.org/stable/

  12. http://ml.cs.tsinghua.edu.cn/~jun/gibbs-medlda.shtml

  13. During testing, the manifold regularization is not applied in either model.

References

  • Al-Salemi B, Ayob M, Kendall G, Mohd Noah SA (2019) Multi-label Arabic text categorization: a benchmark and baseline comparison of multi-label learning algorithms. Inf Process Manage 56:212–227

  • Blei DM (2012) Probabilistic topic models. Commun ACM 55:77–84


  • Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022


  • Chang MW, Ratinov L, Roth D, Srikumar V (2008) Importance of semantic representation: dataless classification. In: AAAI conference on artificial intelligence, pp 830–835

  • Chen X, Xia Y, Jin P, Carroll J (2015) Dataless text classification with descriptive LDA. In: AAAI conference on artificial intelligence, pp 2224–2231

  • Costa G, Ortale R (2020) Document clustering meets topic modeling with word embeddings. In: SIAM international conference on data mining, pp 244–252

  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30


  • Dieng AB, Ruiz FJR, Blei DM (2020) Topic modeling in embedding spaces. Trans Assoc Comput Linguist 8:439–453


  • Downey D, Etzioni O (2008) Look ma, no hands: analyzing the monotonic feature abstraction for text classification. Neural Inf Process Syst, pp 393–400

  • Druck G, Mann G, McCallum A (2008) Learning from labeled features using generalized expectation criteria. In: International ACM SIGIR conference on research and development in information retrieval, pp 595–602

  • Fu X, Sun X, Wu H, Cui L, Huang JZ (2018) Weakly supervised topic sentiment joint model with word embeddings. Knowl Based Syst 147:43–54


  • Guan H, Zhou J, Guo M (2009) A class-feature-centroid classifier for text categorization. In: International conference on world wide web, pp 201–210

  • Hingmire S, Chakraborti S (2014) Topic labeled text classification: a weakly supervised approach. In: International ACM SIGIR conference on research and development in information retrieval, pp 385–394

  • Hingmire S, Chougule S, Palshikar GK (2013) Document classification by topic labeling. In: International ACM SIGIR conference on research and development in information retrieval, pp 877–880

  • Kim D, Kim S, Oh A (2012) Dirichlet process with mixed random measures: a nonparametric topic model for labeled data. In: International conference on machine learning, pp 675–682

  • Li C, Xing J, Sun A, Ma Z (2016) Effective document labeling with very few seed words: a topic modeling approach. In: ACM international on conference on information and knowledge management, pp 85–94

  • Li X, Li C, Chi J, Ouyang J (2018) Short text topic modeling by exploring original documents. Knowl Inf Syst 56:443–462


  • Li X, Li C, Chi J, Ouyang J, Li C (2018b) Dataless text classification: a topic modeling approach with document manifold. In: ACM international conference on information and knowledge management, pp 973–982

  • Li X, Ouyang J, Lu Y, Zhou X, Tian T (2015) Group topic model: organizing topics into groups. Inf Retrieval 18:1–25


  • Li X, Ouyang J, Zhou X (2015) Supervised topic models for multi-label classification. Neurocomputing 149:811–819


  • Li X, Yang B (2018) A pseudo label based dataless naive Bayes algorithm for text classification with seed words. In: International conference on computational linguistics, pp 1908–1917

  • Li X, Zhang A, Li C, Ouyang J, Cai Y (2018) Exploring coherent topics by topic modeling with term weighting. Inf Process Manage 54:1345–1358


  • Liu B, Li X, Lee WS, Yu PS (2004) Text classification by labeling words. In: AAAI conference on artificial intelligence, pp 425–430

  • McAuliffe JD, Blei DM (2007) Supervised topic models. Neural Inf Process Syst, pp 121–128

  • Meng Y, Shen J, Zhang C, Han J (2018) Weakly-supervised neural text classification. In: International conference on information and knowledge management, pp 983–992

  • Meng Y, Zhang Y, Huang J, Xiong C, Ji H, Zhang C, Han J (2020) Text classification using label names only: a language model self-training approach. In: Empirical methods in natural language processing, pp 9006–9017

  • Miao Y, Grefenstette E, Blunsom P (2017) Discovering discrete latent topics with neural variational inference. In: International conference on machine learning, pp 2410–2419

  • Momtazi S (2018) Unsupervised latent Dirichlet allocation for supervised question classification. Inf Process Manage 53:380–393


  • Ouyang J, Wang Y, Li X, Li C (2022) Weakly-supervised text classification with Wasserstein barycenters regularization. In: International joint conference on artificial intelligence, pp 3373–3379

  • Pereira RB, Plastino A, Zadrozny B, Merschmann LHC (2018) Correlation analysis of performance measures for multi-label classification. Inf Process Manage 54:359–369


  • Pergola G, Gui L, He Y (2019) TDAM: a topic-dependent attention model for sentiment analysis. Inf Process Manage 56:102084


  • Rodrigues F, Ribeiro MLB, Pereira FC (2017) Learning supervised topic models for classification and regression from crowds. IEEE Trans Pattern Anal Mach Intell 30:2409–2422


  • Shalaby W, Zadrozny W (2019) Learning concept embeddings for dataless classification via efficient bag-of-concepts densification. Knowl Inf Syst 61:1047–1070


  • Wilson AT, Chew PA (2010) Term weighting schemes for latent Dirichlet allocation. In: North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp 465–473

  • Xie P, Xing EP (2013) Integrating document clustering and topic modeling. Uncertainty Artif Intell, pp 694–703

  • Yang M, Zhu D, Chow K (2014) A topic model for building fine-grained domain-specific emotion lexicon. In: Annual meeting of the association for computational linguistics, pp 421–426

  • Yang W, Boyd-Graber J, Resnik P (2015) Birds of a feather linked together: a discriminative topic model using link-based priors. In: Conference on empirical methods in natural language processing, pp 261–266

  • Yang W, Boyd-Graber J, Resnik P (2016) A discriminative topic model using document network structure. In: Annual meeting of the association for computational linguistics, pp 686–696

  • Zha D, Li C (2019) Multi-label dataless text classification with topic modeling. Knowl Inf Syst 61:137–160


  • Zhu J, Ahmed A, Xing EP (2012) MedLDA: maximum margin supervised topic models. J Mach Learn Res 13:2237–2278


  • Zuo Y, Wu J, Zhang H, Lin H, Wang F, Xu K, Xiong H (2016) Topic modeling of short texts: a pseudo-document view. In: ACM SIGKDD international conference on knowledge discovery and data mining, pp 2105–2114


Acknowledgements

We would like to acknowledge support for this project from the National Key R&D Program of China (No. 2021ZD0112501, No. 2021ZD0112502) and the National Natural Science Foundation of China (NSFC) (No. 62276113, No. 61876071) (fund receiver: Dr. Jihong Ouyang). This research was also funded by the University of Economics Ho Chi Minh City (UEH), Vietnam (fund receiver: Dr. Dang Ngoc Hoang Thanh).

Funding

The authors have not disclosed any funding.

Author information


Corresponding author

Correspondence to Dang N. H. Thanh.

Ethics declarations

Conflict of interest

The authors have not disclosed any competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

In this Appendix, we derive the key update equations of Wsptm.

The posteriors of the topic assignment of each word token, i.e., Eqs. 10 and 11: Holding \(\{\theta, \widehat{\theta}, \phi, \widehat{\phi}\}\) fixed, for each word token \(w_{dn}\), the joint distributions of the word token and its topic assignments are given as follows:

$$\begin{aligned}
p\left(w_{dn}, c_{dn}=1, z_{dn}=k\right) &= \theta_d^T \gamma_{w_{dn}} \, \theta_{dk} \, \phi_{kw_{dn}}, \qquad &(18)\\
p\left(w_{dn}, c_{dn}=0, \widehat{z}_{dn}=g\right) &= \left(1-\theta_d^T \gamma_{w_{dn}}\right) \widehat{\theta}_{dg} \, \widehat{\phi}_{gw_{dn}} \qquad &(19)
\end{aligned}$$

The marginal probabilities are given as follows:

$$\begin{aligned}
p\left(w_{dn}, c_{dn}=1\right) &= \theta_d^T \gamma_{w_{dn}} \sum_{i=1}^{K} \theta_{di} \phi_{iw_{dn}}, \qquad &(20)\\
p\left(w_{dn}, c_{dn}=0\right) &= \left(1-\theta_d^T \gamma_{w_{dn}}\right) \sum_{i=1}^{G} \widehat{\theta}_{di} \widehat{\phi}_{iw_{dn}} \qquad &(21)
\end{aligned}$$

With the above formulas, we can compute the posteriors of the topic assignments by applying Bayes' rule:

$$\begin{aligned}
p\left(z_{dn}=k\right) &= \theta_d^T \gamma_{w_{dn}} \frac{\theta_{dk} \phi_{kw_{dn}}}{\sum_{i=1}^{K} \theta_{di} \phi_{iw_{dn}}} \triangleq N_{dnk}, \qquad &(22)\\
p\left(\widehat{z}_{dn}=g\right) &= \left(1-\theta_d^T \gamma_{w_{dn}}\right) \frac{\widehat{\theta}_{dg} \widehat{\phi}_{gw_{dn}}}{\sum_{i=1}^{G} \widehat{\theta}_{di} \widehat{\phi}_{iw_{dn}}} \triangleq \widehat{N}_{dng} \qquad &(23)
\end{aligned}$$

Note that we omit the conditions \(c_{dn}=1\) and \(c_{dn}=0\) in Eqs. 22 and 23 to keep the equations simple.
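As a concrete illustration, the sketch below evaluates Eqs. 22 and 23 for a single word token with NumPy. The variable names and array shapes (in particular, treating \(\gamma\) as a V-by-K matrix so that \(\theta_d^T\gamma_{w}\) is an inner product) are assumptions made for the sketch, not the paper's exact data structures.

```python
import numpy as np

def topic_posteriors(theta_d, theta_hat_d, phi, phi_hat, gamma, w):
    """Posterior responsibilities of Eqs. (22)-(23) for one word token (sketch).

    theta_d     : category-topic proportions of document d, shape (K,)
    theta_hat_d : background-topic proportions of document d, shape (G,)
    phi         : category-topic word distributions, shape (K, V)
    phi_hat     : background-topic word distributions, shape (G, V)
    gamma       : assumed per-word category relevance vectors, shape (V, K)
    w           : vocabulary index of the word token w_dn
    """
    switch = theta_d @ gamma[w]                      # theta_d^T gamma_w, i.e. p(c_dn = 1)
    num_k = theta_d * phi[:, w]                      # theta_dk * phi_{k,w}
    N_dn = switch * num_k / num_k.sum()              # Eq. (22): N_dnk over k = 1..K
    num_g = theta_hat_d * phi_hat[:, w]              # theta_hat_dg * phi_hat_{g,w}
    N_hat_dn = (1.0 - switch) * num_g / num_g.sum()  # Eq. (23): N_hat_dng over g = 1..G
    return N_dn, N_hat_dn
```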

The update equations of \(\{\widehat{\theta}, \phi, \widehat{\phi}\}\), i.e., Eqs. 12, 13, and 14, and the “initialization” equation of \(\theta\), i.e., Eq. 15: Given the current posteriors \(\{N_{dnk}\}_{k=1}^K\) and \(\{\widehat{N}_{dng}\}_{g=1}^G\) of all word tokens and the pre-computed word weights \(\pi\), we can use them to form soft occurrence counts of the distribution-specific samples. Take \(\widehat{\theta}\) as an example. For each document \(d\), we can regard \(\{\sum_{n=1}^{N_d}\pi(w_{dn})\widehat{N}_{dng}\}_{g=1}^G\) as the soft occurrence counts of the background topics, and we know that \(\widehat{\theta}\) is drawn from a Dirichlet prior with parameter \(\widehat{\alpha}\). Accordingly, this is a Dirichlet-Multinomial estimation, which directly gives the update equation of \(\widehat{\theta}\):

$$\widehat{\theta}_{dg} = \frac{\sum_{n=1}^{N_d} \pi(w_{dn}) \widehat{N}_{dng} + \widehat{\alpha}}{\sum_{i=1}^{G} \sum_{n=1}^{N_d} \pi(w_{dn}) \widehat{N}_{dni} + G \widehat{\alpha}} \qquad (24)$$
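Viewed as code, Eq. 24 is a weighted soft count followed by Dirichlet smoothing. The minimal NumPy sketch below assumes the per-token posteriors \(\widehat{N}_{dng}\) and word weights \(\pi(w_{dn})\) of one document are stored as arrays; the function name and shapes are illustrative only.

```python
import numpy as np

def update_theta_hat(N_hat_d, pi_d, alpha_hat):
    """Dirichlet-Multinomial update of Eq. (24) for one document (sketch).

    N_hat_d   : posteriors N_hat_{dng} for every token, shape (N_d, G)
    pi_d      : word weights pi(w_dn) for every token, shape (N_d,)
    alpha_hat : symmetric Dirichlet prior (scalar)
    """
    G = N_hat_d.shape[1]
    # weighted soft occurrence counts per background topic
    soft_counts = (pi_d[:, None] * N_hat_d).sum(axis=0)
    return (soft_counts + alpha_hat) / (soft_counts.sum() + G * alpha_hat)
```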

The update formulas of \(\{\theta, \phi, \widehat{\phi}\}\) are derived similarly to that of \(\widehat{\theta}\), so we omit the details.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, X., Wang, B., Wang, Y. et al. Weakly supervised prototype topic model with discriminative seed words: modifying the category prior by self-exploring supervised signals. Soft Comput 27, 5397–5410 (2023). https://doi.org/10.1007/s00500-022-07771-9

