
Weakly supervised prototype topic model with discriminative seed words: modifying the category prior by self-exploring supervised signals


Abstract

Dataless text classification, a recent paradigm of weakly supervised learning, refers to the task of learning from unlabeled documents and a few predefined representative words per category, known as seed words. Recent generative dataless methods construct document-specific category priors using seed word occurrences only; however, such priors often carry very limited, and even noisy, supervised signals. To remedy this problem, we propose a novel formulation of the category prior. First, for each document, we estimate its label membership degree not only by counting seed word occurrences but also through a novel prototype scheme that captures pseudo-nearest neighboring categories. Second, for each label, we incorporate its frequency prior over the corpus, which is also discriminative knowledge for classification. By incorporating the proposed category prior into a previous generative dataless method, we obtain a novel generative dataless method, namely the Weakly Supervised Prototype Topic Model (Wsptm). Experimental results on real-world datasets demonstrate that Wsptm outperforms existing baseline methods.
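To make the two ingredients of the proposed category prior more tangible, the following is a minimal, illustrative Python sketch that combines seed word occurrences, a pseudo-nearest-neighbor (prototype) vote, and a corpus-level label frequency prior into a normalized document-specific prior. The function name, weighting, and smoothing here are assumptions made for illustration only, not the paper's exact formulation.

```python
import numpy as np

def category_prior(doc_tokens, seed_words, neighbor_labels, label_freq, smooth=1e-2):
    """Illustrative sketch of a document-specific category prior.

    doc_tokens      : list of word tokens in the document
    seed_words      : dict mapping label index (0..L-1) to its set of seed words
    neighbor_labels : label indices of the document's pseudo-nearest neighbors
                      (the prototype scheme); may be empty
    label_freq      : corpus-level label frequency prior, shape (L,)
    """
    L = len(seed_words)
    # (1) membership degree from seed word occurrences
    seed_counts = np.array([sum(tok in seed_words[l] for tok in doc_tokens)
                            for l in range(L)], dtype=float)
    # (2) membership degree from pseudo-nearest neighboring categories
    proto_counts = np.zeros(L)
    for l in neighbor_labels:
        proto_counts[l] += 1.0
    membership = seed_counts + proto_counts + smooth
    # (3) modulate by the label frequency prior and normalize
    prior = membership * np.asarray(label_freq, dtype=float)
    return prior / prior.sum()
```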


Data availability statement

Enquiries about data availability should be directed to the authors.

Notes

  1. Following Li et al. (2018b), we retain the top-5 neighbors of each document in \(\Pi(d)\).

  2. Details of the datasets and seed word sets are given in Sect. 5.1.

  3. Note that the statistics of seed word occurrences (when \(P=0\)) with \(S^L\) (i.e., label descriptions) are exactly those already shown in Table 1.

  4. Here, word co-occurrence means two words co-occur in the same document.

  5. That is, once a word token \(w_{dn}\) is observed, it is counted as occurring \(\pi(w_{dn})\) times.

  6. http://kdd.ics.uci.edu/database/reuters21578/reuters21578.html

  7. http://qwone.com/~jason/20Newsgroups/

  8. https://countwordsfree.com/stopwords

  9. https://github.com/ly233/Seed-Guided-Topic-Model

  10. https://github.com/yumeng5/WeSTClass

  11. http://scikit-learn.org/stable/

  12. http://ml.cs.tsinghua.edu.cn/~jun/gibbs-medlda.shtml

  13. During testing, the manifold regularization is not applied in either model.

References

  • Al-Salemi B, Ayob M, Kendall G, Mohd Noah SA (2019) Multi-label Arabic text categorization: a benchmark and baseline comparison of multi-label learning algorithms. Inf Process Manage 56:212–227

  • Blei DM (2012) Probabilistic topic models. Commun ACM 55:77–84


  • Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022


  • Chang MW, Ratinov L, Roth D, Srikumar V (2008) Importance of semantic representation: dataless classification. In: AAAI conference on artificial intelligence, pp 830–835

  • Chen X, Xia Y, Jin P, Carroll J (2015) Dataless text classification with descriptive LDA. In: AAAI conference on artificial intelligence, pp 2224–2231

  • Costa G, Ortale R (2020) Document clustering meets topic modeling with word embeddings. In: SIAM international conference on data mining, pp 244–252

  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30


  • Dieng AB, Ruiz FJR, Blei DM (2020) Topic modeling in embedding spaces. Trans Assoc Comput Linguist 8:439–453


  • Downey D, Etzioni O (2008) Look ma, no hands: analyzing the monotonic feature abstraction for text classification. Neural Inf Process Syst, pp 393–400

  • Druck G, Mann G, McCallum A (2008) Learning from labeled features using generalized expectation criteria. In: International ACM SIGIR conference on research and development in information retrieval, pp 595–602

  • Fu X, Sun X, Wu H, Cui L, Huang JZ (2018) Weakly supervised topic sentiment joint model with word embeddings. Knowl Based Syst 147:43–54


  • Guan H, Zhou J, Guo M (2009) A class-feature-centroid classifier for text categorization. In: International conference on world wide web, pp 201–210

  • Hingmire S, Chakraborti S (2014) Topic labeled text classification: a weakly supervised approach. In: International ACM SIGIR conference on research and development in information retrieval, pp 385–394

  • Hingmire S, Chougule S, Palshikar GK (2013) Document classification by topic labeling. In: International ACM SIGIR conference on research and development in information retrieval, pp 877–880

  • Kim D, Kim S, Oh A (2012) Dirichlet process with mixed random measures: a nonparametric topic model for labeled data. In: International conference on machine learning, pp 675–682

  • Li C, Xing J, Sun A, Ma Z (2016) Effective document labeling with very few seed words: a topic modeling approach. In: ACM international on conference on information and knowledge management, pp 85–94

  • Li X, Li C, Chi J, Ouyang J (2018) Short text topic modeling by exploring original documents. Knowl Inf Syst 56:443–462


  • Li X, Li C, Chi J, Ouyang J, Li C (2018b) Dataless text classification: a topic modeling approach with document manifold. In: ACM international conference on information and knowledge management, pp 973–982

  • Li X, Ouyang J, Lu Y, Zhou X, Tian T (2015) Group topic model: organizing topics into groups. Inf Retrieval 18:1–25


  • Li X, Ouyang J, Zhou X (2015) Supervised topic models for multi-label classification. Neurocomputing 149:811–819


  • Li X, Yang B (2018) A pseudo label based dataless naive Bayes algorithm for text classification with seed words. In: International conference on computational linguistics, pp 1908–1917

  • Li X, Zhang A, Li C, Ouyang J, Cai Y (2018) Exploring coherent topics by topic modeling with term weighting. Inf Process Manage 54:1345–1358


  • Liu B, Li X, Lee WS, Yu PS (2004) Text classification by labeling words. In: AAAI conference on artificial intelligence, pp 425–430

  • McAuliffe JD, Blei DM (2007) Supervised topic models. Neural Inf Process Syst, pp 121–128

  • Meng Y, Shen J, Zhang C, Han J (2018) Weakly-supervised neural text classification. In: International conference on information and knowledge management, pp 983–992

  • Meng Y, Zhang Y, Huang J, Xiong C, Ji H, Zhang C, Han J (2020) Text classification using label names only: a language model self-training approach. In: Empirical methods in natural language processing, pp 9006–9017

  • Miao Y, Grefenstette E, Blunsom P (2017) Discovering discrete latent topics with neural variational inference. In: International conference on machine learning, pp 2410–2419

  • Momtazi S (2018) Unsupervised latent Dirichlet allocation for supervised question classification. Inf Process Manage 53:380–393


  • Ouyang J, Wang Y, Li X, Li C (2022) Weakly-supervised text classification with Wasserstein barycenters regularization. In: International joint conference on artificial intelligence, pp 3373–3379

  • Pereira RB, Plastino A, Zadrozny B, Merschmann LHC (2018) Correlation analysis of performance measures for multi-label classification. Inf Process Manage 54:359–369


  • Pergola G, Gui L, He Y (2019) TDAM: a topic-dependent attention model for sentiment analysis. Inf Process Manage 56:102084


  • Rodrigues F, Ribeiro MLB, Pereira FC (2017) Learning supervised topic models for classification and regression from crowds. IEEE Trans Pattern Anal Mach Intell 30:2409–2422


  • Shalaby W, Zadrozny W (2019) Learning concept embeddings for dataless classification via efficient bag-of-concepts densification. Knowl Inf Syst 61:1047–1070


  • Wilson AT, Chew PA (2010) Term weighting schemes for latent Dirichlet allocation. In: North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp 465–473

  • Xie P, Xing EP (2013) Integrating document clustering and topic modeling. Uncertainty Artif Intell, pp 694–703

  • Yang M, Zhu D, Chow K (2014) A topic model for building fine-grained domain-specific emotion lexicon. In: Annual meeting of the association for computational linguistics, pp 421–426

  • Yang W, Boyd-Graber J, Resnik P (2015) Birds of a feather linked together: a discriminative topic model using link-based priors. In: Conference on empirical methods in natural language processing, pp 261–266

  • Yang W, Boyd-Graber J, Resnik P (2016) A discriminative topic model using document network structure. In: Annual meeting of the association for computational linguistics, pp 686–696

  • Zha D, Li C (2019) Multi-label dataless text classification with topic modeling. Knowl Inf Syst 61:137–160


  • Zhu J, Ahmed A, Xing EP (2012) MedLDA: maximum margin supervised topic models. J Mach Learn Res 13:2237–2278


  • Zuo Y, Wu J, Zhang H, Lin H, Wang F, Xu K, Xiong H (2016) Topic modeling of short texts: a pseudo-document view. In: ACM SIGKDD international conference on knowledge discovery and data mining, pp 2105–2114


Acknowledgements

We would like to acknowledge support for this project from the National Key R&D Program of China (No. 2021ZD0112501, No. 2021ZD0112502) and the National Natural Science Foundation of China (NSFC) (No. 62276113, No. 61876071) (fund receiver: Dr. Jihong Ouyang). This research was also funded by the University of Economics Ho Chi Minh City (UEH), Vietnam (fund receiver: Dr. Dang Ngoc Hoang Thanh).

Funding

The authors have not disclosed any funding.

Author information


Corresponding author

Correspondence to Dang N. H. Thanh.

Ethics declarations

Conflict of interest

The authors have not disclosed any competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

In this Appendix, we derive the key update equations of Wsptm.

The posteriors of the topic assignment of each word token, i.e., Eqs. 10 and 11: Holding \(\{\theta, \widehat{\theta}, \phi, \widehat{\phi}\}\) fixed, for each word token \(w_{dn}\), the joint distributions of the word token and its topic assignments are given as follows:

$$\begin{aligned}
p\left(w_{dn}, c_{dn}=1, z_{dn}=k\right) &= \theta_d^T \gamma_{w_{dn}} \, \theta_{dk} \, \phi_{kw_{dn}}, \qquad &(18)\\
p\left(w_{dn}, c_{dn}=0, \widehat{z}_{dn}=g\right) &= \left(1-\theta_d^T \gamma_{w_{dn}}\right) \widehat{\theta}_{dg} \, \widehat{\phi}_{gw_{dn}} \qquad &(19)
\end{aligned}$$

The marginal probabilities are given as follows:

$$\begin{aligned}
p\left(w_{dn}, c_{dn}=1\right) &= \theta_d^T \gamma_{w_{dn}} \sum_{i=1}^{K} \theta_{di} \phi_{iw_{dn}}, \qquad &(20)\\
p\left(w_{dn}, c_{dn}=0\right) &= \left(1-\theta_d^T \gamma_{w_{dn}}\right) \sum_{i=1}^{G} \widehat{\theta}_{di} \widehat{\phi}_{iw_{dn}} \qquad &(21)
\end{aligned}$$

With the above formulas, we can compute the posteriors of the topic assignments by applying Bayes' rule:

$$\begin{aligned}
p\left(z_{dn}=k\right) &= \theta_d^T \gamma_{w_{dn}} \frac{\theta_{dk} \phi_{kw_{dn}}}{\sum_{i=1}^{K} \theta_{di} \phi_{iw_{dn}}} \triangleq N_{dnk}, \qquad &(22)\\
p\left(\widehat{z}_{dn}=g\right) &= \left(1-\theta_d^T \gamma_{w_{dn}}\right) \frac{\widehat{\theta}_{dg} \widehat{\phi}_{gw_{dn}}}{\sum_{i=1}^{G} \widehat{\theta}_{di} \widehat{\phi}_{iw_{dn}}} \triangleq \widehat{N}_{dng} \qquad &(23)
\end{aligned}$$

Note that we omit the conditions \(c_{dn}=1\) and \(c_{dn}=0\) in Eqs. 22 and 23 to keep the equations simple.
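As a concrete illustration, the sketch below evaluates Eqs. 22 and 23 for a single word token with NumPy. The variable names and array shapes (in particular, treating \(\gamma\) as a V-by-K matrix so that \(\theta_d^T\gamma_{w}\) is an inner product) are assumptions made for the sketch, not the paper's exact data structures.

```python
import numpy as np

def topic_posteriors(theta_d, theta_hat_d, phi, phi_hat, gamma, w):
    """Posterior responsibilities of Eqs. (22)-(23) for one word token (sketch).

    theta_d     : category-topic proportions of document d, shape (K,)
    theta_hat_d : background-topic proportions of document d, shape (G,)
    phi         : category-topic word distributions, shape (K, V)
    phi_hat     : background-topic word distributions, shape (G, V)
    gamma       : assumed per-word category relevance vectors, shape (V, K)
    w           : vocabulary index of the word token w_dn
    """
    switch = theta_d @ gamma[w]                      # theta_d^T gamma_w, i.e. p(c_dn = 1)
    num_k = theta_d * phi[:, w]                      # theta_dk * phi_{k,w}
    N_dn = switch * num_k / num_k.sum()              # Eq. (22): N_dnk over k = 1..K
    num_g = theta_hat_d * phi_hat[:, w]              # theta_hat_dg * phi_hat_{g,w}
    N_hat_dn = (1.0 - switch) * num_g / num_g.sum()  # Eq. (23): N_hat_dng over g = 1..G
    return N_dn, N_hat_dn
```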

The update equations of \(\{\widehat{\theta}, \phi, \widehat{\phi}\}\), i.e., Eqs. 12, 13, and 14, and the “initialization” equation of \(\theta\), i.e., Eq. 15: Given the current posteriors \(\{N_{dnk}\}_{k=1}^K\) and \(\{\widehat{N}_{dng}\}_{g=1}^G\) of all word tokens and the pre-computed word weights \(\pi\), we can use them to form soft occurrence counts of the distribution-specific samples. Take \(\widehat{\theta}\) as an example. For each document \(d\), we can regard \(\{\sum_{n=1}^{N_d}\pi(w_{dn})\widehat{N}_{dng}\}_{g=1}^G\) as the soft occurrence counts of the background topics, and we know that \(\widehat{\theta}\) is drawn from a Dirichlet prior with parameter \(\widehat{\alpha}\). Accordingly, this is a Dirichlet-Multinomial estimation, which directly gives the update equation of \(\widehat{\theta}\):

$$\widehat{\theta}_{dg} = \frac{\sum_{n=1}^{N_d} \pi(w_{dn}) \widehat{N}_{dng} + \widehat{\alpha}}{\sum_{i=1}^{G} \sum_{n=1}^{N_d} \pi(w_{dn}) \widehat{N}_{dni} + G \widehat{\alpha}} \qquad (24)$$
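Viewed as code, Eq. 24 is a weighted soft count followed by Dirichlet smoothing. The minimal NumPy sketch below assumes the per-token posteriors \(\widehat{N}_{dng}\) and word weights \(\pi(w_{dn})\) of one document are stored as arrays; the function name and shapes are illustrative only.

```python
import numpy as np

def update_theta_hat(N_hat_d, pi_d, alpha_hat):
    """Dirichlet-Multinomial update of Eq. (24) for one document (sketch).

    N_hat_d   : posteriors N_hat_{dng} for every token, shape (N_d, G)
    pi_d      : word weights pi(w_dn) for every token, shape (N_d,)
    alpha_hat : symmetric Dirichlet prior (scalar)
    """
    G = N_hat_d.shape[1]
    # weighted soft occurrence counts per background topic
    soft_counts = (pi_d[:, None] * N_hat_d).sum(axis=0)
    return (soft_counts + alpha_hat) / (soft_counts.sum() + G * alpha_hat)
```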

The update formulas of \(\{\theta, \phi, \widehat{\phi}\}\) are derived similarly to that of \(\widehat{\theta}\), so we omit the details.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, X., Wang, B., Wang, Y. et al. Weakly supervised prototype topic model with discriminative seed words: modifying the category prior by self-exploring supervised signals. Soft Comput 27, 5397–5410 (2023). https://doi.org/10.1007/s00500-022-07771-9

