Abstract
Self-training is an effective approach to semi-supervised learning, in which both labeled and unlabeled data are leveraged for training. However, existing self-training frameworks are mostly confined to single-label classification. Applying self-training in the multi-label setting is difficult: unlike single-label classification, there is no mutual-exclusion constraint over categories, and the vast number of possible label vectors makes it harder to discover credible predictions. To realize effective self-training in the multi-label setting, we propose ML-DST and ML-DST+, which utilize contextualized document representations from pretrained language models. We present a BERT-based multi-label classifier and newly designed weighted loss functions for finetuning. We also propose two label propagation-based algorithms, SemLPA and SemLPA+, to enhance multi-label prediction; their similarity measure is iteratively improved through semantic-space finetuning, in which the semantic space of document representations is finetuned to better reflect learnt label correlations. High-confidence label predictions are recognized by examining the prediction score on each category separately, and these predictions are in turn used for both classifier finetuning and semantic-space finetuning. Experimental results show that our approach steadily outperforms representative baselines under different label rates, demonstrating the superiority of the proposed approach.
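The two ideas the abstract combines — propagating labels over a semantic space of document embeddings, and accepting pseudo-labels only when each category's score is individually confident — can be illustrated with a minimal sketch. This is a generic, hypothetical implementation for illustration only, not the authors' exact SemLPA/SemLPA+ or ML-DST algorithms: the neighborhood size, thresholds, and normalization scheme are assumptions, and the embeddings would in practice come from a finetuned BERT encoder.

```python
import numpy as np

def propagate_labels(emb, Y, labeled_mask, alpha=0.9, iters=20, k=10):
    """Hypothetical label-propagation pass over document embeddings.

    emb: (n, d) document representations (e.g., from a BERT encoder)
    Y:   (n, c) multi-label matrix; rows of unlabeled docs start as zeros
    """
    # Cosine similarity in the semantic space of document embeddings.
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    S = norm @ norm.T
    np.fill_diagonal(S, 0.0)
    # Keep only the k most similar neighbours per document.
    if k < S.shape[1]:
        drop = np.argsort(S, axis=1)[:, :-k]
        np.put_along_axis(S, drop, 0.0, axis=1)
    # Symmetric normalization: W = D^{-1/2} S D^{-1/2}.
    d = S.sum(axis=1)
    d[d == 0] = 1.0
    dinv = 1.0 / np.sqrt(d)
    W = S * dinv[:, None] * dinv[None, :]
    # Iterate F <- alpha * W F + (1 - alpha) * Y, clamping labeled rows.
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * (W @ F) + (1 - alpha) * Y
        F[labeled_mask] = Y[labeled_mask]
    return F

def select_pseudo_labels(F, labeled_mask, pos_thr=0.7, neg_thr=0.05):
    """Per-category confidence check: accept an unlabeled document as a
    pseudo-labeled example only if every category's score is decisive,
    i.e., confidently positive or confidently negative."""
    pos = F >= pos_thr
    decided = (F >= pos_thr) | (F <= neg_thr)
    usable = decided.all(axis=1) & ~labeled_mask
    return pos, usable
```

In a self-training loop of the kind the abstract describes, the documents flagged by `select_pseudo_labels` would be added to the training set for the next round of classifier and semantic-space finetuning, after which the embeddings (and hence the similarity graph) are recomputed.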
Funding
The funding was provided by Japan Society for the Promotion of Science (JP22J12044).
About this article
Cite this article
Xu, Z., Iwaihara, M. Self-training involving semantic-space finetuning for semi-supervised multi-label document classification. Int J Digit Libr 25, 25–39 (2024). https://doi.org/10.1007/s00799-023-00355-4