
Self-training involving semantic-space finetuning for semi-supervised multi-label document classification

Published in: International Journal on Digital Libraries

Abstract

Self-training is an effective approach to semi-supervised learning, in which both labeled and unlabeled data are leveraged for training. However, existing self-training frameworks are mostly confined to single-label classification. Applying self-training in the multi-label setting is difficult because, unlike single-label classification, categories are not mutually exclusive, and the vast number of possible label vectors makes it harder to identify credible predictions. To realize effective self-training in the multi-label setting, we propose ML-DST and ML-DST+, which utilize contextualized document representations from pretrained language models. We introduce a BERT-based multi-label classifier together with newly designed weighted loss functions for finetuning. We also propose two label propagation-based algorithms, SemLPA and SemLPA+, to enhance multi-label prediction; their similarity measure is iteratively improved through semantic-space finetuning, in which the semantic space formed by document representations is finetuned to better reflect learnt label correlations. High-confidence label predictions are recognized by examining the prediction score on each category separately, and these predictions are in turn used for both classifier finetuning and semantic-space finetuning. Experimental results show that our approach consistently outperforms representative baselines under different label rates, demonstrating its effectiveness.
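The two mechanisms highlighted in the abstract, per-category confidence filtering of pseudo-labels and label propagation over a similarity graph built from document embeddings, can be illustrated with a minimal sketch. The threshold values, function names, and use of cosine-style row-normalized similarities below are assumptions made for illustration; this is not the paper's implementation of ML-DST or SemLPA.

import numpy as np

def select_confident_pseudo_labels(scores, pos_thresh=0.9, neg_thresh=0.1):
    # scores: (n_docs, n_labels) per-category prediction scores in [0, 1]
    # (e.g. sigmoid outputs of a multi-label classifier).
    # Each (document, category) entry is judged separately, so no constraint
    # over the full label vector is needed. Thresholds are illustrative.
    pseudo = (scores >= pos_thresh).astype(int)                    # confident positives
    confident = (scores >= pos_thresh) | (scores <= neg_thresh)    # entries usable for finetuning
    return pseudo, confident

def propagate_labels(sim, labels, labeled_mask, alpha=0.5, n_iter=10):
    # sim: (n, n) similarity matrix computed from document embeddings.
    # labels: (n, n_labels) initial label matrix (zeros for unlabeled documents).
    # labeled_mask: (n,) boolean array marking labeled documents, whose labels
    # are clamped back after every iteration. Generic propagation, not SemLPA itself.
    P = sim / np.clip(sim.sum(axis=1, keepdims=True), 1e-12, None)  # row-normalize into transitions
    F = labels.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (P @ F) + (1 - alpha) * labels   # spread scores, keep the label prior
        F[labeled_mask] = labels[labeled_mask]        # clamp known labels
    return F

In a self-training loop of this style, the classifier's per-category scores on unlabeled documents would be filtered by the first function, the confident entries added to the training pool, and propagation over the (finetuned) semantic space used to refine the remaining predictions before the next round.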



Funding

This work was funded by the Japan Society for the Promotion of Science (Grant No. JP22J12044).

Author information


Corresponding author

Correspondence to Zhewei Xu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xu, Z., Iwaihara, M. Self-training involving semantic-space finetuning for semi-supervised multi-label document classification. Int J Digit Libr 25, 25–39 (2024). https://doi.org/10.1007/s00799-023-00355-4

