Abstract
Instance-level correlation and cluster-level discrepancy are two crucial aspects of short text clustering. Current deep clustering methods, however, estimate either the instance-level correlation or the cluster-level discrepancy of the data inaccurately and rely strongly on the quality of the initial text representation. In this paper, we propose a Non-outlier Pseudo-labeling-based Short Text Clustering (NPLC) method, which consists of two parts. In the first part, we pre-train the feature model on a given dataset with a Masked Language Model (MLM) objective to enhance the initial text representation. The second part is a joint training stage based on non-outlier pseudo-labeling: we first cluster the dataset and, within each cluster, select the cluster labels of outlier-free data as pseudo labels for the subsequent joint training under a novel framework. This framework uses a contrastive loss to achieve strong inter-cluster separation by minimizing the similarity between outlier-free and outlier data, and a clustering loss to narrow intra-cluster distances by maximizing the similarity among outlier-free data. Extensive experimental results demonstrate that NPLC achieves significant improvements over existing methods and advances the state-of-the-art results on most benchmark datasets, with improvements of 1%–12% in Accuracy and 1%–6% in Normalized Mutual Information.
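The two ideas in the abstract — selecting outlier-free points per cluster as pseudo labels, and a joint objective that pulls outlier-free points of a cluster together while pushing them away from outliers — can be illustrated with a minimal numpy sketch. This is an assumption-laden toy, not the paper's actual losses or selection rule: the function names, the distance-to-centroid outlier criterion, the `keep_ratio` parameter, and the particular similarity-based loss form are all hypothetical choices made here for clarity.

```python
import numpy as np

def select_non_outliers(embeddings, labels, keep_ratio=0.8):
    """Per cluster, mark the points closest to the cluster centroid as
    non-outliers (a simple, hypothetical stand-in for outlier filtering)."""
    mask = np.zeros(len(embeddings), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = embeddings[idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        keep = idx[np.argsort(dists)[: max(1, int(keep_ratio * len(idx)))]]
        mask[keep] = True
    return mask

def joint_loss(embeddings, labels, mask, temperature=0.5):
    """Toy joint objective in the spirit of the abstract:
    - clustering term: maximize similarity among non-outliers of a cluster
      (implemented as minimizing its negative);
    - contrastive term: minimize similarity between non-outliers and outliers."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature  # cosine similarities, temperature-scaled
    clustering, contrastive = 0.0, 0.0
    outliers = ~mask
    for c in np.unique(labels):
        core = (labels == c) & mask  # outlier-free members of cluster c
        if core.sum() > 1:
            s = sim[np.ix_(core, core)]
            # mean pairwise similarity among non-outliers, self-pairs excluded
            clustering += -(s.sum() - np.trace(s)) / (core.sum() * (core.sum() - 1))
        if core.any() and outliers.any():
            contrastive += sim[np.ix_(core, outliers)].mean()
    return clustering + contrastive

# Usage on two synthetic, well-separated "clusters":
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (10, 4)) + 5,
                 rng.normal(0, 0.1, (10, 4)) - 5])
labels = np.array([0] * 10 + [1] * 10)
mask = select_non_outliers(emb, labels, keep_ratio=0.8)
loss = joint_loss(emb, labels, mask)
```

In the actual method the losses drive gradient updates of the text encoder and the pseudo labels are refreshed by re-clustering; this sketch only shows how the non-outlier mask partitions each cluster and how the two terms act in opposite directions on similarity.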
Notes
- 1.
Our code is available at https://github.com/zhoufangquan/NPLC.
Acknowledgements
This work is supported by the Fundamental Research Funds for the Central Universities under Grants XGBDFZ04 and ZYGX2019F005, and Sichuan Provincial Social Science Programs Project under Grants SC22EZD065.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhou, F., Gui, S. (2023). Non-Outlier Pseudo-Labeling for Short Text Clustering. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14262. Springer, Cham. https://doi.org/10.1007/978-3-031-44201-8_9
DOI: https://doi.org/10.1007/978-3-031-44201-8_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44200-1
Online ISBN: 978-3-031-44201-8
eBook Packages: Computer Science, Computer Science (R0)