Abstract
Instance-level correlation and cluster-level discrepancy are two crucial aspects of short text clustering. Current deep clustering methods, however, estimate either the instance-level correlation or the cluster-level discrepancy of the data inaccurately and rely strongly on the quality of the initial text representation. In this paper, we propose a Non-outlier Pseudo-labeling-based Short Text Clustering (NPLC) method, which consists of two parts. In the first part, we pre-train the feature model on a given dataset with a Masked Language Model (MLM) objective to enhance the initial text representation. The second part is a joint training stage based on non-outlier pseudo-labeling: we first cluster the dataset and, within each cluster, select the cluster labels of outlier-free data as pseudo labels for the subsequent joint training under a novel framework. This framework uses a contrastive loss to achieve strong inter-cluster separation by minimizing the similarity between outlier-free and outlier data, and a clustering loss to narrow intra-cluster distances by maximizing the similarity among outlier-free data. Extensive experimental results demonstrate that NPLC achieves significant improvements over existing methods and advances the state-of-the-art results on most benchmark datasets, with improvements of 1%–12% in Accuracy and 1%–6% in Normalized Mutual Information.
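The two ideas in the abstract — selecting outlier-free points per cluster as pseudo labels, and a joint objective that pulls outlier-free points of a cluster together while pushing them away from outliers — can be illustrated with a minimal numpy sketch. This is an assumption-laden toy, not the paper's actual losses or selection rule: the function names, the distance-to-centroid outlier criterion, the `keep_ratio` parameter, and the particular similarity-based loss form are all hypothetical choices made here for clarity.

```python
import numpy as np

def select_non_outliers(embeddings, labels, keep_ratio=0.8):
    """Per cluster, mark the points closest to the cluster centroid as
    non-outliers (a simple, hypothetical stand-in for outlier filtering)."""
    mask = np.zeros(len(embeddings), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centroid = embeddings[idx].mean(axis=0)
        dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        keep = idx[np.argsort(dists)[: max(1, int(keep_ratio * len(idx)))]]
        mask[keep] = True
    return mask

def joint_loss(embeddings, labels, mask, temperature=0.5):
    """Toy joint objective in the spirit of the abstract:
    - clustering term: maximize similarity among non-outliers of a cluster
      (implemented as minimizing its negative);
    - contrastive term: minimize similarity between non-outliers and outliers."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature  # cosine similarities, temperature-scaled
    clustering, contrastive = 0.0, 0.0
    outliers = ~mask
    for c in np.unique(labels):
        core = (labels == c) & mask  # outlier-free members of cluster c
        if core.sum() > 1:
            s = sim[np.ix_(core, core)]
            # mean pairwise similarity among non-outliers, self-pairs excluded
            clustering += -(s.sum() - np.trace(s)) / (core.sum() * (core.sum() - 1))
        if core.any() and outliers.any():
            contrastive += sim[np.ix_(core, outliers)].mean()
    return clustering + contrastive

# Usage on two synthetic, well-separated "clusters":
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (10, 4)) + 5,
                 rng.normal(0, 0.1, (10, 4)) - 5])
labels = np.array([0] * 10 + [1] * 10)
mask = select_non_outliers(emb, labels, keep_ratio=0.8)
loss = joint_loss(emb, labels, mask)
```

In the actual method the losses drive gradient updates of the text encoder and the pseudo labels are refreshed by re-clustering; this sketch only shows how the non-outlier mask partitions each cluster and how the two terms act in opposite directions on similarity.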
Notes
- 1.
Our code is available at https://github.com/zhoufangquan/NPLC.
Acknowledgements
This work is supported by the Fundamental Research Funds for the Central Universities under Grants XGBDFZ04 and ZYGX2019F005, and Sichuan Provincial Social Science Programs Project under Grants SC22EZD065.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhou, F., Gui, S. (2023). Non-Outlier Pseudo-Labeling for Short Text Clustering. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14262. Springer, Cham. https://doi.org/10.1007/978-3-031-44201-8_9
DOI: https://doi.org/10.1007/978-3-031-44201-8_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44200-1
Online ISBN: 978-3-031-44201-8
eBook Packages: Computer Science, Computer Science (R0)