
Non-Outlier Pseudo-Labeling for Short Text Clustering

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2023 (ICANN 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14262)


Abstract

Instance-level correlation and cluster-level discrepancy of data are two crucial aspects of short text clustering. Current deep clustering methods, however, suffer from inaccurate estimation of either the instance-level correlation or the cluster-level discrepancy of data, and strongly rely on the quality of the initial text representation. In this paper, we propose a Non-outlier Pseudo-labeling-based Short Text Clustering (NPLC) method, which consists of two parts. In the first part, we use a Masked Language Model (MLM) objective to pre-train the feature model on a given dataset to enhance the initial text representation. The second part is a joint training based on non-outlier pseudo-labeling: we first cluster the dataset and, within each cluster, select the cluster labels of outlier-free data as pseudo labels for the subsequent joint training under a novel framework. This framework uses a contrastive loss to achieve strong inter-cluster separation by minimizing the similarity between outlier-free and outlier data, and a clustering loss to narrow intra-cluster distances by maximizing the similarity among outlier-free data. Extensive experimental results demonstrate that NPLC achieves significant improvements over existing methods and advances the state-of-the-art results on most benchmark datasets, with improvements of 1%–12% in Accuracy and 1%–6% in Normalized Mutual Information.
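
As a concrete illustration of the pseudo-labeling step, the sketch below clusters sentence embeddings and keeps cluster labels only for the points an outlier detector marks as inliers. This is a minimal Python sketch, not the authors' implementation: the choice of K-Means for the initial clustering, Local Outlier Factor for per-cluster outlier screening, and the helper name select_pseudo_labels are all assumptions made for illustration (the paper's actual code is linked in the Notes below).

    # Minimal sketch of non-outlier pseudo-label selection (illustrative
    # assumptions: K-Means clustering and Local Outlier Factor screening).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import LocalOutlierFactor

    def select_pseudo_labels(embeddings: np.ndarray, n_clusters: int):
        """Return (labels, inlier_mask): labels[i] is the cluster id of
        text i, and inlier_mask[i] is True iff text i is judged
        outlier-free within its own cluster; only those labels serve as
        pseudo labels in the subsequent joint training."""
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
        inlier_mask = np.zeros(len(embeddings), dtype=bool)
        for c in range(n_clusters):
            idx = np.where(labels == c)[0]
            if len(idx) < 3:                        # too few points to score
                inlier_mask[idx] = True
                continue
            lof = LocalOutlierFactor(n_neighbors=min(20, len(idx) - 1))
            inlier_mask[idx] = lof.fit_predict(embeddings[idx]) == 1  # 1 = inlier
        return labels, inlier_mask

In the subsequent joint training, a contrastive term would push the inliers selected here away from the flagged outliers (inter-cluster separation), while a clustering term pulls same-cluster inliers together (intra-cluster compactness), mirroring the two losses described above.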


Notes

  1. Our code is available at https://github.com/zhoufangquan/NPLC.


Acknowledgements

This work is supported by the Fundamental Research Funds for the Central Universities under Grants XGBDFZ04 and ZYGX2019F005, and by the Sichuan Provincial Social Science Programs Project under Grant SC22EZD065.

Author information

Corresponding author

Correspondence to Shenglin Gui.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhou, F., Gui, S. (2023). Non-Outlier Pseudo-Labeling for Short Text Clustering. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14262. Springer, Cham. https://doi.org/10.1007/978-3-031-44201-8_9


  • DOI: https://doi.org/10.1007/978-3-031-44201-8_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44200-1

  • Online ISBN: 978-3-031-44201-8

  • eBook Packages: Computer Science (R0)
