Abstract
Clustering in Natural Language Processing (NLP) groups similar phrases or documents into meaningful clusters based on their semantic content, a capability useful in information extraction tasks such as topic modeling, document retrieval, and text summarization. However, clustering documents in low-resource languages poses unique challenges due to limited linguistic resources and a lack of carefully curated data. These challenges extend to language modeling, where training Transformer-based Language Models (LMs) requires large amounts of data to generate meaningful representations. To this end, we created two new corpora from Greek media sources and present a Transformer-based contrastive learning approach to document clustering. We improve low-resource LMs through in-domain second-phase pre-training (domain adaptation) and learn document representations by contrasting positive examples (i.e., similar documents) with negative examples (i.e., dissimilar documents). By maximizing the similarity between positive pairs and minimizing the similarity between negative pairs, the proposed approach learns representations that capture the underlying structure of the documents. Additionally, we demonstrate that combining language models optimized for different sequence lengths improves performance, and we compare this approach against an unsupervised graph-based summarization method that generates concise, informative summaries of longer documents. By learning effective document representations, our approach can significantly improve the accuracy of clustering tasks such as topic extraction, leading to improved performance in downstream tasks.
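To make the contrastive objective concrete, below is a minimal sketch (not the authors' exact implementation) of a triplet-style contrastive loss over document embeddings. It assumes embeddings are produced elsewhere, e.g., by mean-pooling the token outputs of a domain-adapted Transformer encoder; the margin value, batch size, and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet-style contrastive loss over document embeddings.

    Pushes the cosine similarity of (anchor, positive) pairs above
    that of (anchor, negative) pairs by at least `margin`.
    """
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin - sim_pos + sim_neg).mean()

# Toy usage with random "document embeddings"; in practice these would
# come from a Transformer encoder (e.g., mean-pooled token embeddings).
torch.manual_seed(0)
anchor = torch.randn(8, 768)
positive = anchor + 0.1 * torch.randn(8, 768)  # similar documents
negative = torch.randn(8, 768)                 # dissimilar documents
loss = contrastive_triplet_loss(anchor, positive, negative)
print(f"loss = {loss.item():.4f}")
```

Minimizing this loss drives similar documents together and dissimilar documents apart in the embedding space, which is the property a downstream clustering algorithm then exploits.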
This research was carried out as part of the project KMP6-0096055 under the framework of the Action “Investment Plans of Innovation” of the Operational Program “Central Macedonia 2014–2020”, which is co-funded by the European Regional Development Fund and Greece.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zaikis, D., Kokkas, S., Vlahavas, I. (2023). DACL: A Domain-Adapted Contrastive Learning Approach to Low Resource Language Representations for Document Clustering Tasks. In: Iliadis, L., Maglogiannis, I., Alonso, S., Jayne, C., Pimenidis, E. (eds) Engineering Applications of Neural Networks. EANN 2023. Communications in Computer and Information Science, vol 1826. Springer, Cham. https://doi.org/10.1007/978-3-031-34204-2_47