Abstract
Clustering in Natural Language Processing (NLP) groups similar phrases or documents into meaningful clusters based on their semantic content, a capability useful in information extraction tasks such as topic modeling, document retrieval, and text summarization. However, clustering documents in low-resource languages poses unique challenges due to limited linguistic resources and a lack of carefully curated data. These challenges extend to language modeling, where training Transformer-based Language Models (LMs) requires large amounts of data to generate meaningful representations. To this end, we created two new corpora from Greek media sources and present a Transformer-based contrastive learning approach to document clustering. We improve low-resource LMs through in-domain second-phase pre-training (domain adaptation) and learn document representations by contrasting positive examples (i.e., similar documents) with negative examples (i.e., dissimilar documents). By maximizing the similarity between positive pairs and minimizing the similarity between negative pairs, the proposed approach learns representations that capture the underlying structure of the documents. Additionally, we demonstrate that combining language models optimized for different sequence lengths improves performance, and we compare this approach against an unsupervised graph-based summarization method that generates concise, informative summaries of longer documents. By learning effective document representations, our approach can significantly improve the accuracy of clustering tasks such as topic extraction, leading to improved performance in downstream tasks.
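To make the contrastive objective concrete, below is a minimal sketch (not the authors' exact implementation) of a triplet-style contrastive loss over document embeddings. It assumes embeddings are produced elsewhere, e.g., by mean-pooling the token outputs of a domain-adapted Transformer encoder; the margin value, batch size, and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet-style contrastive loss over document embeddings.

    Pushes the cosine similarity of (anchor, positive) pairs above
    that of (anchor, negative) pairs by at least `margin`.
    """
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(margin - sim_pos + sim_neg).mean()

# Toy usage with random "document embeddings"; in practice these would
# come from a Transformer encoder (e.g., mean-pooled token embeddings).
torch.manual_seed(0)
anchor = torch.randn(8, 768)
positive = anchor + 0.1 * torch.randn(8, 768)  # similar documents
negative = torch.randn(8, 768)                 # dissimilar documents
loss = contrastive_triplet_loss(anchor, positive, negative)
print(f"loss = {loss.item():.4f}")
```

Minimizing this loss drives similar documents together and dissimilar documents apart in the embedding space, which is the property a downstream clustering algorithm then exploits.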
This research was carried out as part of the project KMP6-0096055 under the framework of the Action “Investment Plans of Innovation” of the Operational Program “Central Macedonia 2014–2020”, which is co-funded by the European Regional Development Fund and Greece.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zaikis, D., Kokkas, S., Vlahavas, I. (2023). DACL: A Domain-Adapted Contrastive Learning Approach to Low Resource Language Representations for Document Clustering Tasks. In: Iliadis, L., Maglogiannis, I., Alonso, S., Jayne, C., Pimenidis, E. (eds) Engineering Applications of Neural Networks. EANN 2023. Communications in Computer and Information Science, vol 1826. Springer, Cham. https://doi.org/10.1007/978-3-031-34204-2_47