Abstract
Text categorization (TC) is one of the most useful tools for automatically organizing large volumes of text. Practitioners use it for many purposes, including sentiment analysis, authorship detection, and spam detection. However, applying TC to a new field can be challenging because it requires training a separate model on a large labeled data set specific to that field. Training is time-consuming, and creating a large, labeled, domain-specific data set is often very hard. To overcome this problem, language models have recently been employed to transfer information learned from a large corpus to a downstream task. Bidirectional Encoder Representations from Transformers (BERT) is one of the most popular language models and has been shown to give very good results on TC tasks. In this study, we therefore use four pretrained BERT models trained on formal text data as well as our own BERT models trained on Facebook messages. We then fine-tune these BERT models on downstream data sets collected from different domains such as Twitter and Instagram. We aim to investigate whether fine-tuned BERT models can provide satisfactory results on downstream tasks from different domains via transfer learning. Our extensive experiments show that BERT models provide very satisfactory results and that selecting the BERT model and the downstream task's data from the same or a similar domain tends to improve performance further. This shows that a well-trained language model can remove the need for a separate training process for each downstream TC task within the online social network (OSN) domain.
Notes
Note that two of these models are also employed in [34], but only for two downstream tasks.
References
Baykara, B.; Güngör, T.: Turkish abstractive text summarization using pretrained sequence-to-sequence models. Nat. Lang. Eng. 1–30 (2022)
Birim, A.; Erden, M.; Arslan, L.M.: Zero-shot Turkish text classification. In: IEEE, pp. 1–4 (2021)
Guven, Z.A.: The comparison of language models with a novel text filtering approach for turkish sentiment analysis. Trans. Asian Low Resour. Lang. Inf. Process. (2022)
Kavi, D.: Turkish text classification: from Lexicon analysis to bidirectional transformer. arXiv preprint arXiv:2104.11642 (2020)
Çelıkten, A.; Bulut, H.: Turkish Medical text classification using BERT. In: IEEE, pp. 1–4 (2021)
Köksal, Ö.; Yılmaz, E.H.: Improving automated Turkish text classification with learning-based algorithms. Concurr. Comput. Pract. Exp. 34(11), e6874 (2022)
Akça, O.; Bayrak, G.; Issifu, A.M.; Ganiz, M.: Traditional machine learning and deep learning-based text classification for Turkish law documents using transformers and domain adaptation. In: IEEE, pp. 1–6 (2022)
Siğirci, İO.; Özgür, H.; Oluk, A., et al.: Sentiment analysis of Turkish reviews on google play store. In: IEEE, pp. 314–315 (2020)
Toraman, C.; Yilmaz, E.H.; Şahinuç, F.; Ozcelik, O.: Impact of tokenization on language models: an analysis for Turkish. ACM Trans. Asian Low Resour. Lang. Inf. Process. 22(4), 1–21 (2023)
Çavuşoğlu, I.; Pielka, M.; Sifa, R.: Adapting established text representations for predicting review sentiment in Turkish. In: IEEE, pp. 755–756 (2020)
Banan, A.; Nasiri, A.; Taheri-Garavand, A.: Deep learning-based appearance features extraction for automated carp species identification. Aquac. Eng. 89, 102053 (2020)
Fan, Y.; Xu, K.; Wu, H.; Zheng, Y.; Tao, B.: Spatiotemporal modeling for nonlinear distributed thermal processes based on KL decomposition, MLP and LSTM network. IEEE Access 8, 25111–25121 (2020)
Wang, W.C.; Du, Y.J.; Chau, K.W.; Xu, D.M.; Liu, C.J.; Ma, Q.: An ensemble hybrid forecasting model for annual runoff based on sample entropy, secondary decomposition, and long short-term memory neural network. Water Resour. Manag. 35, 4695–4726 (2021)
Chen, C.; Zhang, Q.; Kashani, M.H.; Jun, C.; Bateni, S.M.; Band, S.S.; Chau, K.W.: Forecast of rainfall distribution based on fixed sliding window long short-term memory. Eng. Appl. Comput. Fluid Mech. 16(1), 248–261 (2022)
Afan, H.A.; Osman, Ibrahem Ahmed A.; Essam, Y.; Ahmed, A.N.; Huang, Y.F.; Kisi, O., et al.: Modeling the fluctuations of groundwater level by employing ensemble deep learning techniques. Eng. Appl. Comput. Fluid Mech. 15(1), 1420–1439 (2021)
Chen, W.; Sharifrazi, D.; Liang, G.; Band, S.S.; Chau, K.W.; Mosavi, A.: Accurate discharge coefficient prediction of streamlined weirs by coupling linear regression and deep convolutional gated recurrent unit. Eng. Appl. Comput. Fluid Mech. 16(1), 965–976 (2022)
Rukhsar, L.; Bangyal, W.H.; Ali Khan, M.S.; Ag Ibrahim, A.A.; Nisar, K.; Rawat, D.B.: Analyzing RNA-seq gene expression data using deep learning approaches for cancer classification. Appl. Sci. 12(4), 1850 (2022)
Qasim, R.; Bangyal, W.H.; Alqarni, M.A.; Ali Almazroi, A.: A fine-tuned BERT-based transfer learning approach for text classification. J. Healthc. Eng. (2022)
Shah, M.A.; Iqbal, M.J.; Noreen, N.; Ahmed, I.: An automated text document classification framework using BERT. Int. J. Adv. Comput. Sci. Appl. 14(3)
Jayaraman, A.K.; Murugappan, A.; Trueman, T.E.; Ananthakrishnan, G.; Ghosh, A.: Imbalanced aspect categorization using bidirectional encoder representation from transformers. Proc. Comput. Sci. 218, 757–765 (2023)
ElKafrawy, P.; Mahgoub, A.; Atef, H.; Nasser, A.; Yasser, M.; Medhat, W.M.; Darweesh, M.S.: Sentiment analysis: Amazon electronics reviews using BERT and Textblob (2023)
Alruily, M.; Manaf Fazal, A.; Mostafa, A.M.; Ezz, M.: Automated Arabic long-tweet classification using transfer learning with BERT. Appl. Sci. 13(6), 3482 (2023)
Kaur, K.; Kaur, P.: An automatic identification and classification of requirements from App reviews: a transfer learning approach. Available at SSRN 4384162
Patel, A.; Oza, P.; Agrawal, S.: Sentiment analysis of customer feedback and reviews for airline services using language representation model. Proc. Comput. Sci. 218, 2459–2467 (2023)
Mala, J.B.; SJ, A.A.; SM, A.R.; Rajan, R.: Efficacy of ELECTRA-based language model in sentiment analysis. In: 2023 International Conference on Intelligent Systems for Communication, IoT and Security (ICISCoIS), pp. 682–687. IEEE (2023)
Türkmen, H.; Dikenelli, O.; Eraslan, C.; Çallı, M.C.; Özbek, S.S.: Harnessing the power of BERT in the Turkish clinical domain: pretraining approaches for limited data scenarios. arXiv preprint arXiv:2305.03788 (2023)
Kohavi, R.; others.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: 14. Montreal, Canada, pp. 1137–1145 (1995)
Okur, H.I.; Sertbaş, A.: Pretrained neural models for Turkish text classification. In: IEEE, pp. 174–179 (2021)
Şahin, G.; Diri, B.: The effect of transfer learning on Turkish text classification. In: IEEE, pp. 1–4 (2021)
Karayiğit, H.; Akdagli, A.; Aci, Ç.İ: Homophobic and hate speech detection using multilingual-BERT model on Turkish social media. Inf. Technol. Control 51(2), 356–375 (2022)
Özdil, U.; Arslan, B.; Taşar, DE.; Polat, G.; Ozan, Ş.: Ad text classification with bidirectional encoder representations. In: IEEE, pp. 169–173 (2021)
Sonmezoz, K.; Amasyali, M.F.: Same sentence prediction: a new pre-training task for BERT. In: IEEE, pp. 1–6 (2021)
Aytan, B.; Sakar, C.O.: Comparison of transformer-based models trained in Turkish and different languages on Turkish natural language processing problems. In: IEEE, pp. 1–4 (2022)
Coban, O.; Ozel, S.A.; Inan, A.: Detection and cross-domain evaluation of cyberbullying in Facebook activity contents for Turkish. ACM Trans. Asian Low Resour. Lang. Inf. Process. (2023)
Dündar, E.B.; Kiliç, O.F.; Çekiç, T.; Manav, Y.; Deniz, O.: Large scale intent detection in Turkish short sentences with contextual word embeddings, pp. 187–192 (2020)
Toraman, C.; Yilmaz, E.H.; Şahinuç, F.; Ozcelik, O.: Impact of tokenization on language models: an analysis for Turkish. arXiv preprint arXiv:2204.08832 (2022)
Türkmen, H.; Dikenelli, O.; Eraslan, C.; Çalli, M.C.; Ozbek, S.S.: Developing pretrained language models for Turkish biomedical domain. In: IEEE, pp. 597–598 (2022)
Toraman, C.; Şahinuç, F.; Yılmaz, E.H.: Large-scale hate speech detection with cross-domain transfer. arXiv preprint arXiv:2203.01111 (2022)
Turkmen, H.; Dikenelli, O.; Eraslan, C.; Callı, M.C.: Bioberturk: exploring Turkish biomedical language model development strategies in low resource setting (2022)
Mutlu, M.M.; Özgür, A.: A dataset and BERT-based models for targeted sentiment analysis on Turkish texts. arXiv preprint arXiv:2205.04185 (2022)
Toraman, Ç.: Event-related microblog retrieval in Turkish. Turk. J. Electr. Eng. Comput. Sci. 30(3), 1067–1083 (2022)
Karayiğit, H.; Akdagli, A.; Acı, Ç.İ: BERT-based transfer learning model for COVID-19 sentiment analysis on Turkish Instagram comments. Inf. Technol. Control 51(3), 409–428 (2022)
Polat, H.; Korpe, M.: Estimation of demographic traits of the deputies through parliamentary debates using machine learning. Electronics 11(15), 2374 (2022)
Özberk, A.; Çiçekli, İ.: Offensive language detection in Turkish tweets with BERT models. In: 2021 6th International Conference on Computer Science and Engineering (UBMK), pp. 517–521. IEEE (2021)
Tokgoz, M.; Turhan, F.; Bolucu, N.; Can, B.: Tuning language representation models for classification of Turkish news, pp. 402–407 (2021)
Coban, O.; Ali, I.; Ozel, S.A.: Towards the design and implementation of an OSN crawler: a case of Turkish Facebook users. Int. J. Inf. Secur. Sci. 9(2), 76–93 (2020)
Mikolov, T.; Chen, K.; Corrado, G.; Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Özel, S.A.; Saraç, E.; Akdemir, S.; Aksu, H.: Detection of cyberbullying on social media messages in Turkish. In: IEEE, pp. 366–370 (2017)
Karayiğit, H.; Acı, Ç.İ; Akdağlı, A.: Detecting abusive Instagram comments in Turkish using convolutional neural network and machine learning methods. Expert Syst. Appl. 174, 114802 (2021)
Çoban, Ö.; Özel, S.A.; İnan, A.: Deep learning-based sentiment analysis of Facebook data: the case of Turkish users. Comput. J. 64(3), 473–499 (2021)
Çoban, Ö.; İnan, A.; Özel, S.A.: Facebook tells me your gender: an exploratory study of gender prediction for Turkish Facebook users. Trans. Asian Low Resour. Lang. Inf. Process. 20(4), 1–38 (2021)
Hayran, A.; Sert, M.: Sentiment analysis on microblog data based on word embedding and fusion techniques. In: IEEE, pp. 1–4 (2017)
Loper, E.; Bird, S.: Nltk: The natural language toolkit. arXiv preprint arXiv:cs/0205028 (2002)
Torunoğlu, D.; Çakirman, E.; Ganiz, M.C.; Akyokuş, S.; Gürbüz, M.Z.: Analysis of preprocessing methods on classification of Turkish texts. In: IEEE, pp. 112–117 (2011)
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Shim, H.; Luca, S.; Lowet, D.; Vanrumste, B.: Data augmentation and semi-supervised learning for deep neural networks-based text classifier. In: Proceedings of the 35th Annual ACM Symposium on Applied Computing, pp. 1119–1126 (2020)
Schweter, S.: Berturk-BERT models for Turkish. Online https://doi.org/10.5281/zenodo.3770924 (2020)
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Gao, Z.; Feng, A.; Song, X.; Wu, X.: Target-dependent sentiment classification with BERT. IEEE Access 7, 154290–154299 (2019)
Chollet, F.; others.: Keras, GitHub repository (2015). https://keras.io/examples/generative/vae (2020)
Coban, O.: A new modification and application of item response theory-based feature selection for different machine learning tasks. Concurr. Comput. Pract. Exp. 34(26), e7282 (2022)
Appendix
The summary of our MLM-based BERT model (Fig. 3).
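For readers unfamiliar with masked language modeling (MLM), the following is a minimal sketch of the token-masking step described in the original BERT paper, not the exact preprocessing used in this study. The function name `mask_tokens`, the toy vocabulary, and the example sentence are hypothetical. As a simplification, each position is selected independently with probability 0.15, and special tokens are not handled.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) following the BERT masking recipe.

    labels[i] holds the original token at positions chosen for prediction
    and None at every other position.
    Assumption: positions are sampled independently (a simplification).
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


if __name__ == "__main__":
    # Hypothetical whitespace-tokenized sentence and toy vocabulary.
    sentence = "bu film gerçekten çok güzeldi".split()
    toy_vocab = ["kitap", "kötü", "bugün"]
    print(mask_tokens(sentence, toy_vocab, mask_prob=0.5, seed=1))
```

During pretraining, the model is asked to predict the original token only at positions where the label is not `None`. Fine-tuning for a TC task then replaces this prediction head with a classification layer.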
About this article
Cite this article
Coban, O., Yağanoğlu, M. & Bozkurt, F. Domain Effect Investigation for Bert Models Fine-Tuned on Different Text Categorization Tasks. Arab J Sci Eng 49, 3685–3702 (2024). https://doi.org/10.1007/s13369-023-08142-8
DOI: https://doi.org/10.1007/s13369-023-08142-8