Domain Effect Investigation for Bert Models Fine-Tuned on Different Text Categorization Tasks

  • Research Article: Computer Engineering and Computer Science
  • Published in: Arabian Journal for Science and Engineering

Abstract

Text categorization (TC) is one of today’s most useful automatic tools for organizing huge volumes of text data. It is widely used by practitioners to classify texts automatically for different purposes, including sentiment analysis, authorship detection, and spam detection. However, applying TC to a new field can be challenging, since it requires training a separate model on a large labeled data set specific to that field. This is very time-consuming, and creating a large, labeled, domain-specific data set is often very hard. To overcome this problem, language models have recently been employed to transfer information learned from a large corpus to downstream tasks. Bidirectional Encoder Representations from Transformers (BERT) is one of the most popular language models and has been shown to provide very good results on TC tasks. Hence, in this study, we use four pretrained BERT models trained on formal text data as well as our own BERT models trained on Facebook messages. We then fine-tuned these BERT models on downstream data sets collected from different domains, such as Twitter and Instagram. We aim to investigate whether fine-tuned BERT models can provide satisfactory results on downstream tasks from different domains via transfer learning. The results of our extensive experiments show that BERT models provide very satisfactory results, and that selecting both the BERT model and the downstream task’s data from the same or a similar domain further improves performance. This shows that a well-trained language model can remove the need for a separate training process for each different downstream TC task within the OSN domain.
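The domain-matching intuition in the abstract can be illustrated with a minimal, self-contained sketch. Note this is not the paper's method or data: the toy corpora and the Jaccard vocabulary-overlap measure below are our own illustrative assumptions, standing in as a crude proxy for "same domain".

```python
# Illustrative sketch only: the paper fine-tunes BERT models; here we merely
# illustrate "domain similarity" with a vocabulary-overlap proxy.
# All corpora below are invented toy examples, not the paper's data.

def vocab(texts):
    """Lowercased word set of a corpus."""
    return {word for text in texts for word in text.lower().split()}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| between two vocabularies."""
    return len(a & b) / len(a | b)

# A "formal" pretraining corpus vs. an OSN-style (social media) one.
formal_corpus = ["The minister addressed the parliament today",
                 "Economic indicators improved in the last quarter"]
osn_corpus = ["omg this movie was soooo good lol",
              "cant believe u said that lol"]

# Downstream task data drawn from the OSN domain.
downstream = ["lol that was so good", "omg u serious"]

d = vocab(downstream)
print("overlap with formal text:", round(jaccard(vocab(formal_corpus), d), 3))
print("overlap with OSN text:   ", round(jaccard(vocab(osn_corpus), d), 3))
```

By this crude measure the OSN-style vocabulary overlaps the downstream data far more than the formal one does, mirroring the paper's finding that matching the pretraining and downstream domains helps transfer.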

Fig. 1
Fig. 2
Fig. 3

Notes

  1. https://huggingface.co.

  2. https://huggingface.co/models?search=turkish.

  3. https://simpletransformers.ai.

  4. https://keras.io/examples/nlp/masked_language_modeling/.

  5. Note that two of these models are also employed in [34] only for two downstream tasks.

References

  1. Baykara, B.; Güngör, T.: Turkish abstractive text summarization using pretrained sequence-to-sequence models. Nat. Lang. Eng. 1–30 (2022)

  2. Birim, A.; Erden, M.; Arslan, L.M.: Zero-shot Turkish text classification. In: IEEE, pp. 1–4 (2021)

  3. Guven, Z.A.: The comparison of language models with a novel text filtering approach for Turkish sentiment analysis. Trans. Asian Low Resour. Lang. Inf. Process. (2022)

  4. Kavi, D.: Turkish text classification: from Lexicon analysis to bidirectional transformer. arXiv preprint arXiv:2104.11642 (2020)

  5. Çelikten, A.; Bulut, H.: Turkish medical text classification using BERT. In: IEEE, pp. 1–4 (2021)

  6. Köksal, Ö.; Yılmaz, E.H.: Improving automated Turkish text classification with learning-based algorithms. Concurr. Comput. Pract. Exp. 34(11), e6874 (2022)

  7. Akça, O.; Bayrak, G.; Issifu, A.M.; Ganiz, M.: Traditional machine learning and deep learning-based text classification for Turkish law documents using transformers and domain adaptation. In: IEEE, pp. 1–6 (2022)

  8. Siğirci, İ.O.; Özgür, H.; Oluk, A., et al.: Sentiment analysis of Turkish reviews on Google Play Store. In: IEEE, pp. 314–315 (2020)

  9. Toraman, C.; Yilmaz, E.H.; Šahinuč, F.; Ozcelik, O.: Impact of tokenization on language models: an analysis for Turkish. ACM Trans. Asian Low Resour. Lang. Inf. Process. 22(4), 1–21 (2023)

  10. Çavuşoğlu, I.; Pielka, M.; Sifa, R.: Adapting established text representations for predicting review sentiment in Turkish. In: IEEE, pp. 755–756 (2020)

  11. Banan, A.; Nasiri, A.; Taheri-Garavand, A.: Deep learning-based appearance features extraction for automated carp species identification. Aquac. Eng. 89, 102053 (2020)

  12. Fan, Y.; Xu, K.; Wu, H.; Zheng, Y.; Tao, B.: Spatiotemporal modeling for nonlinear distributed thermal processes based on KL decomposition, MLP and LSTM network. IEEE Access 8, 25111–25121 (2020)

  13. Wang, W.C.; Du, Y.J.; Chau, K.W.; Xu, D.M.; Liu, C.J.; Ma, Q.: An ensemble hybrid forecasting model for annual runoff based on sample entropy, secondary decomposition, and long short-term memory neural network. Water Resour. Manag. 35, 4695–4726 (2021)

  14. Chen, C.; Zhang, Q.; Kashani, M.H.; Jun, C.; Bateni, S.M.; Band, S.S.; Chau, K.W.: Forecast of rainfall distribution based on fixed sliding window long short-term memory. Eng. Appl. Comput. Fluid Mech. 16(1), 248–261 (2022)

  15. Afan, H.A.; Ibrahem Ahmed Osman, A.; Essam, Y.; Ahmed, A.N.; Huang, Y.F.; Kisi, O., et al.: Modeling the fluctuations of groundwater level by employing ensemble deep learning techniques. Eng. Appl. Comput. Fluid Mech. 15(1), 1420–1439 (2021)

  16. Chen, W.; Sharifrazi, D.; Liang, G.; Band, S.S.; Chau, K.W.; Mosavi, A.: Accurate discharge coefficient prediction of streamlined weirs by coupling linear regression and deep convolutional gated recurrent unit. Eng. Appl. Comput. Fluid Mech. 16(1), 965–976 (2022)

  17. Rukhsar, L.; Bangyal, W.H.; Ali Khan, M.S.; Ag Ibrahim, A.A.; Nisar, K.; Rawat, D.B.: Analyzing RNA-seq gene expression data using deep learning approaches for cancer classification. Appl. Sci. 12(4), 1850 (2022)

  18. Qasim, R.; Bangyal, W.H.; Alqarni, M.A.; Ali Almazroi, A.: A fine-tuned BERT-based transfer learning approach for text classification. J. Healthc. Eng. (2022)

  19. Shah, M.A.; Iqbal, M.J.; Noreen, N.; Ahmed, I.: An automated text document classification framework using BERT. Int. J. Adv. Comput. Sci. Appl. 14(3)

  20. Jayaraman, A.K.; Murugappan, A.; Trueman, T.E.; Ananthakrishnan, G.; Ghosh, A.: Imbalanced aspect categorization using bidirectional encoder representation from transformers. Proc. Comput. Sci. 218, 757–765 (2023)

  21. ElKafrawy, P.; Mahgoub, A.; Atef, H.; Nasser, A.; Yasser, M.; Medhat, W.M.; Darweesh, M.S.: Sentiment analysis: Amazon electronics reviews using BERT and Textblob (2023)

  22. Alruily, M.; Manaf Fazal, A.; Mostafa, A.M.; Ezz, M.: Automated Arabic long-tweet classification using transfer learning with BERT. Appl. Sci. 13(6), 3482 (2023)

  23. Kaur, K.; Kaur, P.: An automatic identification and classification of requirements from App reviews: a transfer learning approach. Available at SSRN 4384162

  24. Patel, A.; Oza, P.; Agrawal, S.: Sentiment analysis of customer feedback and reviews for airline services using language representation model. Proc. Comput. Sci. 218, 2459–2467 (2023)

  25. Mala, J.B.; SJ, A.A.; SM, A.R.; Rajan, R.: Efficacy of ELECTRA-based language model in sentiment analysis. In: 2023 International Conference on Intelligent Systems for Communication, IoT and Security (ICISCoIS), pp. 682–687. IEEE (2023)

  26. Türkmen, H.; Dikenelli, O.; Eraslan, C.; Çallı, M.C.; Özbek, S.S.: Harnessing the power of BERT in the Turkish clinical domain: pretraining approaches for limited data scenarios. arXiv preprint arXiv:2305.03788 (2023)

  27. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), Montreal, Canada, pp. 1137–1145 (1995)

  28. Okur, H.I.; Sertbaş, A.: Pretrained neural models for Turkish text classification. In: IEEE, pp. 174–179 (2021)

  29. Şahin, G.; Diri, B.: The effect of transfer learning on Turkish text classification. In: IEEE, pp. 1–4 (2021)

  30. Karayiğit, H.; Akdagli, A.; Acı, Ç.İ.: Homophobic and hate speech detection using multilingual-BERT model on Turkish social media. Inf. Technol. Control 51(2), 356–375 (2022)

  31. Özdil, U.; Arslan, B.; Taşar, DE.; Polat, G.; Ozan, Ş.: Ad text classification with bidirectional encoder representations. In: IEEE, pp. 169–173 (2021)

  32. Sonmezoz, K.; Amasyali, M.F.: Same sentence prediction: a new pre-training task for BERT. In: IEEE, pp. 1–6 (2021)

  33. Aytan, B.; Sakar, C.O.: Comparison of transformer-based models trained in Turkish and different languages on Turkish natural language processing problems. In: IEEE, pp. 1–4 (2022)

  34. Coban, O.; Ozel, S.A.; Inan, A.: Detection and cross-domain evaluation of cyberbullying in Facebook activity contents for Turkish. ACM Trans. Asian Low Resour. Lang. Inf. Process. (2023)

  35. Dündar, E.B.; Kiliç, O.F.; Çekiç, T.; Manav, Y.; Deniz, O.: Large scale intent detection in Turkish short sentences with contextual word embeddings, pp. 187–192 (2020)

  36. Toraman, C.; Yilmaz, E.H.; Şahinuç, F.; Ozcelik, O.: Impact of tokenization on language models: an analysis for Turkish. arXiv preprint arXiv:2204.08832 (2022)

  37. Türkmen, H.; Dikenelli, O.; Eraslan, C.; Çalli, M.C.; Ozbek, S.S.: Developing pretrained language models for Turkish biomedical domain. In: IEEE, pp. 597–598 (2022)

  38. Toraman, C.; Şahinuç, F.; Yılmaz, E.H.: Large-scale hate speech detection with cross-domain transfer. arXiv preprint arXiv:2203.01111 (2022)

  39. Turkmen, H.; Dikenelli, O.; Eraslan, C.; Callı, M.C.: Bioberturk: exploring Turkish biomedical language model development strategies in low resource setting (2022)

  40. Mutlu, M.M.; Özgür, A.: A dataset and BERT-based models for targeted sentiment analysis on Turkish texts. arXiv preprint arXiv:2205.04185 (2022)

  41. Toraman, Ç.: Event-related microblog retrieval in Turkish. Turk. J. Electr. Eng. Comput. Sci. 30(3), 1067–1083 (2022)

  42. Karayiğit, H.; Akdagli, A.; Acı, Ç.İ.: BERT-based transfer learning model for COVID-19 sentiment analysis on Turkish Instagram comments. Inf. Technol. Control 51(3), 409–428 (2022)

  43. Polat, H.; Korpe, M.: Estimation of demographic traits of the deputies through parliamentary debates using machine learning. Electronics 11(15), 2374 (2022)

  44. Özberk, A.; Çiçekli, İ.: Offensive language detection in Turkish tweets with BERT models. In: 2021 6th International Conference on Computer Science and Engineering (UBMK), pp. 517–521. IEEE (2021)

  45. Tokgoz, M.; Turhan, F.; Bolucu, N.; Can, B.: Tuning language representation models for classification of Turkish news, pp. 402–407 (2021)

  46. Coban, O.; Ali, I.; Ozel, S.A.: Towards the design and implementation of an OSN crawler: a case of Turkish Facebook users. Int. J. Inf. Secur. Sci. 9(2), 76–93 (2020)

  47. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

  48. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)

  49. Özel, S.A.; Saraç, E.; Akdemir, S.; Aksu, H.: Detection of cyberbullying on social media messages in Turkish. In: IEEE, pp. 366–370 (2017)

  50. Karayiğit, H.; Acı, Ç.İ.; Akdağlı, A.: Detecting abusive Instagram comments in Turkish using convolutional neural network and machine learning methods. Expert Syst. Appl. 174, 114802 (2021)

  51. Çoban, Ö.; Özel, S.A.; İnan, A.: Deep learning-based sentiment analysis of Facebook data: the case of Turkish users. Comput. J. 64(3), 473–499 (2021)

  52. Çoban, Ö.; İnan, A.; Özel, S.A.: Facebook tells me your gender: an exploratory study of gender prediction for Turkish Facebook users. Trans. Asian Low Resour. Lang. Inf. Process. 20(4), 1–38 (2021)

  53. Hayran, A.; Sert, M.: Sentiment analysis on microblog data based on word embedding and fusion techniques. In: IEEE, pp. 1–4 (2017)

  54. Loper, E.; Bird, S.: NLTK: the natural language toolkit. arXiv preprint arXiv:cs/0205028 (2002)

  55. Torunoğlu, D.; Çakirman, E.; Ganiz, M.C.; Akyokuş, S.; Gürbüz, M.Z.: Analysis of preprocessing methods on classification of Turkish texts. In: IEEE, pp. 112–117 (2011)

  56. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  57. Shim, H.; Luca, S.; Lowet, D.; Vanrumste, B.: Data augmentation and semi-supervised learning for deep neural networks-based text classifier. In: Proceedings of the 35th Annual ACM Symposium on Applied Computing, pp. 1119–1126 (2020)

  58. Schweter, S.: BERTurk: BERT models for Turkish. Zenodo. https://doi.org/10.5281/zenodo.3770924 (2020)

  59. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  60. Gao, Z.; Feng, A.; Song, X.; Wu, X.: Target-dependent sentiment classification with BERT. IEEE Access 7, 154290–154299 (2019)

  61. Chollet, F., et al.: Keras. GitHub repository (2015). https://keras.io/examples/generative/vae (2020)

  62. Coban, O.: A new modification and application of item response theory-based feature selection for different machine learning tasks. Concurr. Comput. Pract. Exp. 34(26), e7282 (2022)

Author information

Corresponding author

Correspondence to Ferhat Bozkurt.

Appendix

The summary of our MLM-based BERT model is given in Fig. 3.
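As context for the appendix: masked language modeling (MLM), the pretraining objective behind the model summarized in Fig. 3, hides a fraction of the input tokens and trains the model to recover them from bidirectional context. A minimal sketch of the masking step follows; the 15% rate comes from the original BERT paper, while the whitespace tokenizer, the sampling scheme, and the example sentence are our own simplifying assumptions, not the authors' setup.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rate=0.15, seed=0):
    """Mask roughly `rate` of the tokens, returning the masked sequence and
    the (position, original token) pairs the model must predict."""
    rng = random.Random(seed)
    n_mask = max(1, round(rate * len(tokens)))          # at least one mask
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = [(i, tokens[i]) for i in positions]       # training labels
    for i in positions:
        masked[i] = MASK
    return masked, targets

# Toy "sentence" standing in for a social-media post.
tokens = "this phone is really great and the camera is amazing".split()
masked, targets = mask_tokens(tokens)
print(masked)   # sequence with two tokens replaced by [MASK]
print(targets)  # positions and original tokens to be predicted
```

Real BERT masking is slightly richer (80% [MASK], 10% random token, 10% left unchanged), but the training signal is the same: predict the hidden tokens from the surrounding context.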

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Coban, O., Yağanoğlu, M. & Bozkurt, F. Domain Effect Investigation for Bert Models Fine-Tuned on Different Text Categorization Tasks. Arab J Sci Eng 49, 3685–3702 (2024). https://doi.org/10.1007/s13369-023-08142-8

  • DOI: https://doi.org/10.1007/s13369-023-08142-8
