
Comprehensive study of pre-trained language models: detecting humor in news headlines


The ability to automatically understand and analyze human language has attracted researchers and practitioners in the Natural Language Processing (NLP) field. Humor detection is an NLP task with applications in many areas, including marketing, politics, and news. However, the task is challenging because humor depends on context, emotion, culture, and rhythm. To address this problem, we propose BFHumor, a robust BERT-Flair-based humor detection model that detects humor in news headlines. It is an ensemble of different state-of-the-art pre-trained models that utilizes various NLP techniques. We evaluated the proposed model on public humor datasets from the SemEval-2020 workshop. The model achieved outstanding performance, with a Root Mean Squared Error (RMSE) of 0.51966 and an accuracy of 0.62291. In addition, we extensively investigated the underlying reasons behind the high accuracy of the BFHumor model in humor detection tasks. To that end, we conducted two experiments on the BERT model: one at the vocabulary level and one at the linguistic capturing level. Our investigation shows that BERT captures surface knowledge in the lower layers, syntactic knowledge in the middle layers, and semantic knowledge in the higher layers.
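The two reported metrics can be illustrated with a minimal sketch. It assumes that RMSE is computed between gold and predicted funniness scores, that accuracy is computed over discrete labels, and that ensemble members are combined by simple averaging. The abstract does not state how BFHumor combines its models, so `average_ensemble` is a hypothetical stand-in rather than the authors' method, and all numbers below are made-up illustration data.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length score sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def accuracy(labels_true, labels_pred):
    """Fraction of predicted labels that match the gold labels."""
    if len(labels_true) != len(labels_pred) or not labels_true:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(t == p for t, p in zip(labels_true, labels_pred))
    return matches / len(labels_true)


def average_ensemble(predictions_per_model):
    """Average per-example predictions across models.

    Assumption: plain averaging, used here only for illustration.
    """
    return [sum(column) / len(column) for column in zip(*predictions_per_model)]


# Hypothetical scores for four headlines (not taken from the paper).
gold_scores = [1.0, 0.5, 2.0, 1.5]
model_a = [1.2, 0.4, 1.6, 1.5]
model_b = [0.8, 0.8, 2.0, 1.1]

ensemble_scores = average_ensemble([model_a, model_b])
print("ensemble RMSE:", rmse(gold_scores, ensemble_scores))

# Hypothetical discrete labels for the accuracy metric.
print("accuracy:", accuracy([1, 2, 0, 1], [1, 2, 1, 1]))
```

Lower RMSE is better and higher accuracy is better, so the two reported figures are not directly comparable with each other.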



Data Availability

Not applicable.







Author information




Farah Shatnawi and Malak Abdullah carried out all the experiments. Malak Abdullah and Mahmoud Hammad supervised the work and validated the experiments. Malak Abdullah and Mahmoud Al-Ayyoub analyzed the results.

Corresponding author

Correspondence to Malak Abdullah.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Shatnawi, F., Abdullah, M., Hammad, M. et al. Comprehensive study of pre-trained language models: detecting humor in news headlines. Soft Comput 27, 2575–2599 (2023).



Keywords

  • Humor
  • Pre-trained models
  • BERT
  • Flair
  • BERT knowledge
  • BERT vocabulary