
Testing the Generalization of Neural Language Models for COVID-19 Misinformation Detection

  • Conference paper
Information for a Better World: Shaping the Global Future (iConference 2022)

Abstract

A drastic rise in potentially life-threatening misinformation has been a by-product of the COVID-19 pandemic. Computational support to identify false information within the massive body of data on the topic is crucial to prevent harm. Researchers proposed many methods for flagging online misinformation related to COVID-19. However, these methods predominantly target specific content types (e.g., news) or platforms (e.g., Twitter). The methods’ capabilities to generalize were largely unclear so far. We evaluate fifteen Transformer-based models on five COVID-19 misinformation datasets that include social media posts, news articles, and scientific papers to fill this gap. We show tokenizers and models tailored to COVID-19 data do not provide a significant advantage over general-purpose ones. Our study provides a realistic assessment of models for detecting COVID-19 misinformation. We expect that evaluating a broad spectrum of datasets and models will benefit future research in developing misinformation detection systems.

J. P. Wahle and N. Ashok contributed equally.


Notes

  1. https://coronavirus.jhu.edu/map.html
  2. We collectively refer to fake news, disinformation, and misinformation as false information.
  3. https://github.com/ag-gipp/iConference22_COVID_misinformation
  4. https://tinyurl.com/86cpx6u2
  5. https://tinyurl.com/9w24pc93
  6. https://tinyurl.com/86cpx6u2
  7. https://tinyurl.com/9w24pc93
  8. https://tinyurl.com/kebysw
  9. https://tinyurl.com/4xx9vdkm
  10. https://tinyurl.com/4ne9vtzu
  11. General-purpose refers to the tokenizers released with the pre-trained models.
  12. Pre-trained tokenizer provided by Hugging Face.
  13. https://pubmed.ncbi.nlm.nih.gov/
  14. https://www.biorxiv.org/
  15. https://www.medrxiv.org/
  16. https://www.who.int/
  17. https://tinyurl.com/4mryzj5k
  18. https://www.poynter.org/ifcn/
  19. https://www.semanticscholar.org/

References

  1. Alsentzer, E., et al.: Publicly Available Clinical BERT Embeddings. arXiv:1904.03323 [cs], June 2019. http://arxiv.org/abs/1904.03323

  2. Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3613–3618. Association for Computational Linguistics, Hong Kong, China (2019). 10/ggcgtm


  3. Benkler, Y., Farris, R., Roberts, H.: Network Propaganda, vol. 1. Oxford University Press, October 2018. https://doi.org/10.1093/oso/9780190923624.001.0001

  4. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017). 10/gfw9cs


  5. Cinelli, M., et al.: The COVID-19 social media infodemic. Sci. Rep. 10(1), 16598 (2020). https://doi.org/10.1038/s41598-020-73510-5


  6. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv:2003.10555 [cs], March 2020. http://arxiv.org/abs/2003.10555

  7. Cui, L., Lee, D.: CoAID: COVID-19 Healthcare Misinformation Dataset. arXiv:2006.00885 [cs], August 2020. http://arxiv.org/abs/2006.00885

  8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, May 2019. http://arxiv.org/abs/1810.04805

  9. Dror, R., Baumer, G., Shlomov, S., Reichart, R.: The hitchhiker’s guide to testing statistical significance in natural language processing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1383–1392. Association for Computational Linguistics, Melbourne, Australia, July 2018. https://doi.org/10.18653/v1/P18-1128

  10. Hale, T., et al.: A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker). Nat. Hum. Behav. 5(4), 529–538 (2021). https://doi.org/10.1038/s41562-021-01079-8


  11. He, P., Liu, X., Gao, J., Chen, W.: DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv:2006.03654 [cs], January 2021. http://arxiv.org/abs/2006.03654

  12. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339. Association for Computational Linguistics, Melbourne, Australia, July 2018. https://doi.org/10.18653/v1/P18-1031

  13. Johnson, A.E., et al.: MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016). https://doi.org/10.1038/sdata.2016.35

  14. Lee, J., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, pp. 1–7 (2019). https://doi.org/10.1093/bioinformatics/btz682

  15. Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880. Association for Computational Linguistics, Online, July 2020. https://doi.org/10.18653/v1/2020.acl-main.703

  16. Liu, Y., et al.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs], July 2019. http://arxiv.org/abs/1907.11692

  17. Memon, S.A., Carley, K.M.: Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset. arXiv:2008.00791 [cs], September 2020. http://arxiv.org/abs/2008.00791

  18. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546 [cs, stat], October 2013. http://arxiv.org/abs/1310.4546

  19. Mutlu, E.C., et al.: A stance data set on polarized conversations on Twitter about the efficacy of hydroxychloroquine as a treatment for COVID-19. Data in Brief 33, 106401 (2020). https://doi.org/10.1016/j.dib.2020.106401

  20. Müller, M., Salathé, M., Kummervold, P.E.: COVID-twitter-bert: a natural language processing model to analyse COVID-19 content on twitter. arXiv:2005.07503 [cs], May 2020. http://arxiv.org/abs/2005.07503

  21. Nguyen, D.Q., Vu, T., Tuan Nguyen, A.: BERTweet: a pre-trained language model for English tweets. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 9–14. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.emnlp-demos.2

  22. Ostendorff, M., Ruas, T., Blume, T., Gipp, B., Rehm, G.: Aspect-based document similarity for research papers. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 6194–6206. International Committee on Computational Linguistics, Barcelona, Spain (Online) (2020). https://doi.org/10.18653/v1/2020.coling-main.545

  23. Pennycook, G., McPhetres, J., Zhang, Y., Lu, J.G., Rand, D.G.: Fighting COVID-19 misinformation on social media: experimental evidence for a scalable accuracy-nudge intervention. Psychol. Sci. 31(7), 770–780 (2020). https://doi.org/10.1177/0956797620939054


  24. Press, O., Smith, N.A., Lewis, M.: Shortformer: better language modeling using shorter inputs. arXiv:2012.15832 [cs], December 2020. http://arxiv.org/abs/2012.15832

  25. Ruas, T., Ferreira, C.H.P., Grosky, W., de França, F.O., de Medeiros, D.M.R.: Enhanced word embeddings using multi-semantic representation through lexical chains. Inf. Sci. 532, 16–32 (2020). https://doi.org/10.1016/j.ins.2020.04.048


  26. Ruas, T., Grosky, W., Aizawa, A.: Multi-sense embeddings through a word sense disambiguation process. Expert Syst. Appl. 136, 288–303 (2019). https://doi.org/10.1016/j.eswa.2019.06.026


  27. Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H.: Fake news detection on social media: a data mining perspective. ACM SIGKDD Explor. Newslett. 19(1), 22–36 (2017). https://doi.org/10.1145/3137597.3137600


  28. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010. NIPS 2017, Curran Associates Inc., Red Hook, NY, USA (2017). https://arxiv.org/abs/1706.03762

  29. Wahle, J.P., Ruas, T., Foltynek, T., Meuschke, N., Gipp, B.: Identifying machine-paraphrased plagiarism. In: Proceedings of the iConference, February 2022


  30. Wahle, J.P., Ruas, T., Meuschke, N., Gipp, B.: Are neural language models good plagiarists? a benchmark for neural paraphrase detection. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL). IEEE, Washington, USA, September 2021


  31. Wang, A., et al.: SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’ Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 3266–3280. Curran Associates, Inc. (2019). https://arxiv.org/abs/1905.00537

  32. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv:1804.07461 [cs], February 2019. https://arxiv.org/abs/1804.07461

  33. Wang, L.L., et al.: CORD-19: The COVID-19 Open Research Dataset. arXiv:2004.10706 [cs], July 2020. http://arxiv.org/abs/2004.10706

  34. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. arXiv:1906.08237 [cs], June 2019. https://arxiv.org/abs/1906.08237

  35. Zarocostas, J.: How to fight an infodemic. Lancet 395(10225), 676 (2020). https://doi.org/10.1016/S0140-6736(20)30461-X


  36. Zhou, X., Mulay, A., Ferrara, E., Zafarani, R.: ReCOVery: A multimodal repository for COVID-19 news credibility research, pp. 3205–3212. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3340531.3412880


Author information

Correspondence to Jan Philip Wahle.


A Appendix


1.1 A.1 Dataset Details

COVID-19 Open Research Dataset (CORD-19) [33] is the largest open-source dataset on COVID-19 and coronavirus-related research (e.g., SARS, MERS). CORD-19 comprises more than 280K scholarly articles from PubMed (Footnote 13), bioRxiv (Footnote 14), medRxiv (Footnote 15), and other resources maintained by the WHO (Footnote 16). We use this dataset to extend the general pre-training of selected neural language models (cf. Sect. 3) to COVID-specific vocabulary and features.

Covid-19 heAlthcare mIsinformation Dataset (CoAID) [7] focuses on healthcare misinformation, including fake news on websites, user engagement, and social media. CoAID is composed of 5 216 news articles, 296 752 related user engagements, and 958 posts about COVID-19, which are broadly categorized under the labels true and false.

Twitter Stance Dataset (COVID-CQ) [19] is a dataset of user-generated Twitter content in the context of COVID-19. More than 14K tweets were processed and annotated regarding the use of Chloroquine and Hydroxychloroquine as a valid treatment or prevention against the coronavirus. COVID-CQ is composed of 14 374 tweets from 11 552 unique users labeled as neutral, against, or favor.

ReCOVery [36] explores the low credibility of information on COVID-19 (e.g., bleach can prevent COVID-19) by allowing a multimodal investigation of news and their spread on social media. The dataset is composed of 2 029 news articles on the coronavirus and 140 820 related tweets labeled as reliable or unreliable.

CMU-MisCov19 [17] is a Twitter dataset created by collecting posts from unknowingly misinformed users, users who actively spread misinformation, and users who disseminate facts or call out misinformation. CMU-MisCov19 is composed of 4 573 annotated tweets divided into 17 classes (e.g., conspiracy, fake cure, news, sarcasm). The high number of classes and their imbalanced distribution make CMU-MisCov19 a challenging dataset.

COVID19FN (Footnote 17) comprises approximately 2 800 news articles, extracted mainly from Poynter (Footnote 18), categorized as either real or fake.
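Since the datasets above use different label vocabularies, comparing models across them requires a shared target scheme. The following is a minimal, hypothetical sketch (the dataset keys and mapping are illustrative, not the authors' code) of projecting the binary-labeled datasets onto one scheme:

```python
# Hypothetical label mapping for cross-dataset evaluation: each binary
# dataset names its classes differently (true/false, reliable/unreliable,
# real/fake), so we project them onto a shared integer scheme.
# COVID-CQ and CMU-MisCov19 are multi-class and would need their own maps.
LABEL_MAP = {
    "coaid":     {"true": 0, "false": 1},
    "recovery":  {"reliable": 0, "unreliable": 1},
    "covid19fn": {"real": 0, "fake": 1},
}

def normalize_label(dataset: str, label: str) -> int:
    """Return 0 for trustworthy content and 1 for misinformation."""
    return LABEL_MAP[dataset.lower()][label.lower()]
```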

1.2 A.2 Model Details

General-Purpose Baselines. BERT [8] mainly captures general language characteristics using a bidirectional Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks. RoBERTa [16] improves BERT with additional data, compute budgets, and hyperparameter optimizations. RoBERTa also drops the NSP as it contributes little to the model representation. BART [15] optimizes an auto-regressive forward-product and auto-encoding MLM objective simultaneously. DeBERTa [11] improves the attention mechanism using a disentanglement of content and position.

Intermediate Pre-Trained. SciBERT [2] optimizes the MLM objective on 1.14M randomly selected papers from Semantic Scholar (Footnote 19). BioClinicalBERT [1] specializes in the 2M notes of the MIMIC-III database [13], a collection of de-identified clinical data. BERTweet [21] optimizes BERT on 850M tweets, each containing between 10 and 64 tokens.

COVID-19 Intermediate Pre-Trained. COVID-Twitter-BERT (CT-BERT) [20] uses a corpus of 160M tweets for domain-specific pre-training and evaluates the resulting model's capabilities in sentiment analysis, e.g., for tweets about vaccines. BioClinicalBERT [1] fine-tunes BioBERT [14] on clinical narratives, aiming to incorporate linguistic characteristics from both the clinical and biomedical domains.

Cui et al. [7] propose CoAID and investigate the misinformation detection task by comparing traditional machine learning (e.g., logistic regression, random forest) with deep learning techniques (e.g., GRU). In a similar setup, Zhou et al. [36] compare traditional statistical learners, such as SVMs, with neural networks (e.g., CNN) to classify news as credible or not. In both studies, deep learning architectures achieve the best results.

1.3 A.3 Evaluation Details

Pre-Training. We use the data from the abstracts of the CORD-19 dataset for pre-training. When pre-processing the CORD-19 abstracts, we consider only alphanumerical characters. We use a sequence length of 128 tokens, which reduces training time while remaining competitive with longer sequence lengths during fine-tuning [24]. We mask tokens randomly with a probability of .15, a common configuration for Transformers [8, 11], and perform MLM training with the following remaining parameters: a batch size of 16 for base models and 8 for large models, the Adam optimizer (\(\alpha = 2e-5\), \(\beta _1 = .9\), \(\beta _2 = .999\), \(\epsilon = 1e-8\)), and a maximum of five epochs. All experiments were performed on a single NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory.
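As a rough illustration of the masking step (not the authors' implementation, which operates on subword tokens and follows the full BERT recipe of mask/random/keep replacements), random masking at p = .15 can be sketched as:

```python
import random

MASK_PROB = 0.15      # masking probability used in the paper
MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, prob=MASK_PROB, seed=None):
    """Randomly hide tokens for the MLM objective.

    Returns the masked sequence and a parallel label list that holds the
    original token at masked positions and None elsewhere, so the loss
    is computed only where a token was hidden.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)    # model must reconstruct this token
        else:
            masked.append(tok)
            labels.append(None)   # position ignored by the loss
    return masked, labels
```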

Fine-Tuning. The classification model applies a randomly initialized fully-connected layer with dropout (\(p=.1\)) to the aggregate representation of the underlying Transformer (e.g., [CLS] for BERT). It learns the annotated target classes with a cross-entropy loss over five epochs, using a sequence length of 200 tokens. We use the same optimizer configuration as in pre-training.
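The head described above can be sketched in plain Python as follows (a simplified stand-in for the actual framework layer; the function names and weight shapes are illustrative):

```python
import math
import random

def classification_head(pooled, weights, bias, dropout_p=0.1, rng=None):
    """Dropout followed by a fully-connected layer over the pooled
    representation (e.g., the [CLS] vector)."""
    rng = rng or random.Random()
    # Inverted dropout: zero units with probability p and scale the
    # survivors by 1/(1-p) so the expected activation is unchanged.
    kept = [0.0 if rng.random() < dropout_p else x / (1.0 - dropout_p)
            for x in pooled]
    # One output logit per target class.
    return [sum(w * x for w, x in zip(row, kept)) + b
            for row, b in zip(weights, bias)]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class under a softmax,
    computed with the usual max-shift for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]
```

At inference time, dropout is disabled (dropout_p=0) and the argmax over the logits gives the predicted class.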


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wahle, J.P., Ashok, N., Ruas, T., Meuschke, N., Ghosal, T., Gipp, B. (2022). Testing the Generalization of Neural Language Models for COVID-19 Misinformation Detection. In: Smits, M. (ed.) Information for a Better World: Shaping the Global Future. iConference 2022. Lecture Notes in Computer Science, vol. 13192. Springer, Cham. https://doi.org/10.1007/978-3-030-96957-8_33

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-96957-8_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-96956-1

  • Online ISBN: 978-3-030-96957-8

  • eBook Packages: Computer Science; Computer Science (R0)
