Machine Learning Reveals Adaptive COVID-19 Narratives in Online Anti-Vaccination Network

Part of the Springer Proceedings in Complexity book series (SPCOM)


The COVID-19 pandemic sparked an online “infodemic” of potentially dangerous misinformation. We use machine learning to quantify COVID-19 content from opponents of establishment health guidance, in particular vaccination. We quantify this content in two different ways: number of topics and evolution of keywords. We find that, even in the early stages of the pandemic, the anti-vaccination community had the infrastructure to more effectively garner support than their pro-vaccination counterparts by exhibiting a broader array of discussion topics. This provided an advantage in terms of attracting new users seeking COVID-19 guidance online. We also find that our machine learning framework can pick up on the adaptive nature of discussions within the anti-vaccination community, tracking distrust of authorities, opposition to lockdown orders, and an interest in early vaccine trials. Our approach is scalable and hence tackles the urgent problem facing social media platforms of having to analyze huge volumes of online health misinformation. With vaccine booster shots being approved and vaccination rates stagnating, such an automated approach is key in understanding how to combat the misinformation that slows the eradication of the pandemic.


  • Topic modeling
  • Anti-vaccination network
  • Machine learning

  • DOI: 10.1007/978-3-030-96188-6_12
  • Chapter length: 12 pages
  • ISBN: 978-3-030-96188-6

  1. Community standards enforcement report, second quarter 2021. About Facebook, 18 August 2021. Accessed 15 Sep 2021

  2. Sear, R.F., et al.: Quantifying COVID-19 content in the online health opinion war using machine learning. IEEE Access 8, 91886–91893 (2020).

  3. Larson, H.J.: Blocking information on COVID-19 can fuel the spread of misinformation. Nature 580(7803), 306–306 (2020).

  4. Kata, A.: A postmodern Pandora’s box: anti-vaccination misinformation on the internet. Vaccine 28(7), 1709–1716 (2010).

  5. Coronavirus: scientists brand 5G claims ‘complete rubbish,’ BBC News, 15 April 2020. Accessed 03 Sep 2021

  6. Mythbusters. Accessed 03 Sep 2021

  7. A man thought aquarium cleaner with the same name as the anti-viral drug chloroquine would prevent coronavirus. It killed him. Washington Post. Accessed 16 Sep 2021

  8. Frenkel, S., Alba, D., Zhong, R.: Surge of virus misinformation stumps Facebook and Twitter. The New York Times. 08 Mar 2020. Accessed 03 Sep 2021

  9. Iyengar, R.: The coronavirus is stretching Facebook to its limits. CNN. Accessed 03 Sep 2021

  10. Broniatowski, D.A., et al.: Weaponized health communication: twitter bots and Russian trolls amplify the vaccine debate. Am. J. Public Health 108(10), 1378–1384 (2018).

  11. Lama, Y., Chen, T., Dredze, M., Jamison, A., Quinn, S.C., Broniatowski, D.A.: Discordance between human papillomavirus Twitter images and disparities in human papillomavirus risk and disease in the United States: mixed-methods analysis. J. Med. Internet Res. 20(9), e10244 (2018).

  12. Ammari, T., Schoenebeck, S.: ‘Thanks for your interest in our Facebook group, but it’s only for dads’: social roles of stay-at-home dads. In: Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work and Social Computing, New York, pp. 1363–1375, February 2016.

  13. Johnson, N.F., et al.: Hidden resilience and adaptive dynamics of the global online hate ecology. Nature 573(7773), 261–265 (2019).

  14. Johnson, N.F., et al.: New online ecology of adversarial aggregates: ISIS and beyond. Science, June 2016. Accessed 03 Sep 2021

  15. Facebook. Accessed 03 Sep 2021

  16. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation, 30 (2003)

  17. Blei, D.M., Lafferty, J.D.: Dynamic topic models. In: Proceedings of the 23rd International Conference on Machine Learning - ICML 2006, Pittsburgh, Pennsylvania, pp. 113–120 (2006).

  18. Syed, S., Spruit, M.: Full-text or abstract? examining topic coherence scores using latent dirichlet allocation. In: 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 165–174, October 2017.

  19. CDC newsroom, CDC, 01 January 2016. Accessed 03 Sep 2021

  20. Johnson, N.F., et al.: The online competition between pro-and anti-vaccination views. Nature 582(7811), 230–233 (2020).


CrowdTangle data are made available through The George Washington University. We are grateful for funding for this research from the U.S. Air Force Office of Scientific Research under award numbers FA9550-20-1-0382 and FA9550-20-1-0383.

Corresponding author

Correspondence to Richard Sear.

6 Appendix

As mentioned in the main text, the methodology starts with a seed of manually identified Facebook Pages discussing either vaccines, public policies about vaccination, or the pro-vs-anti vaccination debate. Then their connections to other fan pages are indexed. At each step, new findings are vetted through a combination of human coding and computer assisted filters. This snowball process is continued, noting that new links can often lead back to members already in the list and hence some form of closure can in principle be achieved. This process leads to a set containing many hundreds of pages for both the anti-vaccination and pro-vaccination communities. Before training the LDA models, several steps are employed to clean the content of these pages in a similar way to other LDA analyses in the literature:

  1. Mentions of URL shorteners are removed, such as “”. These are fragments output by Facebook’s CrowdTangle API.

  2. Many of the posts link to external websites. The fact that these specific websites were mentioned could itself be an interesting component of the COVID-19 conversation. Hence, instead of removing them completely, the pieces “.gov”, “.com”, and “.org” are replaced with “__gov”, “__com”, and “__org”, respectively. This operation effectively concatenates domains into a form that will not be filtered out by the later preprocessing steps.

  3. The posts are then run through Gensim’s simple_preprocess function, which tokenizes the post on spaces and removes tokens that are only 1 or 2 characters long. This step also removes numeric and punctuation characters.

  4. Tokens that are in Gensim’s list of stopwords are removed. For example, “the” is not a good indication of a topic.

  5. Tokens are lemmatized using the WordNetLemmatizer from the Natural Language Toolkit (NLTK), which converts words to their singular form and/or present tense.

  6. Tokens are stemmed using the SnowballStemmer from NLTK, which removes affixes on words.

  7. Any fragments of URLs (other than the domain) that are left over after stemming, such as “http” and “www”, are removed.
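The cleaning steps above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code. The shortener hosts in SHORTENER_RE, the small STOPWORDS set, and the function name clean_post are all assumptions made for illustration. The regular-expression tokenizer only approximates Gensim's simple_preprocess, with min_len=3 chosen to match the description in step 3. The lemmatize and stem hooks default to the identity; in practice one could pass NLTK's WordNetLemmatizer().lemmatize and SnowballStemmer("english").stem.

```python
import re
from typing import Callable, List

# Hypothetical shortener hosts; the source does not list the actual ones.
SHORTENER_RE = re.compile(
    r"(?:https?://)?(?:bit\.ly|ow\.ly|tinyurl\.com)/\S*", re.IGNORECASE
)
# Step 2: ".gov" -> "__gov" etc., so the domain survives tokenization.
DOMAIN_RE = re.compile(r"\.(gov|com|org)\b", re.IGNORECASE)
# Runs of word characters that are not digits (underscores are kept).
TOKEN_RE = re.compile(r"[^\W\d]+")
# Tiny stand-in for Gensim's much longer stopword list (assumption).
STOPWORDS = frozenset({"the", "and", "for", "that", "with", "this", "are", "was"})
URL_FRAGMENTS = frozenset({"http", "https", "www"})


def identity(token: str) -> str:
    """Default hook: leave the token unchanged."""
    return token


def clean_post(
    text: str,
    lemmatize: Callable[[str], str] = identity,
    stem: Callable[[str], str] = identity,
    min_len: int = 3,
) -> List[str]:
    """Apply steps 1-7 to one post and return its cleaned tokens."""
    text = SHORTENER_RE.sub(" ", text)                      # step 1
    text = DOMAIN_RE.sub(r"__\1", text)                     # step 2
    tokens = [t for t in TOKEN_RE.findall(text.lower())     # step 3
              if len(t) >= min_len]
    tokens = [t for t in tokens if t not in STOPWORDS]      # step 4
    tokens = [stem(lemmatize(t)) for t in tokens]           # steps 5 and 6
    return [t for t in tokens if t not in URL_FRAGMENTS]    # step 7


example = "The CDC says vaccines are safe: https://www.cdc.gov/vaccines and bit.ly/abc123 in 2020"
print(clean_post(example))
```

Filtering URL fragments last matters because stemming can change them: a stemmer that strips a trailing "s" turns "https" into "http", which the final filter still catches.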

Steps 5 and 6 help ensure that words are compared fairly during training, so that a word that strongly indicates a topic does not lose its signal just because it appears in many different forms. These steps rely on words existing in NLTK’s pretrained vocabulary; any word not in this vocabulary is left unchanged. After this preprocessing, we train the LDA models on the cleaned data. We refer to [2] for a complete discussion of the standard LDA models employed. Eight dynamic LDA models were trained, with the “number of topics” parameter ranging from 3 to 10 (inclusive) and each time frame consisting of the data gathered from the anti-vaccination groups in a 1-week period. While the amount of data available in each time frame is not uniform, we believe there is sufficient data in each time frame for the model to make useful inferences.
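As a rough illustration of the 1-week time frames, the sketch below groups dated posts into consecutive 7-day windows and counts the documents in each. The function name weekly_time_slices, the choice to start the first window at the earliest post, and the (date, tokens) input shape are assumptions, not details taken from the experiments. Per-period counts of this kind are what a dynamic topic model implementation, for instance Gensim's LdaSeqModel with its time_slice argument, typically expects alongside the time-ordered corpus.

```python
from datetime import date
from typing import List, Sequence, Tuple

Post = Tuple[date, List[str]]  # (posting date, cleaned tokens)


def weekly_time_slices(posts: Sequence[Post]) -> Tuple[List[List[str]], List[int]]:
    """Return documents in time order and the number of documents per week.

    Weeks are consecutive 7-day windows starting at the earliest post
    (an assumption; calendar weeks would work equally well).
    """
    ordered = sorted(posts, key=lambda p: p[0])
    if not ordered:
        return [], []
    start = ordered[0][0]
    n_weeks = (ordered[-1][0] - start).days // 7 + 1
    counts = [0] * n_weeks
    for day, _ in ordered:
        counts[(day - start).days // 7] += 1
    return [tokens for _, tokens in ordered], counts


# One model per topic count, 3 through 10 inclusive: eight models in total.
TOPIC_COUNTS = list(range(3, 11))

posts = [
    (date(2020, 3, 8), ["lockdown", "order"]),
    (date(2020, 3, 1), ["vaccine", "trial"]),
    (date(2020, 3, 20), ["cdc__gov", "guidance"]),
    (date(2020, 3, 5), ["distrust", "authority"]),
]
docs, time_slice = weekly_time_slices(posts)
print(time_slice, TOPIC_COUNTS)
```

Note that a week with no posts yields a zero count here; whether a given model implementation accepts empty periods would need to be checked before training.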

The code used to run our experiments is available and documented here: It is meant as a framework that can be used to run similar experiments on any text dataset.
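Separately from that code, the snowball procedure described at the start of this appendix can be pictured as a breadth-first expansion that stops once no unvetted pages remain. The sketch below is only an illustration under stated assumptions: get_links and is_relevant are hypothetical callables standing in for the link indexing and for the combined human coding and computer-assisted filters, and the toy graph is invented for the example.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Set


def snowball(
    seeds: Iterable[Hashable],
    get_links: Callable[[Hashable], Iterable[Hashable]],
    is_relevant: Callable[[Hashable], bool],
) -> Set[Hashable]:
    """Expand a seed set by following links, vetting each new page once."""
    seed_list = list(seeds)
    accepted = set(seed_list)   # seeds are assumed to be vetted already
    rejected: Set[Hashable] = set()
    frontier = deque(seed_list)
    while frontier:
        page = frontier.popleft()
        for linked in get_links(page):
            if linked in accepted or linked in rejected:
                continue        # link leads back to a page already seen
            if is_relevant(linked):
                accepted.add(linked)
                frontier.append(linked)
            else:
                rejected.add(linked)
    return accepted             # closure: no unvetted links remain


# Toy link graph (invented): X fails vetting, so Y is never reached.
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["X"], "D": ["B"], "X": ["Y"]}
found = snowball(["A"], lambda p: graph.get(p, []), lambda p: p != "X")
print(sorted(found))
```

Links that lead back to pages already in the list are skipped, which is what allows the process to terminate.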

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Sear, R., Leahy, R., Restrepo, N.J., Lupu, Y., Johnson, N. (2022). Machine Learning Reveals Adaptive COVID-19 Narratives in Online Anti-Vaccination Network. In: Yang, Z., von Briesen, E. (eds) Proceedings of the 2021 Conference of The Computational Social Science Society of the Americas. CSSSA 2021. Springer Proceedings in Complexity. Springer, Cham.
