As mentioned in the main text, the methodology starts with a seed of manually identified Facebook pages discussing vaccines, public policies about vaccination, or the pro- versus anti-vaccination debate. Their connections to other fan pages are then indexed. At each step, new findings are vetted through a combination of human coding and computer-assisted filters. This snowball process continues, noting that new links often lead back to pages already in the list, so some form of closure can in principle be achieved. The process yields a set containing many hundreds of pages for both the anti-vaccination and pro-vaccination communities. Before training the LDA models, several steps are employed to clean the content of these pages, in a manner similar to other LDA analyses in the literature:
Mentions of URL shorteners, such as “bit.ly”, are removed. These are fragments output by Facebook’s CrowdTangle API.
Many of the posts link to external websites. The fact that these specific websites were mentioned could itself be an interesting component of the COVID-19 conversation. Hence, instead of removing them completely, the pieces “.gov”, “.com”, and “.org” were replaced with “__gov”, “__com”, and “__org”, respectively. This operation effectively converts each domain into a single token that will not be filtered out by the later preprocessing steps.
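This domain-preserving step can be sketched as a simple string substitution; the function name and the exact suffix-to-token mapping below are assumptions based on the description above:

```python
# Hypothetical sketch of the domain-preserving step. The mapping follows
# the text; the real pipeline may handle additional top-level domains.
TLD_REPLACEMENTS = {".gov": "__gov", ".com": "__com", ".org": "__org"}

def preserve_domains(text: str) -> str:
    """Replace TLD dots so each domain survives tokenization as one token."""
    for suffix, replacement in TLD_REPLACEMENTS.items():
        text = text.replace(suffix, replacement)
    return text

print(preserve_domains("See cdc.gov and example.com for details."))
# → "See cdc__gov and example__com for details."
```

Because the underscore is retained by typical alphabetic tokenizers, “cdc__gov” is later treated as an ordinary word rather than being split at the period and discarded.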
The posts are then run through Gensim’s simple_preprocess function, which tokenizes the post on spaces and removes tokens that are only 1 or 2 characters long. This step also removes numeric and punctuation characters.
Tokens that are in Gensim’s list of stopwords are removed. For example, “the” is not a good indicator of any topic.
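The tokenization and stopword-removal steps can be sketched in plain Python as a simplified stand-in for Gensim’s simple_preprocess and STOPWORDS (the real implementation differs in details such as its default minimum token length and its much larger stopword set):

```python
import re

# Small illustrative stopword list; Gensim's STOPWORDS set is far larger.
STOPWORDS = {"the", "and", "for", "are", "about"}

def simple_preprocess_sketch(text: str, min_len: int = 3) -> list[str]:
    """Lowercase the text, keep alphabetic tokens of at least min_len
    characters, and drop stopwords -- a stand-in for the Gensim pipeline."""
    tokens = re.findall(r"[a-z_]+", text.lower())  # drops digits, punctuation
    return [t for t in tokens if len(t) >= min_len and t not in STOPWORDS]

print(simple_preprocess_sketch("The 2 vaccines are safe, see cdc__gov!"))
# → ['vaccines', 'safe', 'see', 'cdc__gov']
```

Note how the underscore-joined domain token from the earlier step passes through intact, while numerals, punctuation, short tokens, and stopwords are filtered out.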
Tokens are lemmatized using the WordNetLemmatizer from the Natural Language Toolkit (NLTK), which converts words to their singular form and/or present tense.
Tokens are stemmed using the SnowballStemmer from NLTK, which strips suffixes from words.
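The effect of the stemming step can be illustrated directly with NLTK’s SnowballStemmer (the lemmatizer configuration is not specified here, so only stemming is shown; the example words are illustrative):

```python
from nltk.stem.snowball import SnowballStemmer

# Snowball ("Porter2") stemmer for English, as used in the pipeline above.
stemmer = SnowballStemmer("english")

for word in ["vaccines", "vaccination", "running"]:
    # e.g. "running" -> "run"; related word forms collapse to a shared stem
    print(word, "->", stemmer.stem(word))
```

Collapsing inflected forms to a shared stem is what allows LDA to count, for example, different forms of “vaccine” as evidence for the same topic.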
Any remaining fragments of URLs (other than domain) that are left over after stemming, such as “http” and “www”, are removed.
Steps 5 and 6 help ensure that words are compared fairly during training: if a particular word is a strong indicator of a topic, its signal is not lost simply because it appears in many different forms. These steps rely on words existing in NLTK’s pretrained vocabulary; any word not in this vocabulary is left unchanged. After this preprocessing, we train the LDA models on the cleaned data. We refer to  for a complete discussion of the standard LDA models employed. Eight dynamic LDA models were trained, with the “number of topics” parameter ranging from 3 to 10 (inclusive) and each time frame consisting of one week of data gathered from the anti-vaccination groups. While the amount of data available in each time frame is not uniform, we believe there is sufficient data in each time frame for the model to make useful inferences.
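The weekly binning that feeds a dynamic LDA model can be sketched with the standard library; for instance, Gensim’s LdaSeqModel expects a time_slice list giving the number of documents in each consecutive period. The post dates below are hypothetical stand-ins for the cleaned dataset:

```python
from datetime import date
from collections import Counter

# Hypothetical (post_date, tokens) records standing in for cleaned posts.
posts = [
    (date(2020, 3, 2), ["vaccin", "news"]),
    (date(2020, 3, 4), ["covid", "vaccin"]),
    (date(2020, 3, 11), ["mandat"]),
]

# Group posts into 1-week bins keyed by (ISO year, ISO week number), since a
# dynamic LDA model consumes one ordered slice of documents per time frame.
week_counts = Counter(d.isocalendar()[:2] for d, _ in posts)
time_slice = [week_counts[week] for week in sorted(week_counts)]
print(time_slice)  # → [2, 1]
```

A list like this, together with the corresponding date-ordered corpus, would then be passed once per choice of the number-of-topics parameter to train the set of models described above.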
The code used to run our experiments is available and documented here: https://github.com/gwdonlab/topic-modeling. It is meant as a framework that can be used to run similar experiments on any text dataset.