Introduction

Pre-Exposure Prophylaxis (PrEP) is a daily oral or injectable medication for groups at highest risk of contracting the Human Immunodeficiency Virus (HIV), including men who have sex with men (MSM), transgender women, and people who inject drugs [1]. When taken as prescribed, PrEP has been shown to reduce HIV-1 infection by 75% or more when coupled with safe-sex and/or clean needle practices [2, 3]. To date, about 1 million adults globally have begun using PrEP, with representative adoption rates across gender, race, and ethnicity [4].

Prior to PrEP’s advent, HIV awareness, prevention, and outreach materialized as public health education campaigns on HIV mitigation strategies [2]. Though the success of these programs varied, PrEP’s Food and Drug Administration (FDA) approval during the social media era also introduced novel mediums to rapidly disseminate information about HIV and HIV prophylaxis as prevention [5]. Indeed, digital and e-health (broad categories of multidisciplinary public health science at the intersection of technology and healthcare) have played a pivotal role in creating online and mobile knowledge and awareness campaigns that rapidly spread information about PrEP and its benefits [1, 5, 6, 7]. Many of these campaigns are tailored to appeal to groups at higher risk of HIV exposure, including younger audiences (ages 18-29) and racially diverse MSM [5, 8]. The success of social media campaigns led to an 880% increase in PrEP adoption among key groups since 2012 [9].

A natural consequence of online, digital health, e-health, and mHealth interventions and marketing campaigns is a paper trail of online discourse among social media users sharing and/or discussing PrEP in diverse contexts. Beyond obvious analyses of intervention effectiveness online, this trail offers the opportunity to mine these data for deeper, niche discussions about PrEP that may not be apparent in surveys or data derived from interventions themselves, including potential misinformation about PrEP usage and safety and ongoing lawsuits against Gilead Sciences, the manufacturer of the two FDA-approved PrEP medications (Truvada® and Descovy®), among other potential determinants and barriers to PrEP uptake [10].

Social Media Mining and Public health

Beyond PrEP, social media’s inescapable role in the public lexicon has shifted approaches to studying health behavior. Over two-thirds of the US population use at least one social media platform daily, resulting in billions of unique data points that are spontaneous, open-source, diachronic, and open-ended [11]. A multitude of studies have conclusively demonstrated that markers of health behavior can be meaningfully extracted from social media data collected en-masse. This includes extracting mental health markers, such as cognitive distortions [12], subjective well-being indicators from individual social media timelines [13], constructing ego-networks from friendship lists [14], and even simply identifying common topics or themes within millions of tweets (i.e., posts on Twitter), or postings on other social media platforms [15].

To bridge the fields of computational informatics and public health, specific calls for interdisciplinary collaborations between these fields have been made. Valdez et al. (2021) highlighted several strategies for bridging computational informatics and public health with Natural Language Processing (NLP) methods. Valdez, Patterson, and Prochnow (2021) have also called for the harmonious application of public health and social network theory to lend context and/or qualitative prediction to social media studies. Collectively, this body of work argues that computational informatics methods coupled with public health frameworks can yield rich and nuanced findings about a given event, including tracking online discourse amid medical innovations and ascertaining the belief systems surrounding them.

Social media mining and PrEP: an interdisciplinary analysis

PrEP’s evolution from a medical novelty into a life-saving necessity and LGBTQIA+ cultural phenomenon offers a particularly impactful opportunity to study information dissemination and synthesis from a joint public health/computer science perspective. Because PrEP is a semi-novel yet revolutionary medication poised to continue reshaping HIV mitigation, insights from such an interdisciplinary analysis can identify salient discussion points about PrEP, gaps in PrEP knowledge, and points of intervention from a policy perspective. Therefore, the purpose of this study is to explore online discourse about PrEP using an interdisciplinary public health and computational informatics approach. Our study is guided by two research questions:

  1. Can we meaningfully consolidate PrEP related tweets into emerging themes or ideas?

  2. How can interdisciplinary frameworks add additional nuance to online conversations about PrEP and other Public health topics more broadly?

Insights from this study will contribute to reshaping our approach to mining social media data for medical novelties. By leveraging tools and strategies from multiple fields, in this case public health and computational informatics, we will ultimately glean a deeper understanding of how medical novelties are meaningfully communicated in online spaces, in both positive and negative contexts.

Methods

Data

Data germane to this study were collected from tweets posted between June 1, 2018 and May 31, 2021. All data were obtained via a data repository that continuously queries Twitter’s Application Programming Interface (API). Twitter’s API allows developers to query and archive an estimated 1% of total daily tweet volume for a given search term. As these tweets are only available in real-time, this 1% sample constitutes the maximum data available for analysis. However, this rate of collection remains the standard for data derived from Twitter. For our study, specifically, we collected a series of tweets containing key words relating to Pre-Exposure Prophylaxis (PrEP), PrEP usage, and associated PrEP medications: Truvada, Descovy, #PrEP, “pre-exposure prophylaxis”, #truvada, #descovy, #truvadaprep, #descovyprep, #truvadaforprep, #descovyforprep. We filtered our data to remove duplicate and non-English tweets. Our total collection of tweets (i.e., a corpus) was a random sample of N = 4,020 tweets, deemed to be representative of PrEP-related discourse yet analyzable by our domain experts. All data were saved into a single repository where they were scrubbed of identifying information. Our use of these data conformed to the Institutional Review Board (IRB) standards for data security and privacy for secondary data analysis.
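The filtering step described above can be illustrated with a minimal sketch. It assumes each tweet is a dictionary with hypothetical `text` and `lang` fields, and that duplicates are exact matches after case and whitespace normalization; neither the repository schema nor the deduplication rule is specified in this paper, so both are illustrative assumptions.

```python
import random
import re

# Search terms listed in the Data section (matched case-insensitively).
SEARCH_TERMS = [
    "truvada", "descovy", "#prep", "pre-exposure prophylaxis",
    "#truvada", "#descovy", "#truvadaprep", "#descovyprep",
    "#truvadaforprep", "#descovyforprep",
]


def matches_search_terms(text):
    """Return True if the tweet text contains any search term.

    Plain substring matching is used for brevity, so "#prep" would also
    match "#preparing"; a stricter filter would match on token boundaries.
    """
    lowered = text.lower()
    return any(term in lowered for term in SEARCH_TERMS)


def normalize(text):
    """Collapse whitespace and case so near-identical copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_tweets(tweets, sample_size=None, seed=0):
    """Keep English, on-topic, non-duplicate tweets; optionally sample.

    `tweets` is an iterable of dicts with hypothetical keys "text" and "lang".
    """
    seen = set()
    kept = []
    for tweet in tweets:
        if tweet.get("lang") != "en":
            continue  # drop non-English tweets
        if not matches_search_terms(tweet["text"]):
            continue  # drop tweets without a PrEP-related term
        key = normalize(tweet["text"])
        if key in seen:
            continue  # drop duplicates
        seen.add(key)
        kept.append(tweet)
    if sample_size is not None and sample_size < len(kept):
        # Seeded random sample so the draw is reproducible.
        kept = random.Random(seed).sample(kept, sample_size)
    return kept
```

Calling `filter_tweets(tweets, sample_size=4020)` would mirror the sample size reported here, although the sampling procedure actually used may differ from this sketch.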

Analyses

We leveraged three broad classes of computational informatics methods and algorithms to analyze our data: (1) Sentence Bidirectional Encoder Representations from Transformers (S-BERT); (2) Principal Component Analysis (PCA) with Uniform Manifold Approximation and Projection (UMAP); and (3) K-means Clustering. These methods have been extensively used to analyze social media data (see Karisani & Karisani (2021) for an example of S-BERT and social media mining). These tools have also been applied in public health contexts [19].

Gauging tweet similarity with Sentence Bidirectional Encoder Representations from Transformers (S-BERT). S-BERT is an extension of the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT). The BERT approach uses neural networks to detect and map patterns in large-scale text data [20, 21]. BERT is trained with large-scale text data from which it learns numerical representations of text semantics by analyzing matching sequences of words [22]. The resulting representations then allow quantitative comparisons of the similarity of two texts.

Likewise, as shown in Fig. 1, S-BERT generates a numerical vector representation of each tweet (treated as a “sentence”) that represents its semantics. This vector can be numerically compared to the similarly generated S-BERT vectors of other tweets using standard distance or similarity metrics, e.g., cosine similarity. The latter is commonly used to gauge the degree of alignment between two vectors. In general it ranges from −1 (vectors pointing in opposite directions) to one (collinearity, i.e., similarity), with zero representing orthogonality (dissimilarity).

Fig. 1 S-BERT transforms tweets or sentences into 384 × 1 vectors whose similarity or distance can be compared using common distance metrics such as cosine similarity. The resulting similarity quantitatively expresses the semantic similarity between the respective tweets by gauging how well-aligned their S-BERT vectors are in a 384-dimensional semantic space (here depicted as 2-dimensional for visual simplicity). Prototypical tweet texts are displayed to exemplify similar and dissimilar classifications per our analysis

Provided S-BERT vectors represent the semantics of two respective tweets, we can thus determine the degree of semantic similarity of these tweets by calculating the cosine similarity of their respective S-BERT vectors. For example, a tweet “I am concerned about the side-effects of my PrEP treatment” would be translated to a specific 384 × 1 vector representing its content (384 × 1 is the default dimensionality for pre-trained S-BERT), while another tweet “I don’t think my PrEP pills are safe because of the effects it has on my mood” would be translated to another 384 × 1 vector. The cosine similarity between the two respective vectors is 0.624, indicating they are moderately well-aligned, and thus similar in meaning (both describe concerns about PrEP treatment), regardless of whether they use the exact same wording. On the other hand, the cosine similarity between “I am concerned about the side-effects of my PrEP treatment” and “I can’t afford the costs of my monthly PrEP treatment” would be 0.502, a lower value reflecting that the two tweets describe different concerns about PrEP treatment.
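The comparison described above reduces to a short computation. The sketch below implements cosine similarity in plain Python and applies it to made-up 3-dimensional vectors standing in for 384-dimensional S-BERT vectors; the numbers are illustrative only and do not reproduce the 0.624 and 0.502 values reported here. With the sentence-transformers library, real vectors would typically come from a model's `encode` method, a step omitted from this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy stand-ins for tweet embeddings (hypothetical values, not model output).
side_effects = [0.9, 0.1, 0.2]
mood_safety = [0.8, 0.3, 0.1]
cost = [0.1, 0.9, 0.3]

# The first pair points in a similar direction, the second pair does not.
print(cosine_similarity(side_effects, mood_safety))
print(cosine_similarity(side_effects, cost))
```

Because the measure depends only on direction, two tweets of very different length can still score as highly similar if their vectors point the same way.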

Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). The S-BERT vectors that are retrieved for each tweet are high-dimensional (D = 384) numerical indicators of the semantics of their content, and thus need to be projected to a lower-dimensional space (D = 2) for visualization. PCA and UMAP are common techniques for dimensionality reduction that were employed to facilitate analysis of sentence embeddings [23,24,25]. PCA extracts the principal components, a set of uncorrelated variables ordered so that each successive component captures the largest remaining share of variance between data points. The original data can be projected onto the most significant components, thereby retaining the most significant variation of the original data in a new, lower-dimensional space and assigning to each data point a new coordinate in this reduced space. Similarly, UMAP reduces data to a 2-dimensional visualization that preserves the similarities between each data point and its nearest neighbors. The operation of UMAP can be modified according to two parameters, namely the number of neighbors of a data point it takes into consideration and how tightly clustered neighboring data points are visualized in the resulting graph. By varying these parameters, one may control how much local versus global structure is preserved in the final projection. Here we apply PCA to the 384-dimensional S-BERT vectors retrieved for each PrEP tweet such that they can be positioned in a lower-dimensional semantic space spanned by the PCA components that explain the greatest amount of variance in the original data, followed by a UMAP procedure that positions each tweet in a two-dimensional visualization.
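For readers who prefer the PCA step in symbols, the following is the standard textbook formulation rather than a description specific to this study. The symbols are introduced here for illustration: X is the n × 384 matrix of mean-centered S-BERT vectors, C its sample covariance matrix, w_i and λ_i the eigenvectors and eigenvalues of C, and k the number of components retained.

```latex
\begin{aligned}
C &= \frac{1}{n-1}\, X^{\top} X, \qquad
C\, w_i = \lambda_i w_i, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{384} \ge 0, \\
Z &= X W_k, \qquad
W_k = \left[\, w_1 \;\cdots\; w_k \,\right] \in \mathbb{R}^{384 \times k}, \\
\text{retained variance} &= \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{384} \lambda_j}.
\end{aligned}
```

Each row of Z is the reduced coordinate of one tweet, and the last ratio indicates how much of the original variation survives the projection.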

K-Means Clustering. We divide the tweets in the UMAP visualization into a set of visually distinct clusters using k-means clustering. The k-means clustering algorithm [26] partitions a dataset into k number of highly dense sets of data points by adjusting a set of cluster centers such that the assigned clusters minimize the distance between the cluster center and the data points that are assigned to the cluster. K-means clustering requires the value k to be specified, which in the case of two-dimensional data may be determined by visually or qualitatively analyzing the data set for natural visually compelling groupings, such as distinct groupings of data points on a plot or logical divisions of topics in a set of text samples. Integrating dimensionality reduction (PCA and UMAP) with k-means clustering supports the visual analysis of (any) topical clusters occurring within the set of tweets analyzed.
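The alternating assign-and-update procedure described above can be sketched in a few lines of plain Python. This is a minimal version of Lloyd's algorithm with random initialization from the data points; the initialization scheme, iteration cap, and seed are illustrative assumptions, not the settings used in this study.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Partition `points` into k clusters; returns (centers, labels)."""
    rng = random.Random(seed)
    # Initialize centers at k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, labels


# Two well-separated toy groups of 2-D points (hypothetical map coordinates).
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_centers, toy_labels = kmeans(toy_points, k=2)
print(toy_labels)
```

Because k must be supplied in advance and every point is assigned to some cluster, isolated points can end up as small clusters of their own.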

Procedure

The first step in our process consisted of retrieving S-BERT vectors for each tweet. These high-dimensional (D = 384) sentence vectors were then reduced to two dimensions using a combination of PCA and UMAP such that each tweet could be placed on a 2-dimensional map according to its semantic similarity to other tweets, allowing for a visual analysis of the data. A wide range of parameter values was tested for PCA and UMAP, producing two-dimensional mappings of the corpus for visual analysis. These two-dimensional maps were subjected to a k-means clustering procedure which assigned each tweet to a cluster, thereby codifying the visual clustering of the set of 4,020 tweets in the map into an explicit partitioning.

The approach that was determined to be the most effective used PCA to reduce the initial S-BERT embeddings to 40 dimensions. We then further reduced to 2 dimensions using UMAP with parameters of 20 nearest neighbors and a minimum distance of 0.1. These parameter values direct UMAP to prioritize local groupings of data points more heavily than global structure, and were best able to preserve the structure of topical clusters out of all parameters tested.

Visual analysis of this data indicated that roughly 25 distinct topical clusters were present, which informed the application of k-means clustering to partition the data set into 25 clusters. Two of our authors, serving in part as domain experts, analyzed the resulting tweet clusters that the k-means algorithm identified. They independently generated topic summaries of the content of the tweet clusters they examined. See Fig. 2. Their summaries were compared by the lead author of this study, and overlap between summaries was deemed as sufficient agreement for interpretation.
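The procedure described in this section could be assembled roughly as follows. This is a sketch under stated assumptions, not the authors' code: it uses the sentence-transformers, scikit-learn, and umap-learn packages, and the model name `all-MiniLM-L6-v2` (which produces 384-dimensional vectors) is an assumption, since the specific pre-trained model is not named here. The random seed and `n_init` value are likewise illustrative. Only the four values in `PARAMS` come from the text.

```python
# Parameter values reported in the Procedure section.
PARAMS = {
    "pca_components": 40,   # PCA output dimensionality
    "umap_neighbors": 20,   # UMAP nearest neighbors
    "umap_min_dist": 0.1,   # UMAP minimum distance
    "n_clusters": 25,       # k for k-means
}


def map_and_cluster(texts, model_name="all-MiniLM-L6-v2", seed=0):
    """Embed texts, reduce them to 2-D, and cluster; returns (coords, labels).

    `model_name` is an assumed choice of 384-dimensional S-BERT model.
    """
    texts = list(texts)
    needed = max(PARAMS["pca_components"], PARAMS["n_clusters"])
    if len(texts) < needed:
        raise ValueError(f"need at least {needed} texts, got {len(texts)}")

    # Heavy dependencies are imported lazily, after the cheap check above.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    import umap

    # 1. One sentence vector per tweet.
    embeddings = SentenceTransformer(model_name).encode(texts)

    # 2. PCA: 384 -> 40 dimensions.
    reduced = PCA(
        n_components=PARAMS["pca_components"], random_state=seed
    ).fit_transform(embeddings)

    # 3. UMAP: 40 -> 2 dimensions for the visual map.
    coords = umap.UMAP(
        n_components=2,
        n_neighbors=PARAMS["umap_neighbors"],
        min_dist=PARAMS["umap_min_dist"],
        random_state=seed,
    ).fit_transform(reduced)

    # 4. k-means on the 2-D map coordinates.
    labels = KMeans(
        n_clusters=PARAMS["n_clusters"], n_init=10, random_state=seed
    ).fit_predict(coords)
    return coords, labels
```

Running PCA before UMAP, as described above, reduces the cost of the neighbor search and discards low-variance directions before the nonlinear projection. UMAP is stochastic, so a map produced this way will generally differ in layout from the published figure even with identical parameters.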

Fig. 2 Diagram showing how we produced a visual map and topical clustering of N = 4,020 PrEP-relevant tweets, partitioned into 25 clusters revealing relevant online PrEP topics, the content of which was validated by a team of experts

Results

We mined N = 4,020 tweets that matched the PrEP-relevant search terms listed above and translated their semantics into a visualization using S-BERT, PCA, UMAP, and K-means clustering. Broadly, we made several observations regarding the myriad contexts in which PrEP is discussed online. We present our findings briefly without comment.

RQ1: Can we meaningfully consolidate PrEP related tweets into themes or ideas?

Using the data processing pipeline shown in Fig. 2, we identified 25 unique themes that occurred among the N = 4,020 tweets in our sample. Tables I and II outline all themes, delineated by the name of each theme, a brief definition, and a list of the ten words most associated with each theme. Though most clusters were clear and interpretable, we observed one cluster with a string of unclear terms and phrases (see Cluster 18). Lists of associated terms were generated by frequency of terms within each cluster, and the presence of apparently non-topical terms such as “I” or “re-tweet” is a natural artifact of this process.
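The frequency-based term lists described above can be sketched as follows. The tokenizer is a simple regular expression chosen for illustration and no stop words are removed; both choices are assumptions rather than details taken from this study, but they show why common words such as “I” can surface among a cluster's top terms.

```python
import re
from collections import Counter


def top_terms(texts, labels, n=10):
    """Map each cluster label to its n most frequent lowercase tokens."""
    counts = {}
    for text, label in zip(texts, labels):
        # Keep letters, digits, hashtags, mentions, apostrophes, and hyphens.
        tokens = re.findall(r"[a-z0-9#@'-]+", text.lower())
        counts.setdefault(label, Counter()).update(tokens)
    return {
        label: [term for term, _ in counter.most_common(n)]
        for label, counter in counts.items()
    }


# Hypothetical tweets and cluster assignments.
example_texts = ["PrEP cost is high", "cost of PrEP", "I feel tired"]
example_labels = [0, 0, 1]
print(top_terms(example_texts, example_labels, n=2))
```

Filtering stop words or weighting terms by how specific they are to a cluster would give more topical lists, at the cost of an extra modeling choice.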

Table I List of Clusters and Associated Words per Cluster (Clusters 1–13)
Table II List of Clusters and Associated Words per Cluster (Clusters 14–25)

Cluster content spanned a wide range of relevant topics, reflecting a diversity of contexts that included the quotidian preoccupations of PrEP users. For example, clusters emerged relative to general PrEP uptake and PrEP information. We also identified several topics that appeared to discuss side effects associated with PrEP, including chronic fatigue syndrome and renal/hepatic issues. As a likely consequence of reported side effects, we also observed topics related to lawsuits against Gilead Sciences, the maker of Truvada, and Truvada alternatives (i.e., Descovy, a monthly injectable PrEP medication). Lastly, we observed several topics related to condomless sex or MSM hookups among those on an active PrEP regimen.

RQ2: How can interdisciplinary frameworks add additional nuance to online conversations about PrEP and other Public health topics more broadly?

By leveraging the visualization tools described above, we mapped the corpus of tweets into 25 topical clusters, as depicted in Fig. 3. These clusters represent bodies of tweets that are semantically and contextually similar, and are placed in the map such that their position reflects their relative similarity to other clusters. Because the K-means clustering algorithm assigns every tweet to a cluster, several outlying groups of tweets were unavoidably assigned their own clusters. These clusters (6, 10, 12, 13, and 20) comprised fewer than 10 tweets each.

Fig. 3 A vector map depicting 25 clusters identified in a collection of N = 4,020 tweets. Clusters in close proximity are semantically and contextually similar; clusters that are distal are semantically and contextually unrelated. Clusters on the right of the figure depict outlying clusters not visible in this map

Clusters in the center (i.e., topics 25, 11, 19, 2, and others) are general PrEP topics, which in some capacities are similar to all clusters located in the vector map. For example, cluster 25 contains tweets discussing Truvada costs and cluster 11 discusses PrEP medication relative to other methods of HIV prevention. Clusters further from the center are typically responses to news or specific events related to PrEP medications.

Close proximity among clusters indicates that the corresponding themes are similar or overlap in content. For example, clusters related to PrEP cost, PrEP generic alternatives, and Gilead price gouging are likely to have close vector representations and appear close to one another in the vector map given the likely similarity of these bodies of tweets.

Distal clusters (e.g., Topics 9 and 7 compared with 21 and 1) indicate bodies of tweets that are semantically and contextually different from each other. For example, the previously mentioned clusters regarding PrEP costs and generic alternatives are most distal from topics/themes about condomless sex and MSM hookups.

Discussion

The purpose of this study is to explore online discourse about PrEP using an interdisciplinary public health and computational informatics approach. By analyzing a diverse array of PrEP related tweets, we uncovered several allusions to PrEP promotion efforts and how PrEP is positively and negatively contextualized online.

PrEP as a Medical and Cultural Social Media Phenomenon

PrEP altered the scientific community’s approach to preventing HIV exposure and transmission [27]. As one of the first medications approved during the social media era [28], scientists leveraged such online spaces to promote PrEP uptake and adherence. A natural consequence of using online mediums for information dissemination is the diachronic and public domain nature of these data [29]. Over time online information and interventions for PrEP have created a nine-year trail of online discourse including how PrEP is broadly communicated online.

Most clusters uncovered by our analysis aligned with PrEP information dissemination. This includes topics related to general information about PrEP, medical costs, and various options for PrEP including Truvada (a once-daily oral medication) and Descovy (a once monthly injectable medication). Tweets associated with these clusters suggest sincere efforts to promote and disseminate medically accurate information about PrEP (e.g., TWEET: Talking to your doctor about HIV prevention treatment can be intimidating– but it doesn’t have to be). Similar tweets also addressed ways insurers, companies, and/or advocacy groups could mitigate the cost of PrEP (e.g., TWEET: PrEP to prevent HIV is expensive. We are here to help!). We also identified several topics that compared Descovy and Truvada and the associated pros and cons for each medicine (e.g., TWEET: Truvada, a once daily oral pill; or Descovy, a once monthly injection: Which is right for you?).

We also highlighted several clusters that expressly referred to how PrEP is communicated among the MSM community. These cultural aspects of PrEP are reflected in our findings, which yielded two topics related to MSM hookups, LGBTQIA+ identity, and condomless sex (Topic 15 (MSM community and PrEP) and Topic 16 (Condomless sex and MSM hookups)). Indeed, since PrEP’s inception the prophylactic treatment has forged new identities among gay and bisexual men with regard to casual dating, hookups, and perceptions of PrEP users and non-users [30]. For example, users of popular MSM internet hookup sites (e.g., Grindr, Scruff, and others) are increasingly disclosing HIV serostatus and PrEP use as part of their bios or personal profiles [31]. Yet, disclosing one’s serostatus or PrEP use may come at the cost of social stigmatization among others who choose not to follow these practices. Indeed, evidence suggests there is a sharp divide in attitudes toward and perceptions of MSM who use PrEP versus those who do not, and disagreement over each person’s choice. For example, tweets associated with these topics alluded to PrEP use disclosure and how PrEP users perceive themselves and others (TWEET: If it’s not queer shaming, it’s slut shaming à la “Truvada whore” getting bandied around amongst our own like we don’t have a modern medical miracle sitting in our goddamn laps). These tweets, and others, suggest a certain degree of skepticism toward sexual practices adopted by PrEP users, or even uncertainty about the safety of PrEP to reduce HIV infection. This observation is further supported by increases in STIs among MSM engaging in unprotected sex (i.e., barebacking), prompting concerns of drug-resistant STI strains [32]. Ongoing research on identity formation should examine the effects of PrEP on gay and bisexual social circles, including how PrEP regimens alter personal social networks in a hookup context.

Nuanced PrEP Topics Reveal Social and Medical Barriers Inhibiting Uptake

Since the FDA approved PrEP in 2012, an estimated 1 million global (and 100,000 US) adults regularly use PrEP to prevent HIV-1 infection [33]. Yet, estimates indicate that PrEP remains underutilized among all eligible groups [34]. In the United States, only one quarter of eligible adults (i.e., MSM, trans women, and people who inject drugs) are on a PrEP regimen [35]. Global estimates highlight similarly low uptake rates in Sub-Saharan Africa, Asia, and Latin America [36,37,38]. Due to poor uptake and adherence, there have been efforts to identify barriers that may inhibit PrEP use and access. Though barriers are numerous, and often context and country-specific, recurring concerns among US and global populations include: lack of knowledge/awareness about HIV and PrEP, perceived HIV risk, social stigma, healthcare mistrust, lack of access, financial burden, among others [39].

Several topics along the periphery of our vector map alluded to complications and barriers of PrEP uptake. Their peripheral position suggests these conversations are not the primary focus of the corpus but represent side conversations tangentially related to the topics located at the center of the vector map (i.e., PrEP information dissemination). Some barriers alluded to in these topics are highly documented, including associated PrEP costs and insurance coverage. However, amongst topics alluding to documented barriers to PrEP use, we also identified additional barriers that are well-known in legal circles yet somewhat empirically understudied, including PrEP effectiveness, PrEP side effects, exorbitant costs and price-gouging, and lawsuits against Gilead Sciences, the maker of Truvada. These topics strongly suggest that at least some discourse about PrEP is framed negatively, particularly by calling into question the long-term effects of continued PrEP uptake. These concerns are not new and are well documented [40]. Indeed, beginning in 2019, several state and federal lawsuits (and one class action lawsuit) were filed against Gilead Sciences. These lawsuits allege Gilead knew, or should have known, the active medication in Truvada could lead to serious side-effects including bone density loss, renal damage, and liver failure if not carefully monitored by medical providers [41]. Financial complaints similarly allege Gilead’s patent on Truvada (which would prevent the creation of a generic alternative) only served profiteering purposes by increasing the monthly cost of PrEP to between $1200 and $2000 USD [10, 42], though generic alternatives for Truvada have since become available.

Independently, PrEP related concerns and/or barriers identified in our model should not affect PrEP uptake. However, given low PrEP uptake and adherence, there is evidence these controversies may be adversely affecting PrEP adoption and maintenance [43]. Indeed, persistent concerns in e-health and digital health science are short and long-term effects of misinformation, disinformation, and ideological echo chambers on individual health beliefs and behaviors [44]. Social media’s sordid history of ideological polarization further supports these concerns with regard to information seeking and dissemination among likely groups. Regarding PrEP, in 2019, an influx of social media ads likely targeting anti-PrEP groups began seeking plaintiffs in lawsuits targeted at Gilead. A study on the effects of such ads concluded that nearly half of participants who viewed the ads would either never start PrEP or discontinue their current PrEP regimen [10]. This, coupled with conflicted perceptions of PrEP users among MSM suggests that tailored messaging is needed to mitigate the effects of misinformation and disinformation campaigns. Future studies should continue studying online PrEP discourse, including identifying sources of misinformation and how to counter it. Interventionists should also consider leveraging pockets of disinformation as viable insights/sources for interventions promoting accurate medical information.

Insights into Interdisciplinary Public Health & Computational Informatics Collaborations

Public health has historically borrowed methods, tools, and algorithms from computer science and computational informatics to mine social media data. As shown in our study, the synergistic use of public health frameworks with computer science tools uncovered nuanced portrayals about PrEP. This includes PrEP’s evolution from medical novelty into a cultural phenomenon in addition to controversies that may be weaponized via misinformation campaigns.

Collectively these findings suggest that data derived from social media can provide a more comprehensive portrait of PrEP and other medical interventions more broadly. However, limited understanding of all available tools and how to best apply them may create uncertainty about the validity of these data. In many scientific disciplines, including public health, social media data bears an unfortunate reputation for being unreliable and non-efficacious. Mistrust of social media data may stem from the radical departure of social media analytics from traditional quantitative (i.e., multiple regression) and qualitative (i.e., focus-groups, interviews) analyses; or the secondary data collection nature of scraping social media data that may be days, months, and in some cases, years old. However, the uniqueness of social media data necessitates methods to facilitate extraction, analysis, and synthesis of individual posts, timelines, and collections of timelines. Inherently, these methods must also understand the time-variant nature of social media and how that data contributes to a larger narrative, regardless of the data’s age. Refined algorithms in Computer Science, and other similarly technical fields, have afforded the opportunity to visualize the scope, scale, and precision of social media data, including methods undertaken herein. Indeed, the combined S-BERT, UMAP, and K-means approach can group similar tweets in a corpus into clusters that represent pockets of dialogue that, when mapped by vector representation, illustrate the semantic and contextual ways medical necessities are communicated online.

This study has timely implications for computational informatics and public health. From an informatics perspective, this study contributes to a body of research on social media discourse and misinformation. From a public health perspective there are numerous implications associated with our findings, including how these data can be leveraged even further for interventions. First, our analysis identified pockets of possible misinformation (and how it may impact PrEP uptake and maintenance). These bodies of tweets capitalized on controversies associated with Gilead Sciences, possibly creating concerns about Truvada’s safety, effectiveness, and unintended consequences of PrEP use. We also identified PrEP uptake as a form of judgement among people with differing views, for example, MSM PrEP users versus non-PrEP users and conflicts between them. This is not the only instance of communicative tensions online regarding medical interventions; COVID-19 vaccination status has played out similarly. While these insights are, by themselves, informative, we can increase granularity by leveraging metadata associated with tweets and clusters of tweets. Indeed, by leveraging metadata, we can determine whether tweets sharing information about Truvada’s adverse effects come from social media users or bot accounts, defined as automated accounts that are not operated by humans. If these tweets originate from individual users, then it is possible to mine individual timelines to determine personal characteristics about that user, including markers of mental illness and other patterns of social media behavior that may be facilitating problematic online posting habits. We can also examine the relative popularity of these tweets, as determined by the frequency with which they are shared among social networks. From a public health standpoint, these represent potential intervention targets, namely groups that may be misinformed about Truvada, for promoting accurate PrEP information.

Ethical Considerations for Mining Social Media Data Among Vulnerable Populations

Although social media analysis represents an important potential resource for developing insight into attitudes and behaviors relating to public health concerns, caution must be urged when undertaking such investigations. Indeed, ethical concerns related to social media mining, including consent, exploiting data, and unknowingly scraping personal social media feeds, represent persistent challenges in this area of research [45]. These challenges are particularly noteworthy for at-risk and vulnerable populations. Indeed, we must remember that the serostatus of individuals who are at risk of or currently living with HIV represents a deeply personal and sensitive characteristic. As such, in studies relating to PrEP interventions, and in studies involving vulnerable populations more broadly, the utmost care must be taken to ensure the anonymity and security of personal data. This study adhered closely to ethical principles for social media mining, including anonymizing data and deleting all personally identifiable account information. We highly encourage the adoption of these practices for any study involving social media data related to sensitive topics, including HIV dialogue. We also strongly discourage the analysis of only a small, select number of tweets/accounts with deep learning tools, as more data typically afford greater degrees of anonymity and accuracy of post classification.

Limitations

This work is subject to limitations we hope to address in future research. First, our study was limited by Twitter’s API, which allows users to collect only 1% of total tweet volume for a given search query. This threshold resulted in a relatively small sample of N = 4,020 tweets, which were likely intended for a highly specific population (i.e., men who have sex with men, people who inject drugs, or other persons at risk of contracting HIV). As a consequence of the limited populations for which PrEP is intended, it is likely there was less social media discourse relative to other medical interventions intended for the general population, such as the COVID-19 vaccine. Second, our study relied on tweets matching a limited set of PrEP-related search terms, thus excluding a wide range of tweets that may be relevant to PrEP but did not contain these specific terms. In other words, our tweet sample inclusion criterion focused on high precision, including only tweets that exactly matched a small set of PrEP-relevant terms, but likely yielded low recall. The outcomes of this analysis, however, can point to more sophisticated methods that identify a wider range of PrEP-relevant tweets and may do so in a manner that adapts to the changing landscape of online communities of interest. Third, our study was also limited given that we did not perform a full qualitative analysis of tweet content. Future studies should consider validating our findings by conducting a full qualitative analysis of this corpus.

Conclusions

Our findings demonstrate that leveraging interdisciplinary collaboration between computational informatics and public health can provide insight into discourse surrounding complex issues such as PrEP. Social media data contains a wealth of information regarding public attitudes towards health issues but extracting the nuances of these narratives requires the analysis of large amounts of unstructured data. By developing a research framework utilizing deep learning neural networks and pattern recognition tools to prepare data for qualitative analysis grounded in public health research, we were able to distill large data corpora into more coherent topical groupings for exploratory interpretation. The findings of this study indicate a need for deeper analysis into PrEP discourse on social media, as well as an opportunity to extend our research framework towards better understanding other public health issues.