Key Points

Social media data can provide information on medication intake and birth defects; however, the information obtained cannot replace pregnancy registries at this time. At present, these data are incomplete but may still be useful to supplement pregnancy registry data.

Future research is necessary to refine efforts and uses of social media data to support regulatory decision making regarding pregnancy outcomes with recently approved drugs used in women of child-bearing age.

1 Introduction

New pharmaceutical products undergo rigorous testing prior to approval. However, data on safety during pregnancy are sparse, particularly for new products [1], as clinical trials often exclude women who are pregnant or breastfeeding because of ethical implications. Up to 80% of pregnant women take at least one prescribed or over-the-counter (OTC) medication during their pregnancy [2]. Further, many pregnancies are unplanned (in the USA, almost half [3]); thus, unintended fetal exposure to medications during critical periods of development is likely to occur [4]. Therefore, evidence of safety during pregnancy relies on other sources of data, such as observational studies in the post-market setting.

The most common type of pregnancy surveillance study is the traditional pregnancy registry [5], which has been the standard for post-marketing pregnancy surveillance. Registries have many known disadvantages, including the lack of an appropriate comparator group to estimate background rates, selective loss to follow-up, and low rates of recruitment [6,7,8,9,10]. The routine use of ultrasounds and early prenatal screening challenges true prospective enrollment into traditional registries and may bias results of the registry [9]. Even with efforts made to enroll women as early as possible in the pregnancy (seventh or eighth week of gestation) [1], any adverse drug effects during early pregnancy may be missed [1]. Finally, the frequent inability to detect statistically significant associations within pregnancy registries is primarily owing to poor patient enrollment and the few birth defects recorded in these observational studies. Hence, although pregnancy registries have adequate statistical power to detect signals of major-risk teratogenicity (birth defect rate of 25%), they are not powered to detect signals of moderate-risk teratogens [11, 12] or specific birth defects [9]. These limitations, together with the knowledge that one data source is unlikely to be sufficient to provide enough information on potentially rare outcomes, have led researchers and regulatory agencies to identify supplementary sources of data for evaluating the safety of medicines in pregnancy.

Alternative sources for pregnancy surveillance include population-based surveillance registers, electronic healthcare records, and administrative claims databases, and studies within these databases have provided key evidence of drug exposures, pregnancy outcomes, and birth defects [13,14,15,16,17,18,19,20]. However, these sources lack data on OTC medicines and lifestyle factors, and prescription fills are used as a surrogate for medication intake. Further, population-based surveillance registers with linkage capabilities between the mother and baby are not available in the USA. Therefore, industry-sponsored voluntary registries focused on single drugs or groups of drugs associated with a disease (e.g., human immunodeficiency virus, epilepsy) remain the primary source of pregnancy safety surveillance [21, 22].

Social media are another potential emerging source of data for use in pregnancy surveillance. Social media data include information on lifestyle factors collected in women prior to pregnancy and early in the first trimester, when the risk of congenital abnormalities is highest [23]. Other advantages could include prospective data collection in real time and throughout pregnancy and the capture of information on OTC medicines and lifestyle factors, such as smoking and alcohol usage, that may be associated with deleterious pregnancy outcomes. It has been postulated, although not verified, that social media could have linked the rare, strikingly unanticipated fetal abnormalities seen during the “thalidomide storm” to drug usage within only 5–7 days [24].

We selected Twitter for this pilot study because Twitter is a very popular social media source, is publicly available, and our prior work has assessed the feasibility of identifying pregnant women who actively use Twitter [25]. For this study, we proposed that publicly available tweets throughout the full timeline of a pregnancy could be annotated, and potentially useful information on drug utilization and birth defects obtained. We hypothesized that much of the data routinely collected in registries, such as basic demographics, medicine intake, and birth defects, could be obtained from the social media posts throughout the full term of a pregnancy, and that this annotation could be performed automatically in the future. With data from these timelines, we assessed the feasibility of constructing a nested case-control study within a cohort of pregnant women to quantify the association between pregnancy-related exposure and birth defects.

2 Methods

The methodology followed for this study needed to address the problem of identifying women having a baby with a birth defect using primarily automated methods. Given the rare occurrence of birth defects, case-control studies nested within large populations are the preferred approach for the evaluation of specific pregnancy outcomes [6]. Thus, to identify the case and control groups from social media, we first needed to identify pregnant women amongst the millions of Twitter users. Our initial work was focused on this, detecting users to add to our pregnancy database via a single tweet announcement [25]. Our automatic classification system achieves an F1-score of 0.88 for identifying pregnancies. The F1-score is computed as 2 × (Precision × Recall)/(Precision + Recall), where precision is True Positives/(True Positives + False Positives), and recall is True Positives/(True Positives + False Negatives). Once pregnant users were identified, all of their publicly available tweets (their “timeline”) were collected. A total of 112,429 users were identified and their timelines collected. A method for estimating the number of timelines that encompass the user’s pregnancy was developed [26], resulting in a total of 44,825 timelines in our database.
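As a minimal sketch of the F1-score definition given above, the following Python function computes it from raw counts. The function name and the example counts are hypothetical and chosen only for illustration; they are not the counts behind the reported F1-score of 0.88.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score from true positives, false positives, and false negatives."""
    if tp == 0:
        # Precision and recall are both zero (or undefined), so F1 is taken as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * (precision * recall) / (precision + recall)


# Hypothetical counts for illustration only (not data from this study).
example = f1_score(tp=80, fp=10, fn=30)
print(f"F1 = {example:.2f}")
```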

2.1 Selection of the Cohorts

A birth defects cohort (cases) was created by retrieving and annotating tweets from the pregnancy database that mention birth defects. This method, which we summarize in the remainder of this sub-section, is described in further detail in another publication [26]. As Fig. 1 illustrates, we manually compiled a lexicon of approximately 650 terms referring to birth defects (Penn Social Media Lexicon of Birth Defects), based on published reports, guidelines, and the Unified Medical Language System [27,28,29,30,31], and semi-automatically generated lexical variants of these terms (e.g., misspellings). To retrieve tweets containing (variants of) the terms, we implemented hand-crafted regular expressions in a series of database queries. We post-processed the retrieved tweets by removing those containing user names and URLs matched by the regular expressions. With this retrieval method, a total of 16,822 tweets (posted by 5923 users) were collected, with a recall of 0.95. The tweets were annotated by two annotators. We developed annotation guidelines to distinguish three classes of tweets, summarized as follows:

Fig. 1 Workflow of tweet collection, tweet annotation, and timeline analysis for selecting the birth defects (case) cohort

Defect (+):

The tweet refers to a person who has a birth defect and identifies that person as the Twitter user’s child

Possible Defect (?):

The tweet is ambiguous about whether a person referred to has a birth defect and/or is the Twitter user’s child

Non-defect (−):

The tweet does not indicate that a person referred to has or may have a birth defect and is or may be the Twitter user’s child
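The lexicon-based retrieval step that produced the tweets for these three classes can be illustrated with a short Python sketch. It is not the implementation used in the study: the three lexicon terms, the variant rule (optional spaces or hyphens between words), and the function names are assumptions for illustration, and user names and URLs are stripped before matching rather than filtered afterwards.

```python
import re

# Hypothetical subset of a birth defects lexicon (the real lexicon has about 650 terms).
LEXICON = ["cleft lip", "club foot", "down syndrome"]

# User names and URLs are removed first so that matches inside them are ignored.
USER_OR_URL = re.compile(r"https?://\S+|@\w+")


def build_pattern(terms):
    """Compile one case-insensitive pattern that tolerates spacing/hyphen variants."""
    alternatives = [
        r"[\s-]*".join(re.escape(word) for word in term.split())
        for term in terms
    ]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def mentions_defect(tweet, pattern):
    """Return True if the tweet text (outside user names and URLs) matches the lexicon."""
    cleaned = USER_OR_URL.sub(" ", tweet)
    return pattern.search(cleaned) is not None


PATTERN = build_pattern(LEXICON)
print(mentions_defect("My son was born with a cleft-lip", PATTERN))
print(mentions_defect("@clubfoot_fan see you soon", PATTERN))
```

Tweets retrieved in this way would still need manual annotation, because a lexical match alone does not show whose child, if any, has the birth defect.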

Inter-annotator agreement (Cohen’s kappa) was high (κ = 0.79). In total, 765 (4.55%) tweets were annotated as “defect,” 877 (5.21%) tweets were annotated as “possible defect,” and 15,180 (90.24%) tweets were annotated as “non-defect.” The annotations directed us to the timelines of the users who posted them, for an inclusion/exclusion analysis to determine a final cohort. Users were excluded from the cohort if we could not determine if they were the parent of a child with a birth defect, or if there were no tweets available during the pregnancy with a birth defect outcome. First, we analyzed the timelines of the 359 users who posted a “possible defect” tweet (without also posting a “defect” tweet), and determined that 142 (39.55%) of them were indeed the parent of a child with a birth defect. Then, we analyzed the timelines of these 142 users and the 287 users who posted a “defect” tweet, and determined that 196 (45.69%) of the 429 timelines encompassed tweets from the timeframe of the pregnancy with a birth defect outcome. Thus, we identified 196 users for our birth defects (case) cohort. For this study, the timelines of the 196 women reporting a birth defect (cases) were matched on timing of pregnancy to timelines of 196 women not reporting any birth defects (controls) in the pregnancy database.

2.2 Data Preparation

All tweets mentioning birth defects were automatically identified using the lexical approach presented in [26]. For this project, we retrieved the timelines of the users corresponding to these tweets and tagged recognizable medication names to facilitate the manual annotation process of the timelines. A set of 37 drug names, including variants and misspellings, was already annotated in our timelines and manually classified into intake, possible intake, or no intake categories. For greater coverage of medications, we extended this initial set of drugs with the list of drug names published in the Drugs@FDA database (Footnote 1). We added lexical variants (possible misspellings) of these drug names and tagged the names in the timelines. All drug mentions found in the tweets during this last process were then automatically classified as intake, possible intake, or no intake using our in-house classifier [32]; this pre-annotating step speeds up manual curation. In our past work [32], inter-annotator agreement (Cohen’s kappa) for manually identifying medication intake was very high (κ = 0.88). Finally, all mentions of gestational ages were automatically pre-annotated and tagged in the timelines [33].

2.2.1 Annotation of Exposures of Interest

To analyze the data for the cases and controls, we first needed to manually annotate the timelines for exposures of interest whenever we did not have any automatic method of doing so, and to corroborate any information tagged automatically. Exposures of interest were: maternal age, due date, place of residence, race/ethnicity, medicine intake at first, second, and third trimester, and birth defects. We created an annotation guideline with examples, and selected the General Architecture for Text Engineering environment for annotation [34].

We annotated all tweets in the timeline of the pregnancy, defined as the time of pregnancy plus 1 month before and 1 month after. Within this timeframe, we annotated all mentions of gestational ages, any indications of the due date of delivery, the pregnancy outcome, and the date of birth of the child. We also annotated each tweet mentioning a drug name listed in the Drugs@FDA database and if the drug was taken (or possibly taken) by trimester of the pregnancy. We annotated all mentions of birth defects and then classified them under their corresponding Medical Dictionary for Regulatory Activities categories. Annotation guidelines are provided as a supplement.

Maternal age was often given in reference to a birthday, such as “I’m 24 on Friday” or “Only 2 hours until I’m 21 and legal!” Others mentioned their age in passing: “I’m 22 but look 26” or “Gosh you would think I would know that at 20.” Where only approximations of age were given (e.g., women indicated that they were in their 20s or 30s), we categorized these as missing data.

The country of residence of the woman was often present in the profile information (e.g., “Proud Colombian,” “Texas,” or “Bangor, Wales”) or stated in a post. Race was also sometimes explicitly stated: “Just because I’m Hispanic …” or “I’m not African-American I’m black American.”

Medications were categorized based on the available evidence of risks associated with taking particular medicines while pregnant as per the Australian categorization system (Footnote 2). We selected this categorization system as there is no language barrier, it has greater granularity of classifications (with seven categories A, B1, B2, B3, C, D, X), is easy to use, and is up to date. Some comments were not possible to classify as there was insufficient detail to identify the medication, such as “My pain meds aren’t working anymore,” or “Got to take antibiotics for …”. Medications were grouped into ‘probably safe’ or ‘potentially risky’ to help facilitate the analysis and compare with previous studies [35]. The ‘probably safe’ category consisted of A, B1, and B2 classifications and the ‘potentially risky’ category consisted of B3, C, D, and X.
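The grouping of the seven categories can be written down as a small lookup. This is an illustrative Python sketch only: the function name and the handling of missing codes as ‘unclassified’ are assumptions, and the code does not assign any category to a specific drug.

```python
# Grouping of the seven pregnancy categories as described in the text.
PROBABLY_SAFE = {"A", "B1", "B2"}
POTENTIALLY_RISKY = {"B3", "C", "D", "X"}


def risk_group(category):
    """Map a pregnancy category code to the analysis group used in this study.

    Assumption: mentions with no identifiable category (None or empty string)
    are treated as 'unclassified'.
    """
    if category is None or not category.strip():
        return "unclassified"
    code = category.strip().upper()
    if code in PROBABLY_SAFE:
        return "probably safe"
    if code in POTENTIALLY_RISKY:
        return "potentially risky"
    raise ValueError(f"Unknown pregnancy category: {category!r}")


print(risk_group("B2"))
print(risk_group("b3"))
print(risk_group(None))
```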

Although in most instances the medications were named in the tweets, we made a concerted effort not to publish the individual drug names in this article. This was because the study was a feasibility study to test the methodologies using social media. Given the exploratory nature of these methods, we chose not to study individual drug products and raise concern over spurious safety signals without further evidence of causality. To gather additional information on drug classes would require additional data mining and natural language processing information that was not collected initially and is beyond the scope of this project.

Most of the comments on medication intake refer to when the medication was consumed either directly or indirectly: “4 mg of X four hours ago and still got a headache,” “taken X and now off to bed.” We were able to ascertain the timing of the medication intake for the users by the date of the post and then assess whether the intake was from the first, second, or third trimester. Many women gave the actual due date for their baby or provided information from which the due date could be calculated: “I am due on the 24th of February” or “I’m due a week today.” There were many references to the length of term of the pregnancy (either by how far into the pregnancy they were or by how long the pregnancy had left), for instance, “I’m 24 weeks today” or “6 more weeks to my due date.” From gestational age annotations, the annotator could calculate an estimated pregnancy conception date using an Internet pregnancy calculator (Footnote 3). From this information, exposure to medications could be categorized as in the first, second, or third trimester.
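The date arithmetic behind this trimester assignment can be sketched in Python. The study used an Internet pregnancy calculator, so the following is an approximation under stated assumptions: a 280-day (40-week) pregnancy counted from its estimated start, a first trimester of completed weeks 0 to 13, a second of weeks 14 to 27, and a third from week 28 onward. The function names and example dates are hypothetical.

```python
from datetime import date, timedelta

FULL_TERM_DAYS = 280  # assumption: 40 weeks from the estimated start of pregnancy


def start_from_due_date(due_date):
    """Estimate the start of pregnancy from a stated due date."""
    return due_date - timedelta(days=FULL_TERM_DAYS)


def start_from_gestational_age(post_date, weeks):
    """Estimate the start of pregnancy from a post such as 'I'm 24 weeks today'."""
    return post_date - timedelta(weeks=weeks)


def trimester(post_date, pregnancy_start):
    """Return 1, 2, or 3 for a post date, or None if it falls outside the pregnancy."""
    days = (post_date - pregnancy_start).days
    if days < 0 or days > FULL_TERM_DAYS:
        return None
    completed_weeks = days // 7
    if completed_weeks < 14:
        return 1
    if completed_weeks < 28:
        return 2
    return 3


# Hypothetical example: a user states that she is due on 24 February 2016.
start = start_from_due_date(date(2016, 2, 24))
print(start, trimester(date(2015, 8, 1), start))
```

A medication-intake tweet would then be assigned to a trimester by passing its posting date to `trimester`.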

2.3 Statistical Methods

Using the cc command in Stata, we estimated the odds ratios for each risk factor. The confidence intervals were estimated by the exact method. For the type of medication, we calculated an overall p value using a chi-squared test for a contingency table, as these are not independent variables but different categories of the same variable. A p value < 0.05 was considered statistically significant.

To check whether matching was informative, the analysis for any medication use was also carried out by logistic regression, both ignoring the matching and allowing for matching using robust standard errors. Results were not appreciably altered and, therefore, we conducted an analysis without matching to minimize the impact of missing data.

To check whether other available risk factors (age, ethnicity, country of residence) could explain the relationship between birth defects and medication, logistic regression was used. We excluded women for whom any of these variables were missing and for ethnicity and residence categories with fewer than ten women, we used the “other” category.

3 Results

The mean number of posts per pregnancy timeline in the cases was 2903, varying from 70 to 15,271 posts. The average number of posts per woman in the control group was lower at 2582 (range 19–9142). For comparison, in the entire database of 112,429 timelines, the mean number of posts per person was 3850 (range: 2–80,023 posts). Annotation of each of the 196 case and 196 control pregnancy timelines took an average of 2 h.

The rate of birth defects in our cohort of pregnant women (cases) was 0.44%. We calculated this rate by taking the ratio between the number of pregnant women reporting a birth defect and the estimated total number of timelines in our database in which we had access to the user’s tweets during pregnancy (i.e., 196/44,825). Examples of birth defects included cleft lip, club foot, congenital heart defects, and Down’s syndrome. The frequencies of the consolidated birth defects reported in our cases are presented in Table 1. The categories used were based on Medical Dictionary for Regulatory Activities sub-classes.

Table 1 Birth defects by the Medical Dictionary for Regulatory Activities (MedDRA)

3.1 Characteristics of Women

Women who gave birth to a baby with a birth defect had a different demographic profile than women who gave birth to a baby without a birth defect. Cases were older, more likely to be Caucasian, and less likely to live in the USA; cases were also more likely to have missing information on race and less likely to have missing data on age. The distributions of age in both cases and controls are presented in Table 2 and Fig. 2.

Table 2 Characteristics of the cases and controls among women who gave birth
Fig. 2 Age of the women who gave birth to a baby with a birth defect (cases) and without a birth defect (controls). CM, women who gave birth to a baby with a malformation

3.2 Timing and Type of Medication Intake

Cases reported taking some form of medication during pregnancy more frequently than the controls (35% vs. 17%) (Table 3). Many women, particularly in the cases, mentioned more than one medication intake or taking the same medication on more than one occasion (Table 4).

Table 3 Medication intake, timing, and type in the cases and controls
Table 4 Percentage of instances of intake of ‘probably safe’, ‘potentially risky’, and ‘unclassified’ medications

In the first and third trimesters, the number of women taking medication among the cases was significantly higher than in the controls [odds ratio (OR) = 3.59 (1.44–10.13); p = 0.002 and OR = 2.22 (1.23–4.08); p = 0.004, respectively]. In the second trimester, although a higher number of women among the cases took medications than the controls (22 vs. 12), this was not significantly different [OR = 1.94 (0.89–4.43); p = 0.07].

When the analysis was restricted to women who reported taking medications during pregnancy, the pattern in timing of intake among the cases and controls was similar (Table 3). Among women taking any medication, there was no statistically significant difference in the timing of intake between the cases and controls (p = 0.1, p = 0.7, and p = 0.9 for the first, second, and third trimester, respectively) (Table 3).

There were 53 different medications reported as taken in the timelines in the cases and 24 different medications mentioned in the control timelines. The number of women taking ‘probably safe medications only’, ‘at least one potentially risky medication’, or ‘at least one unclassified medication’ was higher in the cases than in the controls (42/196, 21% vs. 22/196, 11%; 14/196, 7% vs. 6/196, 3%; and 12/196, 6% vs. 6/196, 3%, respectively) (Table 3). If we limit our analysis to only those women who reported taking at least one medication, we find that the pattern of intake by type of medication is very similar in the cases and controls (62% vs. 65%, 21% vs. 18%, and 18% vs. 18%) [p = 0.9] (Table 3).
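As a rough check on the comparison among women who reported at least one medication, the counts above (42, 14, and 12 cases; 22, 6, and 6 controls) can be placed in a 2 × 3 contingency table. The Python sketch below assumes a Pearson chi-squared test, which may differ in detail from the Stata analysis, and uses the closed-form tail probability exp(−x/2) that holds only for 2 degrees of freedom. The function names are hypothetical.

```python
import math


def pearson_chi_squared(table):
    """Return the Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


def p_value_two_df(statistic):
    """Upper-tail probability of a chi-squared variable with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2)


# Rows: cases, controls. Columns: probably safe only, potentially risky, unclassified.
table = [[42, 14, 12], [22, 6, 6]]
chi2 = pearson_chi_squared(table)
p = p_value_two_df(chi2)
print(f"chi-squared = {chi2:.3f}, p = {p:.2f}")
```

Under these assumptions, the statistic is small and the p value rounds to 0.9, in line with the value reported above.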

3.3 Predictors of Birth Defects

Using logistic regression, medication use was associated with a greater risk of birth defects [OR = 2.53; p < 0.001, 95% confidence interval (CI) 1.58–4.06]. This result was not appreciably altered after adjusting for age, ethnicity, and country of residence. In multivariable models, the association between any medication use and the risk of birth defects was slightly reduced (OR = 2.34; p = 0.004, 95% CI 1.24–4.44), but it remained highly significant (Table 5).
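The unadjusted estimate can be approximated from the counts implied in Section 3.2: 68 of 196 cases (42 + 14 + 12) and 34 of 196 controls (22 + 6 + 6) reported any medication. The Python sketch below computes the cross-product odds ratio with a Wald-type (log-scale) 95% interval, which is an assumption about how the logistic regression interval was formed; it is not the exact method applied with the Stata cc command. The function name is hypothetical.

```python
import math


def odds_ratio_wald(exposed_cases, unexposed_cases, exposed_controls,
                    unexposed_controls, z=1.96):
    """Return (odds ratio, lower limit, upper limit) using a Wald interval on the log scale."""
    odds_ratio = (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)
    standard_error = math.sqrt(
        1 / exposed_cases + 1 / unexposed_cases
        + 1 / exposed_controls + 1 / unexposed_controls
    )
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * standard_error)
    upper = math.exp(log_or + z * standard_error)
    return odds_ratio, lower, upper


# Counts derived from Section 3.2: 68/196 cases and 34/196 controls took any medication.
estimate, lower, upper = odds_ratio_wald(68, 196 - 68, 34, 196 - 34)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

With these counts, the sketch gives values that round to the unadjusted OR of 2.53 and the interval of 1.58–4.06 reported above.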

Table 5 Odds ratios (95% confidence intervals) [ORs (95% CIs)] for birth defects by various demographic and lifestyle factors

Conducting a one-factor analysis for age, ethnicity, and residence, we found that older women were more likely to report birth defects [age (per year): OR = 1.10, p < 0.001, 95% CI 1.05–1.15] (Table 5). When ethnicity and residence were included in the model as categorical factors, ethnicity was statistically significant (p = 0.008), whereas country of residence was not (p = 0.3). For both categorical variables, missing was included as a category, and caution must be taken in interpreting these results.

4 Discussion

We have demonstrated that there is a large amount of data publicly available on social media, specifically Twitter, from women during their pregnancy and on their pregnancy outcomes, with many women posting on a daily basis. From these data, we created a prospective timeline for women posting on social media regarding their pregnancies. We were also able to extract information on birth defects, lifestyle factors, and medication intake, including the frequency, timing, and type of medication use before and during the gestational period. However, the main results of the pilot study demonstrated a rate of malformations (0.44%) lower than the rate reported in the general population (3%), highlighting incompleteness and bias in social media data with respect to sensitive medical information such as birth defects.

Our analytic approach to social media data included a nested case-control study comparing exposures among women who gave birth to babies with a birth defect to women whose baby did not have a birth defect. We found that women who gave birth to babies with a birth defect were more likely to be older, Caucasian, and live outside the USA. Even after accounting for age, race, and place of residence between cases and controls, a higher medication intake was observed in pregnancies that reported birth defects. However, women who gave birth to babies with a birth defect also had a higher rate of missing data, limiting the causal inferences that can be made from this analysis. As automated methods for annotation of key demographic, medical, and social data are further refined and validated, a nested case-control study design will be the ideal study design to assess pregnancy outcomes from social media data sources because of the rarity of birth defects in the population [6].

There are a number of potential benefits of social media data as an alternative to pregnancy registries. First, even if the women may be identified later in their pregnancies, data are collected prospectively, therefore reducing or eliminating recall bias. Other advantages are the potential availability of data on OTC medications, illicit drugs, and lifestyle factors such as smoking and alcohol that are not captured during routine healthcare encounters and other secondary data sources. Another benefit of social media data is that a comparator group of unexposed pregnant women can be ascertained, which is often lacking in traditional registries. Additionally, although not the focus of our investigation, social media posts contained many adverse pregnancy outcomes, such as early pregnancy loss, low birth weight, and premature delivery, which are not the primary outcomes of interest in pregnancy registries.

Conversely, there are several drawbacks of social media data. First, there are potential differences in key factors associated with birth defects when compared with the general population [36]. For example, the mean age of the cases and controls in this study was approximately 7 and 9 years younger than the general population, respectively, although this may be a reflection of the large number of women classified as having a missing age [37, 38]. Other factors, such as the education levels and social class of social media users, may differ from the wider population. Additionally, the proportion of women reporting at least one medication use is low in both our cases and controls compared with other studies [2]. Finally, the rate of birth defects in the social media population was lower than the rate in the general US population [39, 40].

Reasons for the underestimation in the current study include the incompleteness or underreporting of key information as a result of multiple factors, such as the fact that women may be less likely to report high-risk behaviors and women who are aware that their babies may have a birth defect may be less likely to discuss this information on social media. Additionally, the Natural Language Processing method used to identify birth defects might not be able to capture all such mentions and requires further development. Many women did not allude to the birth defects in much detail or with as much frequency as would be expected given the detail in their other posts while pregnant. Some women also played down any birth defect, posting remarks such as “it’s no big deal but …”.

Recent reviews of traditional pregnancy registries conducted by the US Food and Drug Administration have identified key challenges in the recruitment of patients including the reduced likelihood of women to continue to use drugs that may be associated with birth defects [41] and the widespread use of early prenatal screening [8]. Social media have the potential to identify women for recruitment into traditional registries even prior to conception, women who are exposed and unexposed to the drug of interest, and to reduce the recall bias associated with key lifestyle and medical factors contributing to birth defects. This ability to target and recruit women from a larger pool would allow for the assessment of birth defects with greater statistical power, and the availability of non-exposed women provides greater clinical relevance to these statistical findings. The ethical issues around this active recruitment method need careful consideration [42]. It is anticipated that ethical approval and informed consent will be required to collect information in this manner and for its use for research purposes. Guidelines to help assist researchers to consider the ethical issues for the many different approaches to using social media in research are available and continue to be developed [42].

The information from social media could also be used to inform public health and health promotion campaigns. Some of the medications identified among this study cohort are known to cause problems during pregnancy [43] and many have safer alternatives. The patterns of medication intake could be used to prioritize which medications should be highlighted as potentially unsafe during pregnancy in public health messages. For instance, the risks associated with the use of ibuprofen during pregnancy may not be understood by women. While data are mixed, non-steroidal anti-inflammatory drugs such as ibuprofen have been linked to an increased risk of spontaneous abortion and congenital malformations when taken in the first trimester [43,44,45] and linked to renal impairment and cardiopulmonary abnormalities in the neonate when taken later in pregnancy [46]. Additionally, there have been reports on an increased risk of postpartum hemorrhage for women exposed to non-steroidal anti-inflammatory drugs [44].

4.1 Limitations

Automatic language processing methods utilized in this study enabled the selection of pregnant women from social media. These methods also facilitated the identification of concepts of interest (birth defects, medication intake, and pregnancy timeframe) and greatly reduced the annotation effort. However, the manual annotation effort for the identification of birth defects, which was the primary focus of this study, still required 800 h to annotate over 100,000 tweets, which limited the ability to include a greater number of controls for each pregnant case and to extract additional valuable information available within these tweets. The amount and detail of information disclosed in the pregnancy timelines was considerable and sometimes overwhelming. The amount of information varied from individual to individual, from some Twitter users who disclosed many personal thoughts to others who limited the personal information they chose to post. Twitter has recently increased the number of characters allowed on each post from 140 characters to 280 characters. This may increase the level of detail posted, lead to less ambiguous posts, and improve data clarity, while also increasing the annotation burden. Additional methodologic challenges included the inability to match cases and controls by maternal age, despite age being the biggest risk factor for birth defects. Not all users had their age in their profile information or posted it in their tweets. Further automatic language processing advances are warranted to improve these methods and to develop new methods to automatically extract other relevant data from social media timelines (such as pregnancy outcomes, age, place of residence, race, and later substance use) for rapid safety surveillance. For example, automatic methods to determine age from a timeline would have facilitated a greater than 1:1 match between cases and controls.

Additional research to develop automatic methods for detecting birth defects through social media data is needed. With additional inputs and broader algorithms, we may be able to capture additional pregnancies with birth defects in future work.

5 Conclusion

In future research, the study design should ideally incorporate matching of cases and controls by key factors including age, race, geography, gestational timeline, and volume of tweets to have greater certainty on conclusions drawn regarding associations between drug use and outcomes of interest. The specific focus on matching on volume of tweets is to reduce the likelihood that key medical information would be missing; we also need to consider how to reduce the chance of false negatives among the control arms. Further, it cannot be assumed that no mention of a birth defect or medication intake indicates that no such event occurred. Therefore, a validation of cases of birth defects identified through pregnancy timelines against diagnoses from medical records would provide additional certainty regarding the specificity and sensitivity for case ascertainment. The associations (positive or negative) derived from social media data should be validated against the association estimated from other data sources, including voluntary registries and claims/electronic healthcare record databases. These validation efforts will be required to use results from studies in social media data sources in submissions to regulatory agencies as an alternative to traditional voluntary registries. Other types of social media (particularly non-microblogging sites) should be investigated as different results may be obtained. Finally, future research is needed to determine in which scenarios social media data may be most informative, including for which drug types, frequency of exposure, and magnitude of association.

While social media data have their limitations, in this pilot effort, we have demonstrated that it is feasible to use Twitter data in assessing medication intake and birth defects; however, the information obtained cannot replace pregnancy registries at this time. With further refinement and validation, social media data could potentially complement other established methods in further characterizing the effects of drugs after introduction to the market, including in populations underrepresented or not studied (i.e., pregnant women) in clinical development programs. Development of improved methods to automatically extract and annotate social media data may increase their value in supporting regulatory decision making regarding pregnancy outcomes in women using medications during their pregnancies.