Product Selection
The activities described in this article were part of the WEB-RADR project. The WEB-RADR consortium decided to focus these activities on six “drugs of interest” (DOIs), i.e. substances manufactured by one of the companies participating in the WEB-RADR consortium (Bayer, Novartis, Sanofi, UCB). For each DOI, a list of product search terms was created using the WHO Drug Global lexicon of drug names. The product search terms included the products’ generic names, trade names, abbreviations and common misspellings. This resulted in a list of 880 product search terms (between 12 and 418 per DOI) that were used for the Twitter data extraction. Key characteristics of the six DOIs are given in Table 1.
Table 1 Key characteristics of the six “drugs of interest” (DOIs) used for the development of the benchmark reference dataset
Twitter Data Extraction, Deduplication and Sampling
The social media data analysed in this report were acquired via an Application Programming Interface from publicly available English-language Twitter posts (Tweets) created between 1 March, 2012 and 1 March, 2015. At the time of data acquisition, Tweets were limited to 140 characters. The data retrieval query that was used to extract data from Twitter contained the 880 product search terms identified in the product selection phase (see Sect. 2.1) and yielded a total of 5,645,336 Tweets. Each of these Tweets potentially contained at least one of the DOIs but not necessarily an AE. The review and annotation of the Tweets later revealed that some Tweets did not contain any DOIs but were included in the data extract as they matched product search terms with alternative connotations (e.g. “ambien”, “concentra”, “freederm”, “intermezzo”).
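For illustration, the extraction can be approximated as a simple keyword and date filter over retrieved posts; the search terms, example Tweets and regular-expression approach below are assumptions of this sketch rather than the actual WEB-RADR retrieval pipeline.

```python
# Illustrative sketch only: placeholder search terms and Tweets, not the WEB-RADR data.
# In the study, 880 product search terms were used and ~5.6 million Tweets were retrieved.
import re
from datetime import datetime, timezone

all_terms = ["generic_name_1", "trade_name_1a", "misspelling_1"]  # placeholders
raw_tweets = [                                                    # placeholders
    {"id": 1, "text": "Just started generic_name_1 today", "created_at": "2013-06-15T10:00:00+00:00"},
    {"id": 2, "text": "Unrelated post", "created_at": "2014-01-02T12:30:00+00:00"},
]

START = datetime(2012, 3, 1, tzinfo=timezone.utc)
END = datetime(2015, 3, 1, tzinfo=timezone.utc)
# Case-insensitive word-boundary match against any product search term.
term_pattern = re.compile(r"\b(" + "|".join(map(re.escape, all_terms)) + r")\b", re.IGNORECASE)

def matches_query(tweet: dict) -> bool:
    created = datetime.fromisoformat(tweet["created_at"])
    return START <= created < END and term_pattern.search(tweet["text"]) is not None

candidate_tweets = [t for t in raw_tweets if matches_query(t)]
print(len(candidate_tweets))  # 1
```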
To remove potentially redundant data, locality-sensitive hashing [7] was applied to the 5,645,336 Tweets, resulting in the removal of approximately 80% of the Tweets as duplicates or near-duplicates. The largest single cluster of duplicates identified by this method contained around 11,000 near-identical Tweets, mostly re-tweets. The remaining subset contained approximately 1.1 million Tweets, which were grouped by substance name.
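The following sketch illustrates near-duplicate removal with MinHash-based locality-sensitive hashing, here via the open-source datasketch package; the whitespace tokenisation and the 0.8 similarity threshold are assumptions and do not reproduce the exact WEB-RADR implementation.

```python
# Near-duplicate removal with MinHash LSH (datasketch); illustrative parameters only.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

tweets = {
    "1": "feeling dizzy since I started <substance name>",
    "2": "RT feeling dizzy since I started <substance name>",  # near-duplicate (re-tweet)
    "3": "<substance name> price drop announced today",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # assumed Jaccard similarity threshold
kept = {}
for tweet_id, text in tweets.items():
    m = minhash(text)
    if lsh.query(m):       # a near-identical Tweet has already been kept
        continue           # drop as duplicate / near-duplicate
    lsh.insert(tweet_id, m)
    kept[tweet_id] = text

print(sorted(kept))  # the re-tweet "2" is removed
```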
From this subset of Tweets, posts were randomly sampled until a target number of at least 1500 posts per DOI was reached. The resulting dataset contained a total of 57,473 Tweets (1–2228 Tweets per product search term). Figure 1 shows the selection and filtering of Tweets through the data extraction, deduplication and sampling process.
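A minimal sketch of such per-DOI random sampling is given below; the grouping structure and the fixed random seed are assumptions, and the published procedure is described only at the level of detail given above (the final sample evidently retained considerably more than the minimum of 1500 Tweets per DOI).

```python
# Illustrative per-DOI sampling; tweets_by_doi maps each DOI to its deduplicated Tweets.
import random

TARGET_PER_DOI = 1500    # minimum number of Tweets per DOI stated above
rng = random.Random(42)  # fixed seed only for reproducibility of this sketch

def sample_per_doi(tweets_by_doi: dict) -> list:
    sampled = []
    for doi, tweets in tweets_by_doi.items():
        k = min(TARGET_PER_DOI, len(tweets))
        sampled.extend(rng.sample(tweets, k))
    return sampled

# Placeholder input: two DOIs with differing numbers of candidate Tweets.
example = {"doi_a": [{"id": i} for i in range(5000)], "doi_b": [{"id": i} for i in range(900)]}
print(len(sample_per_doi(example)))  # 1500 + 900 = 2400
```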
Indicator Score
The Tweets selected in the previous step (see Sect. 2.2) were classified by a Bayesian classifier previously developed by Epidemico, Inc. (now part of Booz Allen Hamilton) for mining AE discussions in social media data [8], based on Robinson’s method for filtering e-mail spam [9]. The classifier has been trained to identify vernacular language that may describe a suspected ADR or resemble an AE (sometimes referred to as a “Proto-AE”) and calculates an indicator score with values from 0.0 to 1.0. The score indicates the probability that a social media post contains at least one AE (0: low probability, 1: high probability). A penalty of 0.2 is deducted from the indicator score if the post does not contain any identifiable symptom [8].
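The following highly simplified sketch shows how a Robinson-style score with a no-symptom penalty could be computed; the token probabilities and symptom lexicon are placeholders and do not represent the trained Epidemico classifier.

```python
# Highly simplified sketch of an indicator-score calculation. The real classifier [8] is a
# trained Bayesian model based on Robinson's spam-filtering method [9]; the token
# probabilities and symptom lexicon below are placeholders.
import math

token_ae_probability = {"dizzy": 0.9, "headache": 0.85, "price": 0.1, "buy": 0.05}
symptom_lexicon = {"dizzy", "headache", "nausea"}
PENALTY_NO_SYMPTOM = 0.2  # penalty described for the published classifier [8]

def indicator_score(text: str) -> float:
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    probs = [token_ae_probability.get(t, 0.5) for t in tokens]  # 0.5 = uninformative token
    # Robinson-style combination of per-token probabilities via geometric means.
    p = 1.0 - math.prod(1.0 - x for x in probs) ** (1.0 / len(probs))
    q = 1.0 - math.prod(probs) ** (1.0 / len(probs))
    score = (1.0 + (p - q) / (p + q)) / 2.0
    if not symptom_lexicon.intersection(tokens):
        score = max(0.0, score - PENALTY_NO_SYMPTOM)  # no identifiable symptom
    return round(score, 3)

print(indicator_score("so dizzy after my new pill"))  # higher score, no penalty applied
print(indicator_score("price drop announced today"))  # lower score, penalty applied
```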
To avoid biasing the manual annotation of the Tweets (see Sect. 2.4), the indicator score was not shown to the annotators and was not used to define the order in which the Tweets entered the annotation process. However, the indicator score was used to define the route a Tweet took through the annotation and quality assurance processes, as described in Sects. 2.4.4 and 2.4.5.
Annotation
Setting up the Annotation Environment
To facilitate human review and annotation of the Twitter data, a graphical user interface was developed (Insight Explorer) [10]. Two separate environments were set up, each with a copy of the 57,473 Tweets, to allow two teams to annotate the Tweets independently and in parallel.
Annotation Guideline, Teams and Training
Before the annotation of Tweets started, an annotation guideline was developed that included guidance on how to distinguish between “AE Tweets” and “Non-AE Tweets” and how to extract and code medicinal products and AEs. Two independent teams of annotators were created. Each team (nine people per team) worked in one of the annotation environments and could not see the annotations made by members of the other team. The members of both teams were pharmacovigilance experts with experience in processing individual case safety reports, including coding of medicinal products and AEs.
Each annotator was trained on the annotation guideline and on the use of the annotation tool (Insight Explorer). Weekly meetings were held to support the annotators with post-training questions regarding the annotation tool, the annotation guideline, or Tweets containing inconclusive or ambiguous content.
Essentials of the Annotation Guideline
Each Tweet was evaluated independently. Therefore, other Tweets from the same user, related Tweets from other users (re-Tweets or replies) and information outside the Twitter dataset referenced by hyperlinks within the Tweets were not considered for annotation.
Tweets with at least one DOI and at least one AE reported as a personal experience associated with the reported DOI(s) were classified as “AE Tweets”. In those Tweets, all identifiable DOIs and AEs were extracted and mapped to standard dictionary terms, i.e. the product name as reported and the International Nonproprietary Name for products, and MedDRA PTs for AEs and indications. Furthermore, details about product-event combinations and product-indication combinations, e.g. causal attribution, were evaluated. If a Tweet contained multiple AEs, it was assumed that the AEs occurred over the same period unless the Tweet contained usable information to the contrary.
Tweets containing at least one DOI but no AE, or a DOI with no AE reported as a personal experience, were classified as “Non-AE Tweets”. Tweets without any DOI were also classified as Non-AE Tweets. For Non-AE Tweets, the DOIs, non-DOI products, AEs and indications were not annotated or mapped to standard dictionary terms.
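For illustration only, the classification rule can be summarised as follows; the boolean inputs stand in for the annotators’ human judgement and are assumptions of this sketch.

```python
# Minimal encoding of the classification rule described above; the inputs represent the
# annotators' assessment of a Tweet, not an automated judgement.
def classify_tweet(contains_doi: bool, personal_ae_linked_to_doi: bool) -> str:
    if contains_doi and personal_ae_linked_to_doi:
        return "AE Tweet"
    return "Non-AE Tweet"

print(classify_tweet(True, True))    # AE Tweet
print(classify_tweet(True, False))   # Non-AE Tweet (DOI mentioned, no personal AE)
print(classify_tweet(False, False))  # Non-AE Tweet (no DOI)
```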
Please note: Due to Twitter’s policy, we are not allowed to publish the complete Tweet contents. Therefore, for demonstration purposes in this article, original substance names were substituted with “<substance name>”. The completely evaluated benchmark reference dataset is available in ESM 1 of the online version of this article, but without the Tweets’ content. Please use the Twitter ID and available programmes (see the link listed in ESM 1) to access the Tweets’ content.
Example of an “AE Tweet” and its annotation:
“my doc wanted to give me <substance name 1>. I said no because I knew I would like it too much. Tried <substance name 2> but I was sleepwalking/amnesia”
In this example, only <substance name 2> was identified as a DOI and, therefore, only data for this substance were subsequently annotated. Of note, even if <substance name 1> had been a DOI, the Tweet does not contain a personal experience of an AE associated with <substance name 1> and hence no product-event combination would have been annotated for it.
Annotation result:
Classification: AE Tweet
Product(s) as reported: <substance name 2>
Product coded (International Nonproprietary Name): <substance name 2 coded>
Event(s) as reported: amnesia; sleepwalking
Event(s) coded (PT): Amnesia; Somnambulism
Product event(s): <substance name 2>: Amnesia; <substance name 2>: Somnambulism
Indication(s) as reported:
Indication(s) coded (PT):
Product indication(s):
Please note: In this example, no indication is reported. Therefore, those fields are left blank.
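For illustration, the annotation fields shown above could be represented as a simple record; the class below mirrors the example but is not part of the Insight Explorer data model.

```python
# Illustrative record mirroring the annotation fields shown above; hypothetical structure,
# not the actual Insight Explorer data model.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TweetAnnotation:
    tweet_id: str
    classification: str                                                 # "AE Tweet" or "Non-AE Tweet"
    products_as_reported: List[str] = field(default_factory=list)
    products_coded_inn: List[str] = field(default_factory=list)         # International Nonproprietary Names
    events_as_reported: List[str] = field(default_factory=list)
    events_coded_pt: List[str] = field(default_factory=list)            # MedDRA PTs
    product_events: List[Tuple[str, str]] = field(default_factory=list)
    indications_as_reported: List[str] = field(default_factory=list)
    indications_coded_pt: List[str] = field(default_factory=list)
    product_indications: List[Tuple[str, str]] = field(default_factory=list)

example = TweetAnnotation(
    tweet_id="<tweet id>",
    classification="AE Tweet",
    products_as_reported=["<substance name 2>"],
    products_coded_inn=["<substance name 2 coded>"],
    events_as_reported=["amnesia", "sleepwalking"],
    events_coded_pt=["Amnesia", "Somnambulism"],
    product_events=[("<substance name 2>", "Amnesia"), ("<substance name 2>", "Somnambulism")],
)
```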
Two typical examples of “Non-AE Tweets”:
“<substance name> is a pill that works through the bloodstream to target and attack the infection at its source underneath the nail.”
“<substance name>, which was priced at Rs 2.28 lakh per month is now available for Rs 6,600”
Annotation Process
The annotation process is outlined in Fig. 2. The two different Insight Explorer database instances are labelled as “IE#1” and “IE#2”. The indicator scores of the Tweets were not displayed to the annotation teams to avoid biasing their manual annotation.
The original goal was that all 57,473 Tweets would be reviewed manually by the two annotation teams. However, the annotation took longer than anticipated and would not have been completed within the timeline defined by the WEB-RADR project. Hence, an exploratory analysis was performed to investigate the potential for automated classification of “Non-AE Tweets”.
At the time of the exploratory analysis, 15,195 Tweets had been reviewed and, within those, 91 “AE Tweets” had been identified. For Tweets with an indicator score below 0.3, only five AE Tweets were found compared with 5982 Non-AE Tweets. Based on this finding, it was determined that Tweets with an indicator score below 0.3 could be considered Non-AE Tweets and be excluded from manual annotation without significant loss of precision and recall. Applying this filter to the entire dataset of 57,473 Tweets resulted in the classification of 24,311 Tweets as Non-AE Tweets, leaving 33,162 Tweets for manual human curation.
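For illustration, the pre-filter can be expressed as a simple threshold rule; the score data below are placeholders.

```python
# Sketch of the pre-filter described above: Tweets with an indicator score below 0.3 are
# classified as Non-AE Tweets without manual review. Placeholder (tweet_id, score) pairs.
SCORE_CUTOFF = 0.3

scored_tweets = [("123", 0.12), ("456", 0.45), ("789", 0.81)]  # placeholder data

auto_non_ae = [tid for tid, score in scored_tweets if score < SCORE_CUTOFF]
for_manual_review = [tid for tid, score in scored_tweets if score >= SCORE_CUTOFF]
# In the study: 24,311 auto-classified Non-AE Tweets, 33,162 Tweets for manual curation.
```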
The 33,162 Tweets with an indicator score ≥ 0.3 were manually curated by the two independent annotation teams. Both annotation teams agreed on the classification of 31,340 Non-AE Tweets and 507 AE Tweets (see Fig. 2). For 1315 Tweets, the classification by the two annotation teams differed, illustrating the difficulty of interpreting the content of Tweets (see Sect. 4 for details).
Tweets with indicator scores between 0.3 and 0.7 and classified by both teams as Non-AE Tweets were not processed any further (n = 30,303). For the remaining Tweets with an indicator score ≥ 0.3 (n = 2859), a 100% quality control was performed by a team of experienced MedDRA coders to propose the annotations for the benchmark reference dataset. As shown in Fig. 2, these 2859 Tweets comprised the concordantly classified Non-AE Tweets with an indicator score ≥ 0.7 (n = 1037), the discordantly classified Tweets with an indicator score ≥ 0.3 (n = 1315) and the concordantly classified AE Tweets (n = 507). This quality control process resulted in the identification of 991 AE Tweets. Finally, two quality assurance measures were performed to make final refinements to the benchmark reference dataset.
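For illustration, the routing of manually curated Tweets into the quality-control step can be summarised as follows; the inputs are placeholders for the two teams’ classifications and the indicator score.

```python
# Sketch of the quality-control routing described above, for Tweets with score >= 0.3.
def route_for_qc(score: float, team1: str, team2: str) -> str:
    concordant_non_ae = team1 == team2 == "Non-AE Tweet"
    if concordant_non_ae and score < 0.7:
        return "no further processing"   # n = 30,303 in the study
    return "100% quality control"        # n = 2,859 (1,037 + 1,315 + 507)

print(route_for_qc(0.45, "Non-AE Tweet", "Non-AE Tweet"))  # no further processing
print(route_for_qc(0.85, "Non-AE Tweet", "AE Tweet"))      # 100% quality control
```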
Quality Assurance
Two quality assurance measures were defined and performed to achieve the best possible quality of the benchmark reference dataset within the constraints of this project.
Quality Assurance #1 A total of 600 randomly selected Tweets (300 AE Tweets and 300 Non-AE Tweets) with an indicator score ≥ 0.3 were independently evaluated by a team not involved in the prior annotation process (see Quality Assurance #1 in Fig. 2). Among the 300 AE Tweets, a total of 46 Tweets with issues were found: non-DOI products were wrongly identified as DOIs (n = 14); AEs were coded to the wrong PT (n = 25); one Tweet was wrongly identified as an AE Tweet (n = 1); and misspellings were identified (n = 6). Among the 300 Non-AE Tweets, eight were found to contain AEs (i.e. they were AE Tweets) [2.7%].
The identified issues were resolved in the benchmark reference dataset.
Quality Assurance #2 Tweets were sorted by descending indicator score, a total of 1200 Non-AE Tweets were assigned to batches of 100 each (n = 12 batches), and a 100% Tweet content check was performed to identify potentially missed AE Tweets (see Quality Assurance #2 in Fig. 2). Among all Tweets (both AE Tweets and Non-AE Tweets) within the indicator-score range covered by each batch, the proportion of missed AE Tweets (annotated as Non-AE Tweets but identified as AE Tweets in the second quality assurance) was computed. This proportion varied between 0.9% [batch 12: one missed AE Tweet/(100 Non-AE Tweets + 15 AE Tweets)] and 6.7% [batch 2: 12 missed AE Tweets/(100 Non-AE Tweets + 80 AE Tweets)]. In this quality assurance step, a total of 58 additional AE Tweets were identified and annotated, and the benchmark reference dataset was updated accordingly.
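For illustration, the per-batch calculation can be reproduced as follows; the two figures correspond to the batches reported above.

```python
# Per-batch calculation used in Quality Assurance #2: proportion of missed AE Tweets among
# all Tweets within the indicator-score range covered by a batch.
def missed_ae_proportion(missed_ae: int, non_ae_in_batch: int, ae_in_range: int) -> float:
    return missed_ae / (non_ae_in_batch + ae_in_range)

print(f"batch 12: {missed_ae_proportion(1, 100, 15):.1%}")   # 0.9%
print(f"batch  2: {missed_ae_proportion(12, 100, 80):.1%}")  # 6.7%
```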