1 Introduction

Issues related to the safe use of medicines have attracted tremendous attention over recent decades. Pharmaceutical products are used in or on the human body for the prevention, diagnosis or treatment of disease, or for the modification of physiological function [1]. Modern drugs have changed the way in which diseases are managed and controlled. However, adverse reactions to medicines remain a common, yet often preventable, cause of illness, disability and even death [2, 3]. Studies have shown that adverse drug reactions (ADRs) are probably responsible for millions of deaths globally each year (in 2008, 197,000 ADR-related deaths were reported in the EU alone, according to official EC statistics), in both in- and outpatient settings [4]. In the UK, 6.5% of hospital admissions are due to ADRs, and almost 15% of patients experience an ADR during their admission [5]. In France, the estimated annual number of ADR-related hospitalisations was 144,000 in 2007 [6]. A recent study in Spain estimated that the incidence of ADR-related hospitalisations was 7.11%, with fatal ADRs amounting to 1.97% [7]. ADR occurrence in an outpatient setting cannot be fully estimated, as such studies are currently scarce [3]. The overall impact of ADRs is high, accounting for considerable morbidity, mortality, prolonged hospital stays and extra costs. Although many of the suspected drugs have proven benefit, measures need to be taken to reduce the burden of ADRs and thereby further improve the benefit-to-harm ratio of the drugs [8].

In addition to advances in technological capabilities, today’s biggest trend is data. New data are created in novel ways and processed and analysed with the help of new and increasingly intelligent methods. In this context, safety monitoring is expanding its evidence base, moving beyond traditional approaches towards sophisticated methods that can identify possible safety signals from multiple information sources, both structured and unstructured [9]. Health-related information increasingly shared online by patients represents a potentially valuable, yet currently largely unexploited source of post-market safety data that could supplement data from traditional sources of drug safety information.

With the use of social media data for pharmacovigilance still in its infancy, the present paper explores the state of the art in the field, highlighting important research efforts and achievements and discussing current research challenges and the way forward. Its main aim is to provide a comprehensive and up-to-date review of existing research on the application of social data to pharmacovigilance, to raise awareness and systematise the knowledge around social data applications to ADR detection, and to offer new insights and recommendations for future research and practice. In this context, the objectives (and intended contributions) of our work are as follows:

  1. To conduct a comprehensive literature review of the applications of social media to pharmacovigilance, with critical analysis and comparative assessment of the relevant body of literature

  2. To derive a classification of social media sources for use in pharmacovigilance

  3. To develop classifications and taxonomies for social data use in pharmacovigilance

  4. To derive new insights, key challenges and recommendations for future research and practice

The paper is structured as follows.

The present section (Sect. 1) outlines the plan and scope of our literature review and describes the methodology that the review follows; presents a background on pharmacovigilance by reviewing the basic literature in this field and sets out the purpose and rationale of the present study; and, finally, summarises preliminary knowledge on social networking sites (SNS), explores the relevance of SNS data to pharmacovigilance and provides definitions and analyses of the basic concepts in this research area.

Section 2 provides an overview of the applications of social data to adverse drug reaction detection and their potential and presents a new taxonomy of social data sources based on the conducted literature review and a set of key challenges for the future.

Section 3 discusses the identified key challenges, each in separate subsections.

Section 4 draws the conclusions of this research, by summarising its contributions and discussing the gained insights in terms of potential for practical applications and future research directions.

1.1 Methodology

In recent years, there has been an increasing number of studies linking social data with ADR detection. To offer a broad overview of this emerging research domain, a review of academic literature was undertaken to examine relevant publications in the MEDLINE/PubMed database. The study methodology applied builds on the PRISMA methodology for systematic reviews. Firstly, the scope of this review was appropriately defined, according to the following specification:

Scope of literature review:

  • Literature sources: all corpora included in the MEDLINE/PubMed database

  • Time frame: 2007 to early 2018 (covering all eligible literature in the last decade)

  • Geographic coverage: all inclusive

  • Literature selection: the literature search (covering a time window of the last 10 years) used two groups of keywords. The first group included the following terms as approximate synonyms for social data: social media, social networking, forum, Twitter, Facebook, search log and social data. The second group referred to ADR detection and included the terms: adverse drug reaction, side effect and pharmacovigilance. Thus, the literature search query for article selection had the logical form of:

    Social Data, or equiv. AND Adverse Drug Reaction detection, or equiv.
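The logical form above can be written out as a Boolean search string. The following is a minimal sketch, assuming only the two keyword groups listed in the scope; the function name `build_query` and the quoting of multi-word terms are illustrative assumptions, not the exact query syntax that was submitted to MEDLINE/PubMed.

```python
# Keyword groups taken from the review scope above.
SOCIAL_DATA_TERMS = [
    "social media", "social networking", "forum", "Twitter",
    "Facebook", "search log", "social data",
]
ADR_TERMS = ["adverse drug reaction", "side effect", "pharmacovigilance"]


def build_query(group_a, group_b):
    """Join each group with OR, then combine the two groups with AND."""
    def or_group(terms):
        # Quote every term so that multi-word phrases are kept together.
        return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"
    return f"{or_group(group_a)} AND {or_group(group_b)}"


query = build_query(SOCIAL_DATA_TERMS, ADR_TERMS)
print(query)
```

An article matches such a query only if it contains at least one term from each group, which is why a broad first group yields a large initial result set.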

Fig. 1 Flow chart for article review and selection process

The initial search returned a total of 1374 articles, owing to the relatively large set of keywords used. All search results were subsequently scanned, based on the paper title and abstract, to determine whether the respective article should be included in the present literature review. Documents were excluded if they met one or more of the following criteria: (1) irrelevant, (2) not written in English, (3) not a primary or secondary research paper. This first screening resulted in a collection of 186 articles, broadly covering issues related to information technology-enabled post-marketing medicine safety monitoring (traditional methods, consumer behaviour, etc.). Following further screening, 101 documents were excluded as they were not directly linked to the specific topic of the present study. As a result of this process, 85 articles covering topics relevant to the application of social data to pharmacovigilance were selected for inclusion in the study corpus. Additional articles were identified by scanning the reference lists of selected important articles, resulting in a final collection of 100 articles. The flow chart for the article review and selection process is illustrated in Fig. 1.
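The counts reported in this paragraph can be checked with simple arithmetic. The short sketch below only restates the figures given in the text; the values marked as derived (the number excluded at the first screening and the number added from reference lists) are computed here and are not quoted from the text.

```python
# Counts reported in the selection process described above.
initial_hits = 1374          # records returned by the database search
after_first_screening = 186  # retained after title/abstract screening
excluded_second_pass = 101   # not directly linked to the topic
final_corpus = 100           # final number of articles reviewed

# Derived values (computed here, not quoted from the text).
excluded_first_pass = initial_hits - after_first_screening
retained_from_search = after_first_screening - excluded_second_pass
added_from_references = final_corpus - retained_from_search

print(excluded_first_pass, retained_from_search, added_from_references)
# prints: 1188 85 15
```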

The most significant insights drawn are outlined in the following sections. Section 2 provides an overview of the application of social data in pharmacovigilance, while Sect. 3 summarises the current challenges and examines the way forward.

1.2 Pharmacovigilance background

Post-market surveillance of health and drug products is of paramount importance for the pharmaceutical stakeholders (industry and regulators), since many adverse events are not captured in randomised clinical trials (RCTs) and previously undetected adverse reactions may occur as the drug is exposed to patients and situations not controlled for during the clinical trial. The practice of monitoring the safety of medicines is commonly referred to as pharmacovigilance, the origins of which can be traced back to the case of thalidomide [10, 11], which highlighted the importance of drug safety and prompted the start of systematic approaches to monitor the safety of marketed medications [1]. Pharmacovigilance is defined by the World Health Organization (WHO) [1] as the science and activities relating to the detection, assessment, understanding and prevention of adverse effects, particularly long-term and short-term side effects, of medicines. The practice of pharmacovigilance is sometimes called post-marketing (or post-market) surveillance or post-authorisation monitoring. Within the scope of pharmacovigilance fall the detection, assessment, understanding and prevention of adverse effects or any other possible medication-related problems of herbal, traditional and complementary medicines, blood products, biologicals, medical devices and vaccines.

The term adverse event (AE) is used to refer to any untoward medical occurrence that may appear during treatment with a pharmaceutical product but which does not necessarily have a causal relationship with the treatment [12, 13]. The WHO [1] describes adverse drug reaction (ADR) as a response to a drug which is noxious and unintended and which occurs at doses normally used in man for the prophylaxis, diagnosis or therapy of disease, or for the modification of physiological function, a definition that denotes the existence of a causal relationship between the drug therapy and the observed adverse event. The timely signalling of adverse drug effects is required, in order to promote the safety and quality of drug therapies.

According to the WHO [14], the major tasks of pharmacovigilance are:

  • Early detection of unknown adverse reactions and interactions;

  • Detection of increases in frequency of known adverse reactions;

  • Identification of risk factors and possible mechanisms;

  • Estimation of quantitative aspects of benefit/risk analysis and dissemination of information needed to improve medicine prescribing and regulation.

Post-market safety surveillance relies mostly on data from spontaneous reports of adverse events, medical literature and observational databases. Limitations of these data sources include potential under-reporting, lack of geographic diversity, possibility of patients’ perspectives being filtered through healthcare professionals and regulatory agencies, and time difference between event occurrence and discovery [15, 62]. The need to enhance patient safety calls for a proactive approach to pharmacovigilance, in order to improve patient care and safety in relation to the use of medicines.

Currently, the rapidly increasing supply of information combined with the growing knowledge elicitation capabilities of trending and emerging technologies present epidemiology, pharmacoepidemiology and pharmacovigilance with enormous opportunities [16,17,18]. Recently, the evidence base of safety monitoring has been expanding, moving beyond traditional approaches towards sophisticated methods that can identify potential safety signals from multiple information sources, both structured and unstructured. This refers to the exploitation of secondary data, i.e. data made available and/or collected for other purposes, namely electronic health records (EHR), social media data, etc. [16, 19, 155, 156]. In the broad context of pharmacotherapy, patient perspectives have always been an essential component of medicine safety monitoring. As efforts directed to patient-centric drug development are intensifying, it becomes increasingly important to incorporate the patients’ voice in the pharmacovigilance systems and processes. The Internet has changed our relationship with health care, as people are increasingly sharing online their healthcare experiences [20]. Health stakeholders are adapting to this trend, together with the involved sections of the computer science research community. Mining social media for the extraction of health-related information emerges as a hot topic, particularly in certain areas of health care, for example with regard to health concerns like mental illness [163]. De Choudhury and De [21] studied mental illness communities on Reddit (http://www.reddit.com/), a forum-like platform hosting numerous virtual, text-based support groups in which patients openly discuss a variety of concerns related to their condition, benefiting from the dissociative anonymity of the environment.

Presently, there is a growing interest by drug safety stakeholders (pharmaceutical companies and regulators) in exploring the use of social media (social listening) to supplement established approaches for pharmacovigilance, by harvesting information on patients’ experiences after exposure to pharmaceutical products. Health information posted online by patients is in abundance and is often publicly available, thus representing an untapped source of post-market safety data that could supplement data from existing sources of medicine safety information [22, 156]. For example, the study by Gage-Bouchard et al. [23] on cancer information exchanged on personal Facebook Pages revealed that this information predominantly related to treatment protocols and health services use (35%), followed by information related to side effects and late effects (26%) and medication (16%).

1.3 Social networking sites and their relevance to pharmacovigilance

Social networking sites (SNS) and applications allow for the exchange of user-generated content whereby people talk/communicate, share information, network and participate in community activities. Boyd and Ellison [24] describe SNS as web-based services that allow individuals to (1) construct a public or semi-public profile within a bounded system, (2) articulate a list of other users with whom they share a connection and (3) view and traverse their list of connections and those made by others within the system. The nature and nomenclature of these connections may vary from site to site.

More and more individuals are making use of SNS to communicate and stay in contact with family and friends, to engage in professional networking or to connect around shared interests and ideas [25]. There currently exists a rich and diverse ecology of SNS, which vary in terms of their scope and functionality, and include: general-purpose and specialised community sites (e.g. Facebook and LinkedIn); media sharing sites (e.g. YouTube and Flickr); weblogs (blogs); micro-blogging sites (e.g. Twitter); and question/answer discussion forums, which have continued to be around for decades with undiminished popularity despite relentless Internet evolution. Social media user base has undergone a nearly tenfold increase in the past decade: 65% of adults now use social networking sites [26]. Since 2005, SNS have experienced significant growth in active users, with Facebook and LinkedIn in particular being among the fastest growers. Micro-blogging services such as Twitter and Tumblr are also on a growing trajectory. As a result, social media is creating real-world data at an unprecedented rate, with people using social media to discuss their everyday lives, including their health and their illnesses. The motivation to connect and learn about one another has given rise to niche SNS. Recent years have seen the emergence and proliferation of SNS dedicated to healthcare communities (usually consisting of health professionals and/or consumers/patients), which have become particularly popular among patients, with the most common intended use being self-care [27], i.e. social media serving as a platform that allows patients to exchange information about their health condition with others who are battling with the same health issues, and receive peer-to-peer support (online patient communities) [28, 29]. Social support is deemed extremely beneficial in combating health concerns like depression and mental illness [21, 33]. 
Such networks can be classified mainly in terms of two categories:

  (A) Generic SNS. This category can include:

    • Big public platform SNS, such as Facebook, Twitter, Flickr and Tumblr, which host a plethora of health-related communities/groups, and also contain big volumes of posts by individual users related to health issues.

  (B) Specialised healthcare social networks and forums. This category can include:

In the above SNS, users tend to share their views with others facing similar problems/conditions or health outcomes, and this makes such social networks unique and robust sources of information about drugs, health effects and treatments, which can significantly augment the evidence base of research studies and provide additional insight on the needs of specific populations [30]. CureTogether specifically promotes patient-driven research, by establishing research partnerships with universities, research organisations and self-experimenters; SNS can further promote medication adherence, enhance the effectiveness of therapies and contribute to secondary prevention against recurrence of disease [31] and chronic pain management [32]. In the case of mental illness, SNS can serve for the identification of signals of mental disorders and users at risk of self-harm [33]. The rapidly growing popularity of such networks and the abundance of data available through them have recently enabled new research on public health monitoring, including ADR monitoring and formal clinical trial procedures [34, 35]. Cohort discovery and metadata platforms have also emerged (e.g. the Dementias Platform UK, https://www.dementiasplatform.uk). The term social data refers to data derived from social networks. The context of disclosure might be a conversation with friends or in dedicated groups on Facebook or Twitter (where dedicated discussion threads have emerged via hashtags, e.g. #LCSM denoting lung cancer social media posts), or discussions with other patients on online social networks like PatientsLikeMe, where they will share diagnoses, treatments, coping mechanisms and outcomes with one another [29, 36]. Researchers have considered a range of motivations for disclosure in social network sites. 
The fact that users publish in SNS a considerable amount of information which is not otherwise available makes social data one of the most important potential sources of knowledge in the pharmacovigilance field. Various new types of diverse data are disclosed via SNS. As noted by the OECD [37], personal data are collected online in different, arguably complementary, ways: (i) data can be voluntarily and explicitly shared by a consumer (e.g. when subscribing to a social network); (ii) data can be observed or recorded (e.g. through cookies monitoring access to a website), with or without the consumer’s knowledge or explicit consent; (iii) data can be inferred, including by combining several sources of data that are, by themselves, anonymous.

Based on the works of Van Alsenoy [38] and Schneier [39], social (networking) data can be categorised according to the context and purposes of data disclosure as:

  1. Service data: data that users need to give to an SNS in order to use it. This might include the person’s legal name, age, credit card number, etc.

  2. Disclosed data: data that are posted by SNS users on their own pages (e.g. blog entries, photographs, videos, messages, comments, etc.).

  3. Entrusted data: data that are posted by SNS users on the profile pages of other SNS users (e.g. a wall post, comment). Similarly to disclosed data, entrusted data appear on the users’ own pages, but they do not have control over the data; someone else does.

  4. Incidental data: data about an SNS user which have been uploaded by another SNS user (e.g. a picture). Similarly to disclosed and entrusted data, incidental data appear on the users’ own pages, but they do not have control over them, and they did not create them in the first place.

  5. Derived data: data that are inferred from (other) SNS data (e.g. membership of group X implies attribute Y).

  6. Behavioural data: data regarding the activities of SNS users within the SNS (e.g. user habits, who they interact with and how).
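For readers who wish to label harvested records by category, the six categories above map naturally onto an enumeration. This is a minimal sketch; the class name `SocialDataCategory` and the helper `is_user_authored` are hypothetical and introduced only for illustration. The helper reflects one reading of the list above, namely that service, disclosed and entrusted data are supplied by the user, whereas incidental, derived and behavioural data are not.

```python
from enum import Enum


class SocialDataCategory(Enum):
    """Six categories of SNS data, following the list above."""
    SERVICE = "given to the SNS in order to use it"
    DISCLOSED = "posted by users on their own pages"
    ENTRUSTED = "posted by users on other users' pages"
    INCIDENTAL = "uploaded about a user by another user"
    DERIVED = "inferred from other SNS data"
    BEHAVIOURAL = "records of user activity within the SNS"


def is_user_authored(category):
    """True when the user supplied the data themselves (an assumption)."""
    return category in {
        SocialDataCategory.SERVICE,
        SocialDataCategory.DISCLOSED,
        SocialDataCategory.ENTRUSTED,
    }


for category in SocialDataCategory:
    print(category.name, is_user_authored(category))
```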

A further classification can also be made between these two types of data:

  • Collected (back-office) data: Data collected by the service provider, which usually include profile and network data explicitly provided by the user, and click history implicitly provided. Users can assume that everything they do and upload in the browser tab of the service is collected, if the privacy policy of the service does not state it otherwise.

  • Front-end data: Data that are knowingly shared. This includes: (i) Public/disclosed data. Data that are published openly, such as complete name or e-mail address. These can be useful for users trying to contact other users. (ii) Social data. Data that are openly shared with the user’s trusted contacts. Contacts outside the user’s circle of trust cannot access them.

Inevitably and quite understandably, SNS data have come to attract the interest of a wide range of actors. SNS allow for the collection of a large amount of information from different sources. Datasets can be continuously monitored in order to identify the emerging trends in the flows of data. This capability is revolutionary and differs from the traditional sampling method, which is based on the extraction of a representative sample from the total statistical population.

Aligned with this trend, the social media participation model for health organisations is evolving along three dimensions: listening, participating and engaging [40]. With its recognised use as a source of information [41], social media has been used by healthcare stakeholders, in order to distribute information about diseases and their treatment, medicines and announcements [42]. Currently, the potential of SNS as a source of insight is increasingly recognised. Social media mining is becoming an integral part of public health monitoring and surveillance [43], assisted by advances in automated data processing, machine learning and natural language processing (NLP) technologies. Applications include epidemiological investigations, e.g. mining Twitter data for disease topic detection and surveillance [15, 44, 45, 159], and tracking the spread of infectious diseases. However, scholars note that the analysis of health social media content requires further innovations [46]. Social media presents new channels and methods that can enable pharmacovigilance to move away from traditional safety reporting methods towards more patient-centric models for reporting, analysing and monitoring of safety data. Essentially, in terms of safety research, social platforms allow for both information pull (social media listening) and targeted investigations (direct-to-patient research). Social media activities for pharmacovigilance by pharmaceutical companies fall into three broad categories: listening (safety data reporting), engaging (follow-up) and broadcasting (risk communication)—each with varying degrees of complexity, associated issues and requirements [158, 161, 162].

Traditionally, pharmacovigilance has mainly relied on post-marketing spontaneous reporting systems (SRSs) [47], such as the EudraVigilance system operated by the European Medicines Agency (EMA) and the Adverse Event Reporting System (AERS) used by the US Food and Drug Administration (FDA), collecting voluntary reports produced by healthcare professionals, marketing authorisation holders (MAHs) and consumers. However, the reporting rate of such systems is low, causing delays in the detection of ADRs. Basch [48] notes that many adverse reactions are missed due to lack of interest, willingness, availability or awareness of stakeholders to report. Strom [49] considers the system’s under-ascertainment (not recognising that an event is due to a drug), over-ascertainment (erroneously ascribing an adverse event to a drug) and under-reporting to be its major flaws.

Patient reports have been shown to be of high importance for pharmacovigilance, providing a complementary [50] and independent perspective from those of health professionals [51]. According to Santos [52], the combination of reports from healthcare professionals with first-hand information from patients is of great added value because it increases the chances to identify new safety issues.

With the increasing use of social media and social networks, social data are increasingly recognised as a valid source of real-time information on drug-related adverse events [53], including the assessment of the behaviour and risk perception of consumers [54]. In recent years, many scholars have investigated the availability of adverse event information in social media and appropriate technologies and methods to extract it. Statistical analysis shows that there is value in pharmaceutical companies and regulatory authorities taking a more proactive approach to social media monitoring [55]. Proactive monitoring could provide early warning of new adverse events or clinical information that helps guide drug development and avoid preventable litigation. Regulators and pharmaceutical companies are also starting to monitor social media posts for potential ADR signals. While previously much of the interest in social media has been on the marketing front, their potential application to improve drug safety and pharmacovigilance is also recognised, particularly for the identification of signals of unknown ADRs and unknown drug–drug interactions (DDIs) of concomitant medications, which are often linked to unexpected ADRs [56]. As concluded by Sloane et al. [57], the quantity and near-instantaneous nature of social media provide potential opportunities for real-time monitoring of ADRs, greater capture of ADR reports and expedited signal detection if utilised correctly. According to Liu and Chen [71], an advantage of patient social media is that they cover a large and diverse population and contain millions of unsolicited and uncensored discussions about medications. Furthermore, patient reports of adverse events in social media are more sensitive to underlying changes in the patients’ physiological state than clinical and spontaneous reports. Today, social media are a formal part of the potential sources of data of interest to pharmacovigilance. 
Most of the regulatory guidance, and hence most pharmacovigilance activity, involving social media and the Internet focuses primarily on the screening of social media sites and the follow-up of reported safety data.

2 An overview of applications of social data in pharmacovigilance

Nowadays, the pharmaceutical industry is engaging more actively with patients on social media for the collection of direct-from-patient information. Pharmacovigilance is increasingly drawing upon different types of data sources in both solicited and unsolicited ways. Figure 2 depicts the principal sources of social data employed presently for ADR detection—these social data sources are the focal points of the present review. The proposed taxonomy is derived from our literature review, by analysing the existing work on social data applications to ADR detection, in terms of category of SNS and nature of reporting. The specialised healthcare SNS category can be further classified into three types of specialised healthcare SNS/forums: health-centred SNS (generic networking sites on general health topics, usually requiring user profiles), disease-specific online health forums/SNS (focused on specific diseases) and medicine-focused sharing platforms (or patient forums), as explained in Sect. 1.3.

Two types of reporting can be distinguished—namely solicited and unsolicited—which can be further analysed in terms of the context and purposes of data disclosure and the area where data are captured according to the classifications proposed in Sect. 1.3.

Type 1: solicited reporting (use of social media as a reporting channel)

Previous research overwhelmingly suggests that further promotion of patient reporting to the SRSs is justified. Direct patient reporting of suspected ADRs has the potential to add value to pharmacovigilance [51, 165]. A study by Avery et al. [59] concluded that patient reporting can (a) contribute types of drugs and reactions different from those reported by healthcare professionals, thus generating new potential signals, and (b) describe suspected ADRs in more detail, thus providing useful information on likely causality and impact on patients’ lives.

Fig. 2 Taxonomy of social data sources for pharmacovigilance

The use of new technology can provide new methods and tools to facilitate direct patient reporting of ADR. In this direction, the WEB-RADR project [60] promotes the utilisation of social media and other technologies for ADR reporting in a convenient, quick and efficient way, also seeking to establish guidelines and a regulatory framework on the use of the technology for such reporting.

Type 2: unsolicited reporting (social media monitoring)

Social media data are increasingly recognised as a valid source of patient perspectives and data on adverse events (AEs) in pharmacovigilance [61]. This information is in abundance and is timely, relevant and often publicly available. Social media thus have the potential to become a new-age tool for monitoring data regarding patients’ experience with medications in real time, providing early indications of potential safety issues that require further investigation. A typical methodology for signal detection using (in this case, passively collected) social media data includes the following steps, as proposed by Powell et al. [62]:

  • Data harvesting: collection of raw data;

  • Translation: standardisation of drug names and vernacular symptom/event descriptions;

  • Filtering: identification of relevant informative posts and data cleaning (removal of duplicates and noise);

  • De-identification: removal of personally identifying information;

  • Supplementation: addition of other data sources to facilitate the review process and contextualise the results, in order to assist interpretation (e.g. product label, sales data).
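The steps listed above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions and not the implementation of Powell et al. [62]: the lookup tables, the function names and the naive substring matching are all hypothetical, and the harvesting step is represented simply by a list of already collected posts.

```python
import re

# Hypothetical lookup tables; a real system would use curated vocabularies.
DRUG_SYNONYMS = {"tylenol": "paracetamol", "acetaminophen": "paracetamol"}
EVENT_SYNONYMS = {"threw up": "vomiting", "felt dizzy": "dizziness"}
PRODUCT_LABEL = {"paracetamol": {"nausea"}}  # illustrative labelled events only


def translate(text):
    """Translation: standardise drug names and vernacular event terms."""
    text = text.lower()
    for table in (DRUG_SYNONYMS, EVENT_SYNONYMS):
        for informal, standard in table.items():
            text = text.replace(informal, standard)
    return text


def is_relevant(text):
    """Filtering: keep posts naming both a known drug and a known event."""
    drugs = set(DRUG_SYNONYMS.values())
    events = set(EVENT_SYNONYMS.values())
    return any(d in text for d in drugs) and any(e in text for e in events)


def de_identify(text):
    """De-identification: mask e-mail addresses and user handles."""
    text = re.sub(r"\S+@\S+\.\S+", "[EMAIL]", text)
    return re.sub(r"@\w+", "[USER]", text)


def supplement(text):
    """Supplementation: flag drug-event pairs absent from the product label."""
    drugs = [d for d in sorted(set(DRUG_SYNONYMS.values())) if d in text]
    events = [e for e in sorted(set(EVENT_SYNONYMS.values())) if e in text]
    unlabelled = [
        (d, e) for d in drugs for e in events
        if e not in PRODUCT_LABEL.get(d, set())
    ]
    return {"text": text, "unlabelled_pairs": unlabelled}


def run_pipeline(raw_posts):
    """Apply the steps in order to posts that have already been harvested."""
    seen, records = set(), []
    for post in raw_posts:
        text = translate(post)
        if not is_relevant(text) or text in seen:
            continue  # drop noise and exact duplicates
        seen.add(text)
        records.append(supplement(de_identify(text)))
    return records


posts = [
    "@anna I took Tylenol and threw up all night",
    "@anna I took Tylenol and threw up all night",
    "Lovely weather today",
]
print(run_pipeline(posts))
```

In this toy run the duplicate and the irrelevant post are dropped, the handle is masked, and the remaining post is flagged because vomiting is not in the illustrative label table. Such a flag is only a candidate for review, not evidence of causality.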

By mining the relationships between drugs and ADRs from data reported by online users on health-related issues, we can speed up the process of detecting and confirming ADRs. This is particularly important in the case of new treatments for rare diseases, which are typically tested only on small patient groups [63]. Social data mining, however, may have limited application to orphan drugs [64].

We shall now examine in more detail the three most promising sources of social data of use to pharmacovigilance, as identified above.

2.1 Specialised healthcare social networks and forums

An important source of user-generated content on the Internet is specialised healthcare social networks and forums. These platforms allow for the collection of health-related data focused on either a specific disease or multiple disease areas. User comments in health-related social networks contain extractable information relevant to pharmacovigilance. Research efforts focused on both general health discussion forums [65,66,67,68,69] and disease-specific discussion forums [46, 70, 71] have demonstrated that it is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content.

Recognising the potential of health-related social networks and forums, the FDA has engaged PatientsLikeMe in a research partnership to generate more AE data. PatientsLikeMe claims to have collected more than 110,000 adverse event reports on 1000 different medications, data that the FDA will now be able to access and analyse in addition to its existing sources of information [72].

2.2 Generic SNS

General-purpose social platforms (e.g. Twitter, Facebook) are also of value. In a recent article, Pierce et al. [73] concluded that an efficient semi-automated approach to social media monitoring may provide earlier insights into certain adverse events, particularly since patient reports of adverse events in social media are more sensitive to underlying changes in patients’ physiological state than traditional spontaneous reports. However, the affordances of SNS vary. SNS privacy policies may hinder the availability of user-generated data for data mining purposes. For example, special dispensation is required for the use of data from Facebook, which is the world’s largest social network [73]. Twitter content remains publicly available; nonetheless, the noisiness of Twitter data (short or fragmented sentences, abbreviations, misspellings) can significantly impact the performance of classification methods [74]. Twitter could potentially be an important source of otherwise unreported adverse drug events, but the extracted data are noisy and hard to process [74,75,76].

Compared to the specialised healthcare social media, generic SNS contain larger volumes of data. The specialised healthcare SNS, however, contain higher proportions of relevant data [55].

Furthermore, the trustworthiness of social data from generic SNS is questioned, since data quality control is lacking and data authenticity cannot be verified. This implies that, while social media mining can reveal early signs of potential ADRs, the information is not sufficient for the proper processing of the identified suspected cases and for the establishment of causality (signal verification). Domain experts of regulatory authorities still need to employ other instruments in order to assess potential drug safety risks. In this light, Freifeld et al. [77] concluded that while patients reporting AEs on Twitter showed a range of sophistication when describing their experience, the wholesale import of individual social media posts into post-marketing safety databases would not be advisable. Rather, in parallel with other post-marketing sources, such data should be considered for idea generation, and reasonable hypotheses followed up with formal epidemiologic studies. Additional work is needed to improve data acquisition and automation.

2.3 Search logs

Searching for health information on the web is growing, with an increasing number of people considering the Internet as an important source of knowledge [78]. Anker et al. [78] note that health information seeking represents a purposeful and goal-oriented activity that, according to Niederdeppe et al. [79], describes active efforts to obtain specific information in response to a relevant event, outside of the normal patterns of exposure to mediated and interpersonal sources that constitute mere information scanning. Search query volume can provide a measure of pharmaceutical utilisation in the community [168]. In this light, Bragazzi and Siri [80] demonstrated that online searches for antidepressants reflect the usage pattern recorded and monitored by the Italian Drug Agency.

Tapping into back-office social data, several scholars have demonstrated how search logs can be used to detect new ADRs. White et al. [81] demonstrated that anonymised signals on drug interactions can be mined from search logs, using as a test case an adverse event reported in 2011 (hyperglycaemia) caused by a previously unknown interaction between paroxetine, an antidepressant, and pravastatin, a cholesterol-lowering drug. By mining search queries on Google, Bing and Yahoo Search from 2010, White et al. [81] found that people who searched for both drugs were also more likely to search for terms related to the adverse event than those who searched for only one of the drugs. Search information was provided to the researchers anonymously by users who agreed to share their search history. The study was carried out after the ADR had been identified, and using this approach for the detection of unknown ADRs remains a challenge. Similarly, Chokor et al. [82] mined a variety of Internet data sources and search engines (mainly Google Trends and Google Correlate) for information on reactions associated with the use of two popular major depressive disorder (MDD) drugs: duloxetine and venlafaxine. Yom-Tov and Gabrilovich [83] noted that web search queries are potentially more suitable for the detection of less acute, later-onset drug reactions, as acute early-onset ones are more likely to be reported to regulatory agencies.
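The comparison underlying this kind of search log analysis can be illustrated with a simplified sketch. It assumes that each user's search history has already been reduced to a set of normalised query terms; the function names and toy data are hypothetical, and the sketch does not reproduce the statistical methodology of White et al. [81].

```python
def share_with_event(users, event_terms):
    """Fraction of users whose query set contains any event-related term."""
    if not users:
        return 0.0
    hits = sum(1 for queries in users if queries & event_terms)
    return hits / len(users)


def compare_groups(user_queries, drug_a, drug_b, event_terms):
    """Compare users who searched for both drugs with users who searched
    for exactly one of them.

    user_queries: list of sets of normalised query terms, one set per user.
    Returns (share among both-drug users, share among single-drug users).
    """
    both = [q for q in user_queries if drug_a in q and drug_b in q]
    one = [q for q in user_queries if (drug_a in q) != (drug_b in q)]
    return share_with_event(both, event_terms), share_with_event(one, event_terms)


# Toy example with invented data.
users = [
    {"paroxetine", "pravastatin", "high blood sugar"},
    {"paroxetine", "pravastatin"},
    {"paroxetine", "headache"},
    {"pravastatin", "high blood sugar"},
    {"pravastatin"},
    {"paroxetine"},
    {"weather"},
]
event = {"high blood sugar", "hyperglycaemia"}
both_share, one_share = compare_groups(users, "paroxetine", "pravastatin", event)
```

A higher share in the first group than in the second would be consistent with an interaction signal, although any real analysis would also need to test whether the difference is statistically meaningful.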

Sarker et al. [55] reviewed studies describing approaches for ADR detection from social media from the MEDLINE, Embase, Scopus and Web of Science databases, and the Google Scholar search engine. Their review suggests that in terms of sources, both health-related and general social media data have been used for ADR detection, and while health-related sources tend to contain higher proportions of relevant data, the volume of data from general social media websites is significantly higher. Norén [84] further hypothesised that the study of search logs could help pharmacovigilance systems gain insight into a population segment that would hesitate to share adverse drug experiences on social media and would prefer to search for relevant information on the Internet. In view of this promising outlook, the FDA examined the potential use of search to identify adverse drug reactions in collaboration with Google [85].

Table 1 Advantages and limitations of social data in pharmacovigilance
Table 2 Comparison of social data sources for pharmacovigilance
Table 3 Challenges to the operationalisation of social data for pharmacovigilance

The literature review findings were further analysed to identify advantages and limitations of the use of social data in pharmacovigilance and to conduct a comparative assessment of social data sources with respect to a number of key attributes of social big data, including population coverage, usefulness, timeliness, accessibility, quality and processability. These attributes serve as the comparison criteria. Table 1 outlines the advantages and limitations of social media in the context of pharmacovigilance, across the above set of big data attributes, while Table 2 presents the qualitative comparison of the different social data sources in terms of the aforementioned six criteria.

3 Current challenges and the way forward

Pharmacovigilance calls for meaningful, usable and reliable information [84], i.e. information that is relevant, timely, consistent and complete (so as to support ADR detection and assessment), that is available and can be extracted and processed, and that is accurate and trustworthy. Experimentation and evidence on the ability of social media to enhance pharmacovigilance methods and processes continue to grow. However, despite the formal acknowledgement of their potential value by pharmacovigilance managing authorities, many challenges exist at different levels and, as a result, social data remain a largely untapped source of knowledge. There are challenges inherent to big data, namely the high volume of data, the different formats in which data (both structured and unstructured) are captured, as well as other intrinsic, structural and regulatory barriers that impede access and analysis. Risks associated with big data include the extraction of useless data (i.e. data that do not fit the purpose) and the extraction of data from sources of unknown quality or without authorisation. The ADR-PRISM project [86] identified five major challenges to the operationalisation of social data for pharmacovigilance, namely: (1) variable quality of information on social media, (2) guarantee of data privacy, (3) response to pharmacovigilance expert expectations, (4) identification of relevant information within web pages and (5) robust and evolutionary architecture.

The use of social media for pharmacovigilance represents a knowledge discovery process. The present study examines its challenges in three dimensions:

  • Conceptual: Challenges that relate to the purpose and value of social media use in pharmacovigilance (value/utility of social media as knowledge sources for pharmacovigilance),

  • Technical: Challenges that relate to the feasibility of the process (information extraction from social media and analysis of social data) and

  • Environmental: Challenges that relate to compliance concerns and affect the acceptability of any new pharmacovigilance process proposed (data privacy and regulatory framework).

The major challenges identified from our literature review (summarised in Table 3) are examined in the following subsections, where each challenge is described and critically analysed to produce insights and future directions.

3.1 Value/utility of social media as knowledge sources for pharmacovigilance

Challenges are not solely technical, since the sharing and use of public health data are also conditioned by motivational, economic, political, legal and ethical barriers [87]. Typically, a valid AE that is reportable to a regulatory agency must meet the following four criteria: identifiable reporter, identifiable patient, suspect drug and adverse event. In addition, clinical, pathological and epidemiological information relating to adverse reactions is necessary for a full understanding of the nature of an adverse reaction [13]. Most social media sources fail to provide complete information for case assessment [88].

The credibility of data varies across social media. Concerns relate to the quality, trustworthiness and integrity of information collected from social channels, the volatility and the overall uncertainty of social data.

Social media generate patient-centric data, which are typically unfiltered and unchecked, and may use incorrect terms or refer to diagnoses that are based on Internet research rather than confirmed by healthcare professionals (risk of misinformation). Social media disclosures are also conditioned by the discloser’s gender and cultural background, resulting in differences in linguistic expression, inhibition levels and tendency for social interaction [89], and in a risk of bias (in terms of age, gender, ethnicity and physical location) [57]. Furthermore, it is not clear whether patients and clinicians report the same concerns. A study by Topaz et al. [90] compared patients’ concerns in social media with clinicians’ reports in electronic health records, discovering significant correlations. However, researchers have identified a different emphasis on the type of adverse events reported on social media compared to SRSs, which suggests that social media may be a better source for symptom-related or less serious (non-life-threatening or not requiring hospitalisation) adverse events than for laboratory test abnormalities and serious adverse events. A systematic review conducted by Golder et al. [63], investigating the prevalence, frequency and comparative value of social media-derived information on the adverse events of healthcare interventions, concluded that, although reports of adverse events are identifiable within social media, there is considerable heterogeneity in the frequency and type of events reported, and the reliability or validity of the data is not thoroughly evaluated. Adverse events identified via social media, but not documented elsewhere, also tended to be mild or related to quality of life. In contrast, under-represented adverse events on social media tended to include laboratory abnormalities or effects requiring diagnosis from a healthcare professional. Serious or severe adverse events were also under-represented in social media [91, 157].

Hence, pharmacovigilance practice could benefit from empowered, motivated and knowledgeable patients, through increased public awareness of pharmacovigilance, through the closer engagement of pharmacovigilance stakeholders (regulatory authorities, industry) with patients, through the diffusion of scientific knowledge regarding health, illness, therapy and medicines, and through the further strengthening of health support networks and communities.

3.2 Information extraction from social media

Social media produce large amounts of raw data that are challenging to analyse. Limitations inherent to big data include difficulties in search, the large volume of irrelevant data, issues surrounding the lack of validation, user bias, etc. Social data are characterised by large volumes, high variety and a high frequency of new data generation (data velocity).

Pierce et al. [73] stressed the inherent variability across data sources, which can change rapidly over time. Norén [84] noted that the different sources of Internet-based data vary in terms of scope and coverage, as well as with regard to the richness of the provided information. This may include limitations caused by website characteristics (e.g. character limitation) [73]. As SNS vary in terms of their core functional building blocks (identity, conversations, sharing, presence, relationships, reputation and groups), a profound understanding of their characteristics is required in order to develop strategies for monitoring, understanding and responding to different social media activities [92, 164]. For pharmacovigilance, this signifies that the degree of uncertainty and bias of each SNS needs to be taken into consideration. A careful combination of these data sources is required to fully realise their benefits and generate valuable signals [57]. In this process, social media sources need to be considered separately and methods need to be tailored to each channel individually, as each one carries its own challenges. The risk of duplicate reports (e.g. caused by parallel posting on multiple platforms) also exists.

The principal task of signal extraction from social media consists in the identification of drugs and symptoms (entity recognition) and of the relationship between them.

Several challenges exist in extracting ADR information from social media, with the principal ones related to the named entity recognition (NER) problem (i.e. the fact that both drug names and reaction terms can be described in a variety of ways) and the relation detection problem. According to Sloane et al. [57], technical challenges inherently related to signal extraction from SNS include:

  1. Drugs may be described by their brand names, active ingredients, colloquialisms or generic drug terms (e.g. antibiotic);

  2. ADRs may be referred to using creative idiomatic expressions or terms not found within existing medical lexicons;

  3. The informal nature of social media results in a prevalence of poor grammar, spelling mistakes, abbreviations and slang;

  4. The existence of a side effect may be clear, while the specific side effect experienced remains unclear;

  5. Discussion of a drug could involve indications, beneficial effects or concerns of an adverse event;

  6. Only a small percentage of social media will relate to ADRs.
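The first and third challenges in the list above (varied drug names and frequent misspellings) can be illustrated with a small sketch of lexicon lookup combined with approximate string matching. It uses `difflib.get_close_matches` from the Python standard library; the lexicon, the cutoff value and the function name are assumptions made for illustration, and the approach is a generic technique rather than the method of any specific study cited here.

```python
import difflib
import re

# Hypothetical lexicon mapping surface forms (brand or generic names)
# to a normalised drug name.
DRUG_LEXICON = {
    "paroxetine": "paroxetine",
    "paxil": "paroxetine",
    "pravastatin": "pravastatin",
    "duloxetine": "duloxetine",
    "venlafaxine": "venlafaxine",
}


def find_drug_mentions(post, cutoff=0.8):
    """Return (token, normalised drug) pairs for tokens close to a lexicon entry.

    cutoff is the minimum similarity ratio (0..1) accepted by difflib;
    0.8 is an arbitrary illustrative choice, not a tuned value.
    """
    mentions = []
    for token in re.findall(r"[a-z]+", post.lower()):
        match = difflib.get_close_matches(
            token, list(DRUG_LEXICON), n=1, cutoff=cutoff
        )
        if match:
            mentions.append((token, DRUG_LEXICON[match[0]]))
    return mentions


mentions = find_drug_mentions(
    "Been on paroxetin and pravastatine for 2 weeks, feeling dizzy"
)
```

Character-level matching of this kind only addresses spelling variation; idiomatic descriptions of ADRs (the second challenge) generally call for the machine learning approaches discussed in the remainder of this section.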

Scholars have made significant progress on topics revolving around the recognition of drug names, symptoms and ADRs in social media texts using automated or semi-automated methods [93]. Research efforts for the development of appropriate text mining methods and natural language processing (NLP) techniques are ongoing.

One of the principal challenges is the extraction of medical entities from noisy patient-generated content. Given the large volume of social media posts, efforts towards automatic text classification for ADR detection are receiving growing attention [70, 94,95,96,97,98,99,100,101]. However, lexicon-based approaches [47] for medical entity recognition and tools like MetaMap [102], developed by the US National Library of Medicine to map medical concepts to concept codes from the Unified Medical Language System (UMLS) Metathesaurus, are not sufficient, given the informal, colloquial nature of discussions and the non-adherence to standardised terminology by participants [103]. Spelling variants and machine learning techniques are used to detect drug names and symptoms [75, 104]. Text classification approaches for ADR detection have been created [95]. Given the limited amount of annotated data available publicly [55], comparable sets of documents for algorithm training are being developed to further assist research efforts (e.g. TwitMed) [105, 106]. Segura-Bedmar et al. [107] stressed that the task of extracting relations between drugs and their effects from social media in languages other than English can also be hindered by the lack of relevant lexical resources. User-expressed medical concepts are often non-technical, descriptive and challenging to extract. The use of sentiment analysis has also been proposed as a means to improve the performance of ADR identification methods [108,109,110,111]. Human curation is also being investigated for the purpose of establishing a gold standard against which future automated classification methods may eventually be compared [112]. The need for manual data labelling is expected to drop considerably with the application of neural network-based tools [113, 114]. Abdellaoui et al. [98] apply distance-based filtering in order to distinguish between false positives and true ADR declarations. The framework proposed by Liu and Chen [71] employs a hybrid approach combining statistical machine learning methods and rule-based filtering with information from medical knowledge bases, and report source classification to reduce noise. Advanced machine learning-based NLP techniques, also promoted by Nikfarjam et al. [115], allow for the detection of colloquial expressions of ADRs via sequence labelling approaches, such as CRFs or RNNs [116]. Furthermore, multidimensional analysis of ADRs is required to allow for the discovery of associations between drugs and symptoms [117]. Chowdhury et al. [103] proposed a multitask framework based on the sequence learning model with improved learning efficiency and prediction accuracy for ADR and symptom identification.

A scoping review by Lardon et al. [118] noted that, while there is a multitude of methods for identifying target data, the processes of extracting data and evaluating the quality of medical information from social media are not easily scalable and studies usually failed to accurately assess the completeness, quality and reliability of the data that were analysed from social media. While experimental methods have proved advantageous in identifying previously unknown adverse drug reactions, constant active screening of social media is challenging. The knowledge extraction process is effort-intensive and further scrutiny of Proto-AEs (i.e. posts with a resemblance to an adverse event) by medical regulatory authorities and competent medical professionals is often required in order to extract valid ADR signals. The enormous number and diversity of conversations that can take place in a social media setting mean that there are format and protocol implications for stakeholders seeking to make sense and extract knowledge from them. It is crucial that the processes of managing and interpreting this new information are efficient and effective for sustenance, thoughtful use of resources and valuable return of knowledge [60]. Challenges span the entire knowledge discovery value chain from data extraction to data processing and sense-making, including: collection (selection of data types and data sources, as well as validation of the collected dataset), processing (application of relevant knowledge extraction and processing methods, technologies and tools) and analysis (expert analysis for sense-making).

3.3 Analysis of social data

Analysis of social data for medicine safety surveillance is a challenging task. Challenges exist across all phases of the process: signal detection, development of a causality hypothesis and testing of the causality hypothesis.

A recent pilot study by Bhattacharya et al. [61] suggests that the use of traditional pharmacovigilance methods to analyse social media data is ineffective. Mao et al. [119] note that frequency data should not serve as prevalence of the adverse effects/reactions, but as a measure of which symptoms may be the most salient to patients on a day-to-day basis. Pierce et al. [73] underline that further research is needed to develop best practices and methods for determining what constitutes a safety signal in social media. Further research is also needed to investigate the impact of cross-channel information diffusion (data source correlation resulting in information duplication).

While significant progress is being made with regard to signal detection and causality hypothesis generation, determining causation (i.e. ascertaining causality for the identified drug–symptom associations) remains a challenge. The collection of more case data relating to possible causation is needed [120]. Further research is required towards the development of a comprehensive approach to combining evidence from multiple social media, while considering the level of trust in each source [57, 86]. To describe this process, Adjeroh et al. [121] employ the term "signal fusion", citing the diversity of social media sources, noise, data redundancy and correlation between sources as its major challenges. Abbasi et al. [122] have proposed the CRUFS framework (an acronym denoting credibility, recency, uniqueness, frequency and salience) as a uniform foundation for critically assessing different data channels in social media analysis of adverse drug events.

3.4 Data privacy

In the background of the knowledge discovery process stand personal data protection concerns, which make balancing the respective interests of patient data protection and medication safety monitoring challenging [123] and the verification of adverse reaction allegations nearly impossible [124]. It is critical that the capabilities offered by new technologies are harnessed in a way that is ethical, compliant with regulations, respectful of data privacy and ensures responsible use of data [60]. As the number of actors engaging with social media and SNS data increases, so does the risk of potential privacy infringements. The increasing demand for data protection driven by rapid technological advancement and the necessity to reinforce users’ trust in services provided by the public and private sectors are inducing legislators to approve data protection laws or amend existing regulations in order to adapt them to technological evolution and new challenges [125]. These new data protection regulations confront researchers with imposing hurdles, ranging from the validity of both the data and how they are sampled to the ethical issues regarding their use [126]. Social networking data will qualify as personal data insofar as they relate to an identified or identifiable individual. Although data from social networks are public, this fact does not deprive them of the protection offered by data protection legislation. The processing of such data still needs to be fair and lawful. As a result, there needs to be a legitimate ground on the basis of which the data can be processed.

3.5 Regulatory framework

While SNS offer a large and often untapped potential to identify safety issues, the appropriate and effective use of social media can be overwhelming. Desai [127] stresses that new regulatory paradigms are needed and lists the following important questions that need to be answered:

  • What is the limit of the industry’s responsibility in collecting and reviewing social media data?

  • How can pharmacovigilance teams confirm the identifiability of the reporter and patient in safety data obtained via social media and establish safeguards against faulty adverse event reporting?

  • What will be acceptable practices for following up on potential signals within the context of data privacy?

  • What are the protocols for big data integration, analysis and interpretation, and reporting of follow-up results?

Golder et al. [63] note that the methods that can help incorporate these datasets into current pharmacovigilance systems are largely unexplored. Regulatory acceptance of social data might be lower than for traditional sources, due to the uncertainty surrounding the appropriate use of such data from a patient privacy point of view and the lack of defined strategies or frameworks for meeting the standards regarding data validity and generalisability [128]. Nonetheless, a need for patient-centricity is increasingly recognised by high-profile institutional drug safety stakeholders (Council for International Organizations of Medical Sciences (CIOMS), European Medicines Agency (EMA), Food and Drug Administration (FDA), Innovative Medicines Initiative (IMI), etc.), who have developed a series of regulatory guidance documents and launched initiatives in this direction across the entire value chain of pharmacovigilance [129]. Experts stress the need for strong and systematic processes for selection, validation and study implementation [130]. As practices and legal requirements vary across countries, the need for a concrete policy framework on the further use of social media as a new valid type of data source for pharmacovigilance is emphasised [130]. Smith and Benattia [129] further stress the need for an internal revision of the form and function of pharmacovigilance within the biopharmaceutical industry.

In view of the above points, scepticism lingers within the research community regarding the potential use of social media resources in pharmacovigilance. There are differing views among experts on whether social media should only be considered as a support to signal detection (complementing and enriching primary data sources) or whether it has potential for de novo signal detection or assessment and whether the evidence on its utility is required before making these decisions [131]. New methods have had limited success in identifying new drug safety signals. Proof exists that social media may be a source for novel or rare adverse events and mild adverse events and for ascertaining patient perspectives [63, 132]. The growing interest in the field is met with a call for more studies to demonstrate and understand the potential of social data and define their role for the purposes of pharmacovigilance. Research regarding the utility of social data is ongoing [133], as is work in the field of cognitive technologies (e.g. machine learning, artificial intelligence, etc.). Moving beyond the digital trend, IBM [134] predicts that the future of health is cognitive: through the use of cognitive platforms designed to ingest vast quantities of structured and unstructured information from various sources and to allow researchers to find correlations and connections, in order to identify new patterns and insights to accelerate discoveries, treatments and the delivery of health improvements.

Table 4 Knowledge extraction from the perspective of pharmacovigilance

Benefiting from advances made in the application of knowledge extraction technologies to other scientific domains, some problems that relate to the technical feasibility of social media use in pharmacovigilance seem to be nearly solved, while enhancements, improvements and new promising solutions are being announced frequently. However, for achieving the development of effective instruments for knowledge extraction in real-life situations, there is still a need for an in-depth investigation of the overall feasibility and effectiveness of these basically experimental proof-of-concept methods and of their potential contribution to pharmacovigilance systems.

Social media have the potential to offset limitations of traditional data sources (time difference between event occurrence and discovery, under-reporting, lack of geographic diversity, loss of patients’ perspectives), particularly by means of their volume, broad coverage and timeliness. However, since they too are not without disadvantages, complementarities need to be sought with traditional data sources. Harnessing the complementary strengths of traditional data sources and social media can open new directions and expand the scope of post-market safety monitoring. Powell et al. [62] note that additional value can be created by supplementing social data with other sources of information (e.g. product label, sales data, etc.). Overall, further research is needed to better understand the strengths and limitations of social media in post-market safety surveillance and establish best practices [62]. With regard to the four points (key tasks, see Sect. 1.2) put forth by WHO to describe the vision of pharmacovigilance in its traditional sense, social media have proved their potential as lead and lag indicators of unknown and known ADRs, respectively. Scholars have successfully exploited social data for the early detection of unknown adverse events and for the detection of increases in frequency of known adverse reactions (although the definition of relevant proportionality measures for social media remains a challenge). In cases where more detailed information is available on the person, their exposure to the medicine, the adverse outcome and any possible contextual factors that may have affected the observed result, social media can also assist in the identification of risk factors and the formulation of hypotheses regarding possible causal mechanisms between drug and symptom. The estimation of quantitative aspects of benefit/risk analysis remains a challenge for researchers, as the traditional proportionality-based schemes are not applicable in this case. Table 4 provides an overview of the current state of research on knowledge extraction for the purposes of pharmacovigilance: the main aspects of knowledge extraction are identified (information extraction, data analysis, privacy, regulation) and elaborated in terms of the specific domain perspective, including a categorised presentation of open challenges, issues and prospects and of current practice in the community (principal and secondary/complementary areas of research).
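For readers unfamiliar with proportionality-based schemes, the sketch below computes the proportional reporting ratio (PRR), a standard disproportionality measure used with spontaneous report databases. The source text does not name a specific measure, so the choice of PRR, the function name and the example counts are assumptions made for illustration only.

```python
def proportional_reporting_ratio(a, b, c, d):
    """Proportional reporting ratio from a 2x2 contingency table of reports.

    a: reports mentioning the drug of interest and the event of interest
    b: reports mentioning the drug of interest and any other event
    c: reports mentioning other drugs and the event of interest
    d: reports mentioning other drugs and any other event
    """
    if a + b == 0 or c + d == 0 or c == 0:
        raise ValueError("PRR is undefined for these counts")
    return (a / (a + b)) / (c / (c + d))


# Invented counts: the event appears in 20% of reports for the drug of
# interest, against 10% of reports for all other drugs.
prr = proportional_reporting_ratio(20, 80, 100, 900)
```

The four counts presuppose a well-defined database of reports. For social media, where relevant posts are a small, noisy and possibly duplicated subset of all content, it is less clear how such counts should be obtained, which may be one reason why the definition of proportionality measures for social data remains open.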

4 Conclusion

As outlined in its research objectives, the present paper makes a number of contributions to the area of pharmacovigilance and particularly to its social data applications. This section summarises these contributions and presents the conclusions and the gained insights and lessons learnt.

4.1 Limitations of this study

While there are several strengths to this study, which was conducted in a structured and methodical manner, limitations of the review need to be acknowledged. The study only included evidence from the MEDLINE/PubMed database and was intended as a scoping exercise, aimed at investigating the applications of social media to pharmacovigilance, through the critical analysis and comparative assessment of the relevant body of literature. Although the present research effectively identified the most prominent works in the field, for future studies, it would be interesting to assess evidence from other databases as well. As the body of evidence grows, future studies could also aim to perform an in-depth critical analysis of each specific topic identified in the present work.

4.2 Contributions of this study

The present review aimed mainly to map the state of the art in applications of social data to pharmacovigilance, to provide a thorough up-to-date literature review, to contribute to the systematisation of current knowledge about the use of social data in this field and to explore the potential of social data for the detection of ADRs.

In particular, the paper makes the following distinct contributions:

Literature review of the applications of social media to pharmacovigilance

A comprehensive literature review of the applications of social media to pharmacovigilance was conducted, with a critical analysis of the relevant body of literature, focused on identifying important new dimensions of this topic and the associated key challenges and on gaining new insights. (The review methodology and execution are described in Sect. 1; the review analysis and its findings span Sects. 1, 2 and 3.) A total of 100 recent scientific articles were reviewed, and the final selection of articles was made based on relevance to the application of SNS to pharmacovigilance (Sect. 1.1). The findings of this literature review were further analysed to formulate new insights, key challenges and future research directions.

Exploration of social media sources for use in pharmacovigilance

Another contribution is the classification and assessment of social data sources (Sect. 1).

In particular, based on the review findings, the existing work on social data applications to ADR detection was analysed in terms of category of SNS and nature of reporting. A taxonomy of social data sources (i.e. social media including web data) has been proposed in terms of context and purpose. Within this taxonomy, the specialised healthcare SNS category can be further classified into three types of specialised healthcare SNS/forums: health-centred SNS (generic networking sites on general health topics, usually requiring user profiles), disease-specific online health forums/SNS (focused on specific diseases) and medicine-focused sharing platforms (or patient forums), as explained in Sects. 1.2 and 1.3.

Presently, there is a growing interest by institutional drug safety stakeholders in exploring the use of social media (social listening) to supplement established approaches for pharmacovigilance, by harvesting information on patients’ experiences after exposure to pharmaceutical products.

The classification of social data sources was complemented by a literature-informed qualitative comparison of these sources with regard to a set of key attributes of social big data, including population coverage, usefulness, timeliness, accessibility, quality and processability.

Systematic examination of social data used in pharmacovigilance

A third contribution concerns classifications and taxonomies for social data used in pharmacovigilance (Sect. 1). From the literature review analysis, a new taxonomy of social (networking) data was derived, which also includes two different classifications of social data: (a) according to the context and purposes of data disclosure, and (b) according to the area where data are captured (front-office or back-office operation), as proposed in Sect. 1.3. Furthermore, two types of reporting were identified (Sect. 2), namely solicited and unsolicited, which can also be further analysed in terms of the context and purposes of data disclosure and the area where data are captured.

Finally, the advantages and limitations of the use of social data in the context of pharmacovigilance were identified, critically analysed and discussed (Sect. 2).

New insights, key challenges and recommendations for future research and practice

Last but not least, the present study examined the challenges of knowledge discovery using social data in the context of ADR detection in three dimensions, namely conceptual, technical and environmental (Sect. 3). From the literature review findings, a set of five key challenges related to the use of social data was identified: the value/utility of social media, information extraction from social media, analysis of social data, data privacy and the regulatory framework. Each of these challenges was analysed and discussed in the context of social data in pharmacovigilance, along with useful insights extracted from the review findings, with a focus on potential solutions and future research directions.

4.3 Key insights and future directions

While the potential value of social data is increasing, research on social media-based pharmacovigilance is not in a position to supplant more traditional methods [135]. Salathe [15] stresses that data from traditional health systems and patient-generated data have complementary strengths (high veracity in the data from traditional sources and high velocity and variety in patient-generated data) and, when combined, can lead to more robust public health systems. Lazer et al. [135] call for an all-data revolution, i.e. methods that employ data from all traditional and new sources. The literature review findings also indicate agreement among scholars on the potential of social listening to support and supplement pharmacovigilance systems, which currently rely mainly on traditional ADR reporting. According to Incio et al. [136], this can contribute to better decision-making processes in regulatory activities.

Nonetheless, the value of mining social media for ADRs has not yet been realised in practice. Methods exist to reduce noise and make the data suitable for post-market safety surveillance. However, big data cannot be considered a substitute for traditional data collection and analysis, but rather functions as a supplement to existing methods [61, 122]. Abbasi et al. [122] stress the importance of developing an understanding of the strengths and limitations of the various social media channels and the capabilities of real-time analytics. Additional research is therefore needed to better understand the strengths, limitations and best practices of social data innovations in the context of pharmacovigilance, to determine which channels are most suitable with respect to various dimensions.

Social media monitoring is expected to become a standard practice in pharmacovigilance in the future. For this purpose, a careful evaluation of the use of social media as a pharmacovigilance instrument is required, along with new data processing techniques and software tools and infrastructure adapted to the volume, velocity, structure and veracity of social media data. Already, marketing authorisation holders (MAHs) are required by European law to establish and maintain a system for pharmacovigilance and record all suspected adverse reactions brought to their attention. This includes recording suspected ADRs from digital social media.

At present, social media monitoring can be used only to generate hypotheses, not to test them. Emerging technologies are nevertheless expected to increase its sense-making capabilities and lead to actionable evidence and eventually to the grand vision of achieving complete digital vigilance. This implies delving into more challenging questions, such as the detection of drug-to-drug interactions (DDIs) using social media, for which only limited examples currently exist [137]. Beyond the early discovery of ADRs (the reduction in false positives, etc.), the ultimate objective of future research should be to enable the development of methods and instruments that can identify with certainty the existence of a causal relationship between drug and adverse event (causality) and assess the severity and preventability of the ADR [138]. This assessment is critical if healthcare stakeholders are to develop strategies and plan interventions to reduce the burden of ADRs.

Research in this field is ongoing, with artificial intelligence (AI)-based web and social listening emerging as a promising solution that may improve levels of accuracy and reliability of human-directed monitoring [139], reduce manual data labelling requirements [113, 140] and enable coordinated and efficient systems for developing actionable evidence on medicine safety and effectiveness [141].

Future efforts should be aimed at further developing computational methods for processing large data volumes and natural language processing methods for more detailed and sophisticated data analysis (e.g. to establish causality with the help of social media data). Furthermore, there are some early indications that the joint analysis of multiple data sources (multimodal signal detection) may lead to improved signal detection [93]. Holistic examination and interpretation of knowledge sources are needed in order to produce timely, reliable and actionable results. According to Harpaz et al. [93], this requires a deeper understanding of the data sources used, additional benchmarks and further research on methods to generate and synthesise signals. Moreover, according to Bate et al. [88], a scientifically robust strategy for measuring the specific value of innovative big data sources is needed before such innovations can be incorporated into formal decision-making processes.

In conclusion, substantial future benefits to pharmacovigilance practice are expected to be realised through further advances in data availability and in computational methods for mining insights and inferences from large data sets.