Background

In recent years, numerous drug safety databases have been made publicly available, e.g., LiverTox (http://livertox.nlm.nih.gov/) [1], SIDER (http://sideeffects.embl.de/) [2], TOXNET (http://toxnet.nlm.nih.gov/), the US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS), and PubMed (http://www.ncbi.nlm.nih.gov/pubmed). These databases contribute significantly to the research community by facilitating a better understanding of drug safety issues [3]. Mining large-scale drug safety data is a promising avenue for drug regulation [4]. Some databases integrate safety data from various sources in free-text format; for these, text mining can leverage the textual information to gain knowledge of drug safety and thus address critical safety issues that are difficult to approach with other databases.

Topic modeling is a widely used text mining approach for analyzing large volumes of unlabeled documents in order to discover hidden textual patterns [5]. Previous studies in our group demonstrated that topic modeling could be used effectively to analyze adverse events in FDA-approved drug labels for drug safety assessment [6], and to identify opportunities for drug repositioning [7]. The National Institutes of Health (NIH) LiverTox database provides comprehensive clinical information on drug-induced liver injury (DILI), summarized in free-text format in several sections.

In this study, we extended our text mining effort with topic modeling to the LiverTox database to ask whether additional knowledge, beyond what is explicitly described in the documents, could be extracted to guide an enhanced DILI assessment. We placed our emphasis on drug-induced acute liver failure (ALF), a severe form of DILI with limited treatment options and therefore a significant public health impact. With topic modeling, we identified a topic incriminating a drug's liability to cause ALF based on the text in the "Hepatotoxicity" sections of LiverTox. The identified topic further guided the identification of other drugs with a similar liability and, importantly, most of them could be verified and confirmed with additional data. This proof-of-concept study demonstrates the potential utility of applying topic modeling to existing text documents in the public domain to gain knowledge as a predictive means for the enhanced assessment of drug safety.

Methods

LiverTox database

The NIH LiverTox database was developed by the Liver Disease Research Branch of the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and the National Library of Medicine (NLM) to promote basic and clinical research on DILI [1]. It is a free online source of textual documents on DILI summarized from various databases, the scientific literature, and the interpretations of the curators. LiverTox contains a set of documents (one document per drug), and each document contains multiple sections. Each section provides a different set of DILI information, including introduction, background, hepatotoxicity, mechanism of injury, outcome and management, and others (e.g., case report, product information, chemical structure, and references). In this study, only the "Hepatotoxicity" section was used for topic modeling because it mainly contains the DILI-relevant clinical observations. The findings were compared against the information from other sections such as "Outcome and management" to demonstrate the utility of the method. Where those sections presented no clear ALF evidence, the results were compared to data from other sources. The "Hepatotoxicity" section for each drug contains, on average, 200-400 words that summarize the DILI-relevant information including clinical features, time to onset and recovery, liver enzymes (frequency of elevation, fold change, and serum levels), liver injury pattern, immunoallergic and autoimmune features, and other hepatotoxicity-relevant data. A total of 478 documents (i.e., 478 drugs) were used for topic modeling.

Topic modeling

Latent Dirichlet allocation (LDA), one of the most popular topic modeling approaches [5, 8-10], was applied to explore the LiverTox database. We used the LDA implementation in Mallet, a Java-based package, for topic modeling [11]. The number of topics that optimally represents the content of all documents is a key parameter of a topic model. The optimal number can be determined by fitting models with different numbers of topics to the data, with model fitness estimated by the likelihood of the data given a topic model [10]. To obtain sparse topic and word distributions, the Dirichlet hyperparameters alpha (α) and beta (β) were set to 0.1 and 0.01, respectively. Before topic modeling, English stop-words and numerical digits were removed. In addition, three words (i.e., liver, injury, and elevation) present in more than 80% of documents were also removed, because words with high frequency across the documents do not provide discriminative information for topics. The words in each document were then tokenized and passed to LDA to train a topic model. The model yields two probability distributions: one gives a probability value (θ) for each topic to determine its relevance to each document, and the other assigns a probability value to each word for its relevance to each topic.
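The preprocessing steps described above can be sketched in Python as follows. This is only an illustrative sketch, not the pipeline used in the study (which relied on Mallet): the function names, the abbreviated stop-word list, and the regular-expression tokenizer are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, abbreviated stop-word list; a real run would use a full English list.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "with", "was", "for", "or"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens only (digits are dropped), remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def remove_frequent_words(docs, max_doc_fraction=0.8):
    """Drop words that occur in more than max_doc_fraction of the documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    cutoff = max_doc_fraction * len(docs)
    too_common = {w for w, n in doc_freq.items() if n > cutoff}
    cleaned = [[t for t in tokens if t not in too_common] for tokens in docs]
    return cleaned, too_common

def preprocess(raw_docs, max_doc_fraction=0.8):
    """Tokenize every document, then remove words shared by most documents."""
    return remove_frequent_words([tokenize(d) for d in raw_docs], max_doc_fraction)
```

The token lists returned by `preprocess` are what would then be handed to an LDA implementation; the 0.8 default mirrors the 80% document-frequency threshold stated in the text.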

Identification of ALF-Topic

As listed in Table 1, 26 drugs known to cause ALF were selected and used to identify the topic most relevant to ALF in the topic model. Of these, 23 drugs were annotated by Suzuki et al. with a justified causality assessment from the ALF survey conducted in the United States [12]. The other 3 drugs (i.e., benzbromarone, tolcapone, and troglitazone) were withdrawn from the market due to drug-induced ALF.

Table 1 Summary of topic model for the 26 known ALF drugs in LiverTox database.

For the 26 known ALF drugs, the mean topic distribution was calculated to determine the topic that best represents these drugs. This so-called ALF-Topic is defined as the topic j for which the mean probability value θj is the greatest among all topics. In this model, other drugs highly associated with ALF-Topic are expected to be related to ALF.
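The definition of ALF-Topic amounts to an argmax over the mean of θ across the known ALF drugs. A minimal sketch is given below, assuming the document-topic probabilities are available as a dictionary mapping each drug to its list of θ values; the function names and the toy numbers are hypothetical and are not results from the study.

```python
def mean_topic_distribution(theta, drugs):
    """Average the per-document topic probabilities over the given drugs."""
    n_topics = len(next(iter(theta.values())))
    return [sum(theta[d][j] for d in drugs) / len(drugs) for j in range(n_topics)]

def find_alf_topic(theta, known_alf_drugs):
    """Return (topic index, mean theta) for the topic with the largest mean probability."""
    means = mean_topic_distribution(theta, known_alf_drugs)
    j = max(range(len(means)), key=means.__getitem__)
    return j, means[j]

# Toy values for illustration only; these are not results from the study.
toy_theta = {
    "drug_a": [0.1, 0.7, 0.2],
    "drug_b": [0.2, 0.5, 0.3],
    "drug_c": [0.6, 0.1, 0.3],
}
# Only drug_a and drug_b play the role of known ALF drugs here.
alf_topic, alf_mean = find_alf_topic(toy_theta, ["drug_a", "drug_b"])
```

In this toy example the second topic (index 1) has the largest mean probability over the two reference drugs and would therefore be the one labeled ALF-Topic.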

Investigation of drugs implicated by ALF-Topic

To investigate whether there was any evidence to support the ALF-implicated drugs identified by ALF-Topic, we searched for ALF evidence in their "Outcome and management" sections in the LiverTox database, in the safety sections of the FDA-approved drug labels, in the literature reporting ALF cases with established causality, and in the FAERS post-marketing ALF case reports from 1969 to 2013. The workflow of this study is depicted in Figure 1.

Figure 1 Workflow of this study. The 26 known ALF drugs are from Suzuki et al. [12]. ALF: acute liver failure; FAERS: FDA Adverse Event Reporting System; Hy's law: a well-accepted rule to incriminate ALF [17].

Results

Identification of ALF-Topic

The study started with the determination of the optimal number of topics for the LiverTox dataset. Models were fitted with the number of topics varying between 10 and 150, and the model with 40 topics gave the highest likelihood of the data (Figure 2). Then, the mean probability value (θ) of each topic was calculated for the 26 known ALF drugs. As shown in Figure 3, Topic-37 had a significantly higher mean probability value (0.36) for these drugs than the baseline (0.02) from the rest of the topics (p < 0.01). Therefore, this topic was considered an ALF recognizing/predicting topic and denoted as ALF-Topic.
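Selecting the number of topics by maximum likelihood can be sketched as below, assuming the log-likelihood of each fitted model has already been collected into a dictionary keyed by the number of topics. The function name and the values are hypothetical and are not the values behind Figure 2.

```python
def select_num_topics(log_likelihoods):
    """Return the candidate topic count whose fitted model has the highest log-likelihood."""
    return max(log_likelihoods, key=log_likelihoods.get)

# Toy log-likelihood values for illustration only; not the values behind Figure 2.
toy_scores = {10: -5200.0, 20: -5050.0, 40: -4980.0, 80: -5010.0, 150: -5100.0}
best_t = select_num_topics(toy_scores)
```

Because log-likelihoods are negative, the "highest" value is the one closest to zero, which is why the 40-topic entry is selected in this toy input.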

Figure 2 Log-likelihood of the data (D) given the model (M) with different settings of the number of topics (T).

Figure 3 Mean probability values of all 40 topics for the 26 known ALF drugs.

The following 15 words were the most prevalent in ALF-Topic: "case, acute, hepatic, therapy, serum, pattern, week, clinical, report, jaundice, hepatocellular, patient, typical, severe, aminotransferase". Three of these words (i.e., jaundice, hepatocellular, and severe) were unique to this topic in that they were not simultaneously present among the top 15 words of any of the other 39 topics. These specific words might reflect a clinical phenomenon that indicates the potential of a drug to cause ALF. Thus, ALF-Topic could be applied to identify other ALF-related drugs in this model based on the probability values of this topic for those drugs.

Application of ALF-Topic

For 12 (12/26; ~46%) of the known ALF drugs listed in Table 1, ALF-Topic (i.e., Topic-37) was ranked as the first topic, and this proportion was significantly higher than that of the other topics (p < 0.05). For five other drugs (i.e., Benzbromarone, Halothane, Phenytoin, Simvastatin, and Sulfasalazine), ALF-Topic was ranked second, while for Ibuprofen it was ranked third. No known ALF drug had ALF-Topic ranked fourth. These results suggested that a drug with ALF-Topic ranked among its first three topics might have a high likelihood of being associated with ALF. This criterion (i.e., ALF-associated drugs would have ALF-Topic ranked among their first three topics in the model) was applied to the remaining drugs in the LiverTox database, and a total of 127 drugs were identified.
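The top-three ranking criterion can be illustrated with the following sketch, again assuming θ is stored as a dictionary from drug name to a list of topic probabilities. The function names and toy values are hypothetical; ties are resolved by topic order here, which is an assumption the text does not address.

```python
def topic_rank(theta_row, topic):
    """1-based rank of `topic` when topics are sorted by descending probability."""
    # sorted() is stable, so tied topics keep their original index order.
    order = sorted(range(len(theta_row)), key=lambda j: theta_row[j], reverse=True)
    return order.index(topic) + 1

def flag_candidates(theta, topic, known, top_k=3):
    """Drugs outside `known` for which `topic` ranks within their top_k topics."""
    return sorted(
        d for d, row in theta.items()
        if d not in known and topic_rank(row, topic) <= top_k
    )

# Toy values for illustration only; topic index 1 plays the role of ALF-Topic.
toy_theta = {
    "known_1": [0.05, 0.60, 0.20, 0.10, 0.05],
    "drug_x": [0.40, 0.30, 0.20, 0.05, 0.05],  # ALF-Topic ranked second
    "drug_y": [0.30, 0.05, 0.25, 0.25, 0.15],  # ALF-Topic ranked fifth
    "drug_z": [0.30, 0.20, 0.35, 0.10, 0.05],  # ALF-Topic ranked third
}
candidates = flag_candidates(toy_theta, topic=1, known={"known_1"})
```

With these toy inputs, `drug_x` and `drug_z` are flagged and `drug_y` is not, mirroring how the criterion separates drugs by the rank of ALF-Topic rather than by an absolute probability threshold.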

Confirmation of drugs identified by ALF-Topic

Among the identified drugs, 77/127 (60.6%) were described in their "Outcome and management" sections with ALF-related terms such as "liver failure", "hepatic failure", "liver transplantation", "fatal/death", or "fulminant hepatitis" (Additional file 1: Table S1). The remaining 50/127 (39.4%) ALF-implicated drugs were not mentioned to cause ALF in the LiverTox database (Table 2). We examined the safety sections of the FDA-approved drug labels and found that 13/50 (26%) drugs were mentioned to have ALF risk in the Warnings & Precautions and/or Adverse Reactions sections (Table 2). For another 7 drugs (7/50; 14%), drug-induced ALF with established causality was reported in the literature [12-15]. Of the remaining 30 drugs, 22 (22/50; 44%) had ALF case reports in the FAERS (Additional file 2: Table S2), which were obtained by searching the FAERS with the Medical Dictionary for Regulatory Activities (MedDRA) preferred terms "acute hepatic failure" and/or "hepatic failure". Of the last 8 drugs, 4 were herbal medicines (i.e., Aloe Vera, Ba Jiao Lian, Chi R Yun, and Shosaikoto/daisaikoto) that are not recorded by the FAERS, and no ALF case was reported for the other 4 (i.e., Clofibrate, Methocarbamol, Pentamidine, and Reserpine). In summary, among the 127 identified drugs, evidence supporting risk of ALF was found for 119 drugs (119/127; 93.7%).
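The counts reported above can be cross-checked with a short tally. The numbers are taken directly from the text; only the variable names and dictionary labels are introduced for this sketch.

```python
total_identified = 127

# Number of drugs with supporting ALF evidence, by source, as reported in the text.
evidence_by_source = {
    "LiverTox 'Outcome and management' sections": 77,
    "FDA-approved drug labels": 13,
    "literature with established causality": 7,
    "FAERS case reports": 22,
}

supported = sum(evidence_by_source.values())
unsupported = total_identified - supported  # 4 herbal medicines + 4 drugs without reports
not_in_livertox = total_identified - evidence_by_source["LiverTox 'Outcome and management' sections"]

pct_supported = round(100 * supported / total_identified, 1)
pct_livertox = round(100 * 77 / total_identified, 1)
```

The tally reproduces the 50 drugs without ALF evidence in LiverTox, the 119 supported drugs, and the 93.7% overall figure, leaving the 8 drugs (4 herbal medicines and 4 conventional drugs) for which no evidence was found.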

Table 2 Summary of 50 drugs implicated by ALF-Topic without apparent ALF evidence in the LiverTox database.

Discussion

In this proof-of-concept study, topic modeling was demonstrated to be a promising approach to leverage information from drug safety databases composed of textual data. As a case study, the LiverTox database was explored by topic modeling to discover a hidden pattern for identifying drugs that potentially cause ALF. We deliberately chose to analyze the LiverTox "Hepatotoxicity" section alone so that the findings could be verified against other sections of LiverTox, demonstrating the potential utility of topic modeling in the field of drug safety. Specifically, ALF-Topic was first discovered from the "Hepatotoxicity" sections of the drug documents and interpreted through the prevalence of three specific words (i.e., jaundice, hepatocellular, and severe). This topic was then applied to identify ALF-related drugs in the LiverTox database. Finally, evidence supporting the risk of ALF for the identified drugs was sought in the "Outcome and management" sections of LiverTox, or in other public databases when it was not available from LiverTox.

ALF-Topic was confirmed to be relevant to the well-known Hy's law [16, 17], which states that a drug-induced hepatocellular liver injury pattern together with jaundice has a poor prognosis, with 10-50% fatality from ALF. The predictive power of Hy's law has been verified by extensive studies in Spain and Sweden [18, 19], and it has been recommended by the FDA for assessing the potential of a drug to cause severe DILI [20]. In this study, ALF-Topic identified 127 drugs in the LiverTox database, and approximately 60% (77/127) of these drugs were implicated to cause ALF in their "Outcome and management" sections. Of the remaining 50 drugs, supporting evidence was found for 20 in the safety sections of their FDA-approved drug labels or in the literature with established ALF causality. ALF case reports were identified in the FAERS for another 22 drugs, 6 of which were predicted to be ALF-positive in in vitro experiments. While the in vitro data might not directly indicate ALF potential in humans, these 6 drugs were much closer to the ALF-positive control drugs when tested in vitro. Evidence of ALF from the FAERS should be interpreted cautiously because causality may not be fully established. For example, although ALF cases of Phenelzine were reported in the FAERS, it was emphasized that Phenelzine might not be the suspect drug [21]. In addition to possibly overestimating risk, the FAERS only receives reports from the United States. For example, no ALF case was reported for Ethionamide in the FAERS despite it being known to cause ALF in the United Kingdom [22].

For the 127 identified drugs, evidence supporting risk of ALF was found for 119 drugs. This result strongly suggests that not only the specific words but also their probabilistic/statistical relationships in the hidden structure of the textual documents were crucial to incriminating drugs for ALF. It is worth pointing out that it is beyond the scope of this exercise to ask ALF-Topic to identify all ALF-related drugs, because ALF mechanisms are complex and the 26 known ALF drugs selected for determining ALF-Topic do not necessarily represent the entire landscape of ALF. For example, a hepatocellular liver injury pattern is not observed for Efavirenz and Dicloxacillin, which have the potential to cause ALF [12]. Atorvastatin and Ethambutol, known as ALF drugs [12], are not mentioned to cause either jaundice or hepatocellular liver injury in the LiverTox database.

Conclusions

We explored the LiverTox database using topic modeling, and discovered the hidden knowledge to identify drugs with potential to cause ALF. Our proof-of-concept study demonstrates the applicability of topic modeling to leverage information within the textual drug safety databases, which will provide new opportunities for drug safety assessment.