Outline and Assumptions
The objective of this proof-of-concept (PoC) effort was to develop and assess the performance of a predictive model that can help detect AE under-reporting, and to develop a visual interface for QA professionals. The scope of this PoC was to predict AE under-reporting, not to predict the adverse drug reactions that should occur in clinical trials. GCPs require all AEs, whether or not there might be a causal relationship between the intake of the drug and the events, to be reported to the sponsor in a timely manner [1].
The identification of study investigator sites suspected of under-reporting amounts to an unsupervised anomaly detection problem [12]. In this class of problems, one tries to identify which elements of a data set are anomalous; for example, which objects in a production line show a defect, or which study sites are not compliant with GCP. The main difference from a classification task is that the data points are unlabeled. Under the assumption that a majority of them behave normally, a possible approach to solve these problems is to fit a probability distribution to the data and flag as anomalous those data points that have a likelihood below a certain threshold. The performance of the anomaly detector can then be assessed with a small sample of anomalous points, either manually detected or simulated, and regular ones in the same way as one would assess a classifier, namely with metrics such as the area under the receiver operating characteristic (ROC) curve, precision, recall, or accuracy.
Working on the assumption that the curated data set of completed studies used for model training contained a majority of compliant study sites (see also Sect. 2.2.1), we could build a probabilistic model for the random variable \(Y_{\text{site}}\) describing the number of AEs reported by a given study site. From each site we collected data modeled as a random variable \(X_{\text{site}}\), a feature vector that we believed had a direct influence on \(Y_{\text{site}}\).
When we considered a new study site and observed the feature vector \(x_{\text{site}}\) and the number \(y_{\text{site}}\) of reported AEs, we used the conditional probability density \(p\left( {Y_{\text{site}} | X_{\text{site}} } \right)\) of our model to compute the probability of observing this number of AEs or fewer, which we defined as the significance level. We then picked a threshold and decided to act on significance levels below it.
Clinical trial data can be interpreted as a set of multivariate time series of measurements for each patient in the study (some of them being constant, for instance the demographic data). Furthermore, this data is typically collected during the patient visits, which is when AEs are reported to the investigator [13]. Therefore, we decomposed the number of AEs \(y_{\text{site}}\) reported by a site into the sum of the numbers of AEs reported by the corresponding patients,
$$Y_{\text{site}} = \sum_{{\text{patient}}\,\in\,{\text{site}}} Y_{\text{patient}} ,$$
and similarly, the number of AEs reported by a patient into the sum of the numbers of AEs reported at each visit,
$$Y_{\text{patient}} = \sum_{{\text{visit}}\,\in\,{\text{patient}}} Y_{\text{visit}} .$$
We could make predictions at the site level, the patient level, or the visit level. Given the granularity of clinical data, we decided to focus on the visit level. A sudden change in vital parameters such as weight could be indicative of health deterioration and thus of the occurrence of AEs [14]. Moreover, once we used this model on ongoing studies, we wanted to be able to update our predictions as new data from the sites came in, which was easier to do if we started at the visit level.
We were thus interested in the probability density \(p\left( {Y_{\text{visit}} | X_{\text{visit}} } \right)\) conditioned on the feature vector \(X_{\text{visit}}\) that summarizes the information on the patient known at the time of the visit. To estimate the relation between \(X_{\text{visit}}\) and \(Y_{\text{visit}}\), given the amount of historical data at our disposal, we decided to apply machine learning algorithms. The usual least-squares regression was ill-advised in this situation, as it would imply that predicting zero AEs instead of five costs the same as predicting 95 instead of 100, which was not the case. We could have considered logarithmic least squares, but since we were dealing with a count variable, it was best to minimize the Poisson deviance. In this class of models, the random variable \(Y_{\text{visit}}\) is modeled as Poisson-distributed,
$$Y_{\text{visit}} \sim {\text{Poi}}\left( {\theta_{\text{visit}} } \right),$$
where we had to express the Poisson parameter \(\theta_{\text{visit}}\) as a function of \(X_{\text{visit}}\). Due to the complexity of the underlying biology of AEs, an empirical approach seemed more promising than theoretical modeling, and we decided to use machine learning for this task. An advantage of the Poisson model is that sums of independent Poisson variables are again Poisson-distributed, with parameters that add up, so we immediately obtained:
$$Y_{\text{patient}} \sim {\text{Poi}}\left( \theta_{\text{patient}} \right),\quad \theta_{\text{patient}} = \sum_{{\text{visit}}\,\in\,{\text{patient}}} \theta_{\text{visit}} ,$$
$$Y_{\text{site}} \sim {\text{Poi}}\left( \theta_{\text{site}} \right),\quad \theta_{\text{site}} = \sum_{{\text{patient}}\,\in\,{\text{site}}} \theta_{\text{patient}} .$$
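This additivity can be checked numerically; the following sketch (with invented per-visit parameters, not study data) verifies that the sum of independent per-visit Poisson draws behaves like a single Poisson draw with the summed parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-visit Poisson parameters for one patient's five visits.
theta_visits = np.array([0.4, 0.7, 1.1, 0.5, 0.3])
theta_patient = theta_visits.sum()  # additivity: the Poisson parameters add up

# Summing independent per-visit Poisson draws for each simulated patient ...
per_visit = rng.poisson(theta_visits, size=(100_000, len(theta_visits)))
patient_counts = per_visit.sum(axis=1)

# ... matches drawing once from Poi(theta_patient) directly.
direct = rng.poisson(theta_patient, size=100_000)
print(patient_counts.mean(), direct.mean(), theta_patient)
```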
Furthermore, assuming our estimate of \(\theta_{\text{site}}\) was accurate, we could calculate the significance level of an observation of \(y_{\text{site}}\) adverse events,
$$S\left( x_{\text{site}}, y_{\text{site}} \right) = P\left( Y_{\text{site}} \le y_{\text{site}} \,|\, x_{\text{site}} \right) = \sum_{k = 0}^{y_{\text{site}}} \frac{\theta_{\text{site}}^{k}}{k!}\, e^{-\theta_{\text{site}}} .$$
Even if these assumptions did not hold perfectly and \(P\left( {Y_{\text{site}} \le y_{\text{site}} | x_{\text{site}} } \right)\) was thus not a well-calibrated probability, we could still use it as a scoring function to detect under-reporting and evaluate its discriminating power with a ROC curve.
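In other words, the scoring function is the Poisson cumulative distribution function evaluated at the observed count. A minimal sketch, assuming an estimate of \(\theta_{\text{site}}\) is available (the function name and numeric values are illustrative only):

```python
from scipy.stats import poisson

def significance_level(theta_site: float, y_site: int) -> float:
    """P(Y_site <= y_site) under Poi(theta_site), used as the under-reporting score."""
    return poisson.cdf(y_site, mu=theta_site)

# A site for which the model predicts ~20 AEs but which reported only 8
# receives a low significance level and would be flagged for review,
# while a site reporting 19 looks unremarkable.
low = significance_level(20.0, 8)
typical = significance_level(20.0, 19)
print(low, typical)
```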
Data
Raw Data
The raw data set we used came from Roche/Genentech-sponsored clinical trials. We used common data attributes from 104 completed studies that covered various molecule types and disease areas. The data set included 3231 individual investigator sites, with 18,682 study subjects who underwent 288,254 study visits. Of note, all study subject data were used in a de-identified format. To mitigate the risk of having studies with under-reporting in our data set, we used only data from completed and terminated clinical trials, where AE reconciliation and SDV had been performed as part of the study closure activities. The six common patient data attributes across the studies that we selected in our curated data set were demographics, medical history, concomitant medications, vitals, visits, and adverse events, following the Study Data Tabulation Model (SDTM) standard [15]. As mentioned above, we focused on the visits, which we labeled by study code, patient number, and visit date. We also considered study attributes available in the Roche Clinical Trial Management System (CTMS) and included them in our data set: study type, route of administration, concomitant agents, disease area, blinding, randomization, and study phase. We used a different classification for the molecule classes and the disease areas from the one used in the Roche CTMS to ensure their clinical relevance in terms of AE reporting. Molecules were classified using the Anatomical Therapeutic Chemical (ATC) classification system [16]. For the disease areas, we used a simple classification that reflects the populations enrolled in our clinical trials (healthy participants, malignancies, autoimmune diseases, neurodegenerative diseases, respiratory diseases, skin disorders, lung diseases, infectious diseases, others).
As we needed to have a model that can generalize to the diversity and volume of clinical studies we run at Roche/Genentech, we purposely chose study and patient attributes that are systematically captured in our clinical programs. See Table 1 below for an overview of our curated data set.
Table 1 Attributes available in our curated data set
Features and Targets
Each AE was assigned to the first visit following its onset date, and all AEs assigned to a specific visit were aggregated into the observation \(y_{\text{visit}}\), which we tried to predict.
To construct features, we needed to project all data attributes to the visit level. For demographic characteristics that were constant, such as sex and ethnicity, or had a direct dependence on the date, such as age, this was straightforward. For medical history, we counted the events that occurred before each visit. Since new entries from screening in the medical history section of the electronic case report form (eCRF) normally correspond to AEs that should get reported, they provide a strong signal. Similarly, we counted concomitant medications, because the more drugs a patient receives, the more AEs they are likely to experience [14, 17]. From the vitals reported at each visit, we included blood pressure and its relative variation since the previous visit. We also used patient weight, its relative variation since the previous visit, and its trend over the last 3 weeks as attributes, as a change in weight could be linked to a worsening of health and hence the occurrence of AEs. The disease area, the molecule class and mechanism of action, and the route of administration were also included as categorical features, as these characteristics have a strong influence on the type and number of AEs [14]. We picked the drug class instead of the molecule itself as a feature to ensure generalization to previously unseen drugs, accepting an increase in bias in exchange for a reduction in variance. For a selection of the created features and how they correlate with the number of reported AEs, see Electronic Supplementary Material 1.
Before regrouping the features in the vector \(x_{\text{visit}}\), we used one-hot encoding on the categorical variables, we raised the age variable to the power 1.4 in order to have a roughly normal distribution, and we standardized the continuous variables.
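These preprocessing steps can be sketched with pandas and scikit-learn as follows (the toy data and column names are invented for the example; the actual pipeline was written in PySpark):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical visit-level data; column names are illustrative only.
visits = pd.DataFrame({
    "age": [34, 61, 47, 29],
    "n_conmeds": [2, 7, 4, 0],
    "disease_area": ["malignancy", "autoimmune", "malignancy", "respiratory"],
})

# One-hot encode the categorical variables.
X = pd.get_dummies(visits, columns=["disease_area"])

# Raise age to the power 1.4 to obtain a roughly normal distribution.
X["age"] = X["age"] ** 1.4

# Standardize the continuous variables.
continuous = ["age", "n_conmeds"]
X[continuous] = StandardScaler().fit_transform(X[continuous])
print(X.head())
```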
Once the set of features was selected, we relied on machine learning algorithms to pick the best ones through optimization of a loss function.
In our model, we used 54 features, with the highest contribution coming from the following ones:
- Number of previous visits made by the patient
- Cumulative count of concomitant medications up to the current visit
- Disease is a malignancy (Boolean)
- Disease is pulmonary but non-malignant (Boolean)
- Administration is oral (Boolean)
See Electronic Supplementary Material 2 for the full list of features used in the final model.
Training, Validation, and Test Sets
As in most machine learning projects, we split our data into a training, a validation, and a test set. The training set was used to minimize the loss function with respect to the parameters of the model, the validation set to control for overfitting and to pick the hyper-parameters of the model via grid search, and the test set finally to assess the generalization performance to new data [18]. In our case, the test set was also used for the simulation of under-reporting introduced in the outline.
It should be noted that we could not randomly assign each pair \(\left( {x_{\text{visit}} ,y_{\text{visit}} } \right)\) to one of the three sets, as we were ultimately interested in \(y_{\text{site}}\), the count of adverse events reported by a single site. We needed to work on subsets \(V_{\text{site}} = \left\{ {\left( {x_{\text{visit}} ,y_{\text{visit}} } \right) | {\text{visit}} \in {\text{site}}} \right\}\) and assign each of them to one of the training, validation, and test sets. At the level of the prediction for \(y_{\text{visit}}\), this prevented data leakage due to a patient's visits ending up in two different sets.
We assumed that the molecule class had a significant influence on the number of AEs [17]; therefore, we decided to stratify the sites by this factor when splitting them into the training, validation, and test sets, to ensure a representation of every class in each set.
While respecting these constraints, we tried to assign roughly 60% of the sites to the training set and 20% each to the validation and test sets.
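A site-level split stratified by molecule class along these lines might look as follows (a sketch with scikit-learn on invented site data; the actual assignment procedure may have differed):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical table of sites with their molecule class (names illustrative).
sites = pd.DataFrame({
    "site_id": range(100),
    "molecule_class": ["L01", "L04", "J05", "N06"] * 25,
})

# First carve off the 60% training portion, stratified by molecule class ...
train_sites, rest = train_test_split(
    sites, train_size=0.6, stratify=sites["molecule_class"], random_state=0)

# ... then split the remainder evenly into validation and test (20% each).
val_sites, test_sites = train_test_split(
    rest, train_size=0.5, stratify=rest["molecule_class"], random_state=0)

print(len(train_sites), len(val_sites), len(test_sites))  # 60 20 20
```

All visits of a given site then inherit that site's assignment, which is what prevents patient-level leakage across the three sets.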
Under-Reporting Simulation
In order to evaluate how the significance level \(S\left( {x_{\text{site}} ,y_{\text{site}} } \right)\) discriminates under-reporting anomalies from normal behavior, we had to simulate under-reporting sites due to the lack of real-world examples where all necessary data attributes had been captured. To do so, we picked a sample \(E_{\text{UR}}\) of the test set \(E_{\text{test}}\) where we artificially lowered the AE count \(y_{\text{site}}\) to simulate under-reporting. Explicitly, for each pair \(\left( {x_{\text{site}} ,y_{\text{site}} } \right) \in E_{\text{UR}}\) from this sample of the test set, we built an under-reporting pair \(\left( {x_{\text{site}} ,\underline{y}_{\text{site}} } \right)\), with \(\underline{y}_{\text{site}} < y_{\text{site}}.\) How much smaller than \(y_{\text{site}}\) depended on how we wanted to define under-reporting, which required input from subject matter experts. We defined three types of scenarios (described below), one following a statistical approach, one reducing all AEs by a fixed ratio, and one simulating absence of reporting.
The negative cases \(\left\{ {\left( {x_{\text{site}} ,y_{\text{site}} ,l_{\text{site}} = 0} \right) | {\text{site}} \in E_{\text{test}} } \right\}\) from the test set, where \(l_{\text{site}}\) denotes the label for the classification problem, could then be merged with the positive cases \(\left\{ {\left( {x_{\text{site}} ,\underline{y}_{\text{site}} ,l_{\text{site}} = 1} \right) | {\text{site}} \in E_{\text{UR}} \subset E_{\text{test}} } \right\}\) from the simulated under-reporting set to form the classification test set, from which we could build a ROC curve for the significance levels \(S\left( {x_{\text{site}} ,y_{\text{site}} } \right)\) and \(S\left( {x_{\text{site}} ,\underline{y}_{\text{site}} } \right)\). We selected a sample instead of the whole test set to exclude sites where the difference between \(y_{\text{site}}\) and \(\underline{y}_{\text{site}}\) would be too small to be worrisome from a quality perspective and would therefore add unnecessary noise to the evaluation of the models. In defining the under-reporting scenarios, we thus had to specify \(\underline{y}_{\text{site}}\) as a function of \(y_{\text{site}}\) and decide which sites to keep in \(E_{\text{UR}}.\)
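This evaluation can be sketched end to end on synthetic numbers: simulate compliant sites, halve the AE counts of a subsample to create positives, score everything with the significance level, and compute the area under the ROC curve (all values below are invented for illustration):

```python
import numpy as np
from scipy.stats import poisson
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Hypothetical site-level predictions theta_site and observed AE counts.
theta = rng.uniform(10, 40, size=200)
y_normal = rng.poisson(theta)            # negative cases: l_site = 0
y_under = rng.poisson(theta)[:50] // 2   # positive cases: l_site = 1 (50% under-reporting)

# Significance level S = P(Y_site <= y_site) serves as the score;
# under-reporting sites should receive low values.
scores = np.concatenate([poisson.cdf(y_normal, theta),
                         poisson.cdf(y_under, theta[:50])])
labels = np.concatenate([np.zeros(200), np.ones(50)])

# Low score = suspicious, so rank by the negated score for the ROC.
auc = roc_auc_score(labels, -scores)
print(f"AUC = {auc:.2f}")
```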
Statistical Scenario
The ‘statistical scenario’ relied on the assumption that the total number of AEs reported by a single site followed a Poisson distribution, \(Y_{\text{site}} \sim {\text{Poi}}\left( {\theta_{\text{site}} } \right)\). Our best estimate for \(\theta_{\text{site}}\) was given by the observed number \(y_{\text{site}}\) of AEs, and a low number of reported AEs could be defined as the first percentile of this distribution \({\text{Poi}}\,\left( {y_{\text{site}} } \right)\),
$$\underline{y}_{\text{site}} = Q_{{{\text{Poi}}\left( {y_{\text{site}} } \right)}} \left( {0.01} \right),$$
where \(Q_{D}\) denotes the quantile function of probability distribution \(D\). Table 2 summarizes a few values of this function. We kept in the under-reporting sample \(E_{\text{UR}}\) only the sites with \(y_{\text{site}} \ge 8\).
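This simulation rule is directly expressible with the Poisson quantile function available in scipy (the function name below is ours, chosen for the example):

```python
from scipy.stats import poisson

def simulate_under_reporting(y_site: int) -> int:
    """First percentile of Poi(y_site): the simulated under-reported AE count."""
    return int(poisson.ppf(0.01, mu=y_site))

# The simulated counts grow with the observed ones but stay well below them.
for y in (8, 20, 50, 100):
    print(y, "->", simulate_under_reporting(y))
```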
Table 2 Examples of simulated values of under-reporting in the statistical scenario
Ratio Scenarios
In the ‘ratio scenarios’, we arbitrarily kept a fixed fraction of AEs. We tried several values, namely \(\underline{y}_{\text{site}} = 0.75 \times y_{\text{site}}\) (25% under-reporting), \(\underline{y}_{\text{site}} = 0.5 \times y_{\text{site}}\) (50% under-reporting), \(\underline{y}_{\text{site}} = 0.33 \times y_{\text{site}}\) (67% under-reporting), \(\underline{y}_{\text{site}} = 0.25 \times y_{\text{site}}\) (75% under-reporting), and \(\underline{y}_{\text{site}} = 0.10 \times y_{\text{site}}\) (90% under-reporting), and again we kept in the under-reporting sample \(E_{\text{UR}}\) only the sites with \(y_{\text{site}} \ge 8\).
Zero Scenario
The ‘zero scenario’ simulated the absence of reporting from the smaller sites, so we set \(\underline{y}_{\text{site}} = 0\) and retained only those with 10 patients or fewer but at least six reported AEs in total for the positive cases. In our test set, those represented 329 sites out of 643.
Machine Learning Algorithm
The problem of modeling the number of adverse events reported on a given visit as a Poisson process, \(Y_{\text{visit}} \sim {\text{Poi}}\left( {\theta_{\text{visit}} } \right)\), could be tackled with machine learning. Given observations \(x_{\text{visit}}\) and \(y_{\text{visit}}\) of the features and numbers of reported AEs, the goal was to find an approximation \(f\left( {x_{\text{visit}} } \right)\) of \(y_{\text{visit}}\) that minimizes a loss function,
$$L\left( f \right) = \sum_{\text{visit}} l\left( y_{\text{visit}}, f\left( x_{\text{visit}} \right) \right),$$
where the sum runs over all visits in the training set and the individual loss \(l\left( {y_{\text{visit}} ,f\left( {x_{\text{visit}} } \right)} \right)\) penalizes inaccuracy in the individual prediction of \(y_{\text{visit}}\). Its exact form depends on the type of modeling. For Poisson processes, it is the Poisson deviance
$$l\left( {y_{\text{visit}},\,f\left( {x_{\text{visit}} } \right)} \right) = 2\left( {y_{\text{visit}} \,{ \log }\frac{{y_{\text{visit}} }}{{f\left( {x_{\text{visit}} } \right)}} - y_{\text{visit}} + f\left( {x_{\text{visit}} } \right)} \right).$$
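A direct NumPy implementation of this deviance might look as follows (using scipy's `xlogy`, which returns 0 when its first argument is 0, so visits with zero reported AEs need no special case):

```python
import numpy as np
from scipy.special import xlogy  # xlogy(0, z) == 0 by convention

def poisson_deviance(y, f):
    """Per-visit Poisson deviance 2*(y*log(y/f) - y + f)."""
    return 2.0 * (xlogy(y, y / f) - y + f)

y = np.array([0.0, 1.0, 5.0])  # observed AE counts
f = np.array([0.5, 1.0, 4.0])  # predictions; deviance is zero where prediction is exact
print(poisson_deviance(y, f))
```

This matches scikit-learn's `mean_poisson_deviance` up to averaging over visits.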
Several algorithms are suitable for optimizing this loss function; the most commonly used are generalized linear models [19], gradient boosting machines [20], and neural networks. We dismissed neural networks, as we felt the limited signal-to-noise ratio did not justify the investment in computational power and architecture design. We tried the other two algorithms and obtained the best performance with gradient boosting machines, so we settled on this one. A thorough introduction can be found in The Elements of Statistical Learning: Data Mining, Inference and Prediction [21], but we provide a brief overview of the algorithm here.
A regression tree solves this optimization problem by successively splitting regions of the feature space in halves and assigning a value for \(f\left( {x_{\text{visit}} } \right)\) to each region of the final partition. While the accuracy of a single tree is fairly low, ensemble methods such as gradient boosting machines or random forests aggregate the predictions of many trees in a weighted average and achieve much better performance. A gradient boosting machine constructs this average iteratively: it starts with a simple estimate and successively updates its current prediction with a new tree that tries to replicate the current gradient of the loss function. This approach was inspired by the gradient descent methods widely used in optimization, which gave the algorithm its name.
Implementation
We stored our data in a Hadoop [22] cluster to ensure scalability to an arbitrary number of studies, with the data preprocessing and feature engineering coded in PySpark. Several software packages offer more or less sophisticated implementations of gradient boosting machines. They mainly differ by the way single trees are fit to the current gradient of the loss function and by different performance optimizations. We used the Sparkling Water [23] implementation of H2O, which would allow our entire pipeline to be easily exported as a Spark application if we decided, for instance, to move to a cloud-based solution.