The Use of ROC Analysis for the Qualitative Prediction of Human Oral Bioavailability from Animal Data

This study aimed to develop and evaluate a tool for the qualitative prediction of human oral bioavailability (Fhuman) from animal oral bioavailability (Fanimal) data employing ROC analysis, and to identify the optimal thresholds for such predictions. A dataset of 184 compounds with known Fhuman and Fanimal in at least one species (mouse, rat, dog and non-human primates (NHP)) was employed. A binary classification model for Fhuman was built by setting a threshold for high/low Fhuman at 50%. The thresholds for high/low Fanimal were varied from 0 to 100% to generate the ROC curves. Optimal thresholds were derived from ‘cost analysis’ and the outcomes with respect to false negative and false positive predictions were analyzed against the BDDCS class distributions. ROC curves were successfully built for the combined dataset and for each individual species. Optimal Fanimal thresholds were found to be 67% (mouse), 22% (rat), 58% (dog), 35% (NHP) and 47% (combined dataset). No significant trends were observed when sub-categorizing the outcomes by BDDCS class. Fanimal can predict high/low Fhuman with adequate sensitivity and specificity. This methodology and the associated thresholds can be employed as part of decisions related to planning necessary studies during development of new drug candidates and lead selection.


INTRODUCTION
Oral bioavailability (F) is considered a key parameter during drug development. It can be defined as the fraction of an orally administered dose that reaches the systemic circulation in its unchanged form and thereby becomes available at its intended site of action to produce the desired therapeutic effect. Even though the amount of drug at the site of action is difficult to measure, it is usually assumed to be proportional to the amount of drug in plasma/blood. Oral bioavailability depends upon the fraction of the dose that is absorbed in the gastrointestinal (GI) tract (fa), as well as the fractions that escape first-pass metabolism in the GI tract and the liver, FG and FH, respectively (Eq. 1) (1,2).

$$F = f_a \cdot F_G \cdot F_H \qquad (1)$$
Despite its importance, information on oral bioavailability is not always available during the development stages, as its determination usually requires the administration of an intravenous (iv) dose as a reference. However, for safety and solubility reasons an iv dose is not always available; therefore, oral bioavailability is generally unknown until later stages of the development process. In addition, low oral bioavailability is generally associated with higher interindividual variability (3).
It is common practice during drug development to employ animal models for the in vivo determination of the safety, efficacy and pharmacokinetic properties of a drug candidate (1). The main goal of such studies is to predict a drug's behaviour in humans based on animal data. The relationship between animal and human oral bioavailability has been studied on several occasions. One of the first attempts to investigate this relationship was made by Sietsema in 1989, who found a poor correlation between animal (rodents, dogs and non-human primates (NHP)) and human oral bioavailability (4). The lack of correlation could be explained by interspecies differences in the factors governing oral bioavailability, such as morphophysiological differences in the GI tract and the abundance and regional distribution of transporters and metabolic enzymes, given that physicochemical properties are inherent to the drug and/or formulation (1,4-11). Nonetheless, correlations have been established for intestinal permeability and/or the fraction of dose absorbed between humans and preclinical species, in particular for rat and NHP (10,12-14), suggesting that for those particular species, bioavailability differences may arise from the metabolic component.
Based on the data of Sietsema (1989), Grass and Sinko (15) plotted the relationship between animal and human oral bioavailability. The points were widely scattered and therefore no trends were observed between animal and human bioavailability. However, the number of data points in the plot differed from that in the original publication by Sietsema (1989); this, together with issues regarding species definitions and the fact that the publication was based on data from 1989, suggested that a new dataset was needed in order to perform a comprehensive analysis of the relationship between animal and human oral bioavailability. A recent study by Musther and co-workers (16) addressed those needs by introducing a more comprehensive and updated dataset, and investigated the correlations between animal and human oral bioavailability. Their results were in agreement with the analyses performed in the past by several groups (4,9,13,17,18). Amongst the species studied, NHP showed the strongest correlation with human, whereas dog, rat and mouse showed poor correlations. However, the large prediction intervals suggest that a pointwise correlation between human and preclinical species is not plausible. In terms of qualitative predictions, the study showed that mouse, rat and NHP underpredict human bioavailability. More interestingly, the median of the ratio between animal and human bioavailability was close to unity; however, the large intervals for the median ratio suggest that these results should be treated carefully. Despite the suggested lack of predictability of human oral bioavailability from preclinical species, the models are still employed during drug development. The information gathered from oral bioavailability studies in animal models is employed as part of the decision-making process on whether or not to continue with the development of any particular drug (19,20).
Marketed drugs and drug candidates can be classified according to their extent of metabolism by means of the Biopharmaceutics Drug Disposition Classification System (BDDCS) (21). Similarly to the Biopharmaceutics Classification System (BCS) (22), the BDDCS divides compounds into four classes based on their aqueous solubility and permeability. However, the main difference between the two systems is the permeability component. In the BCS, permeability relates to the intestinal permeability rate and the extent of absorption, whereas in the BDDCS permeability relates to the permeability rate in the intestine and/or liver, which was found to be correlated with the extent of metabolism (21,23,24). The BDDCS defines a highly soluble compound as a compound whose highest regulatory approved strength is able to dissolve in 250 mL (or less) of water throughout a pH range of 1 to 7.5 at 37°C. Likewise, the BDDCS defines highly permeable compounds as compounds where 70% or more of the administered oral dose is metabolized. A compound is considered poorly metabolized if 50% or more of the administered dose is excreted in the urine or bile in its unchanged form (25-28). Thus, BDDCS Class 1 compounds are highly soluble and highly metabolized, Class 2 compounds are poorly soluble and highly metabolized, Class 3 compounds are highly soluble and poorly metabolized and Class 4 compounds are poorly soluble and poorly metabolized. In addition, it has been suggested that the BDDCS class can be useful to estimate the impact of intestinal transporters on drug absorption and metabolism, as well as to anticipate possible food effects and clinically relevant drug-drug interactions (21,25-29).
A model can be evaluated based upon its ability to correctly predict a particular outcome, and the prediction performance of a binary classification model can be evaluated by means of Receiver Operating Characteristic (ROC) analysis. The ROC space consists of a plot of the true positive rate (TPR) as a function of the false positive rate (FPR). A binary classification is represented by a single point in the ROC space, where a perfect classification will have an FPR of 0 and a TPR of 1. Likewise, a continuous system can be represented by a curve in the ROC space, and the area under the ROC curve (AUC) can be employed as a measure of the performance of the predictions made from the classification system or model. An AUC of 1 corresponds to a perfect classification/prediction and an AUC of 0.5 corresponds to a random classification/prediction (30-32).
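As an illustration of how a ROC curve and its AUC arise from a threshold sweep, the following is a minimal sketch in plain Python. The function names (`roc_points`, `auc_trapezoid`) and the four data points are hypothetical and are not taken from the study; the AUC is computed by the trapezoidal rule, which is one common choice.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, TPR) pair per threshold; score >= threshold is called positive."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule, anchored at (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Hypothetical data: a continuous score (%) and the true binary class (1 = positive).
scores = [10, 30, 20, 40]
labels = [0, 0, 1, 1]

points = roc_points(scores, labels, range(0, 101))
auc = auc_trapezoid(points)
print(auc)
```

In this toy example three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75, between the random (0.5) and perfect (1) cases.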
In the present study, we introduce a new model for the categorical prediction of human oral bioavailability, stated as low or high, from animal data by employing a threshold decision tool based upon ROC analysis. Oral bioavailability for more than 180 compounds was analysed in different preclinical species to generate animal oral bioavailability thresholds that can be employed for the qualitative prediction of human oral bioavailability. Furthermore, the resulting classifications were compared with the BDDCS classification of the compounds employed for the analysis.

Dataset Employed
A total of 184 different compounds with reported oral bioavailability in both human and preclinical species, namely mouse, rat, dog and NHP, were employed in this study. The oral bioavailability values for the compounds were collated from the literature by Musther and co-workers, as described elsewhere (16).

Binary Classification and ROC Analysis
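A binary classification model was implemented by establishing a threshold (tH) for high and low human bioavailability at 50% (Eq. 2). A positive outcome (high Fhuman) was defined when Fhuman was greater than or equal to 50%, while a negative outcome (low Fhuman) occurred when Fhuman was less than 50%.

$$\text{Outcome} = \begin{cases} \text{positive (high } F_{human}\text{)}, & F_{human} \ge t_H \\ \text{negative (low } F_{human}\text{)}, & F_{human} < t_H \end{cases} \qquad (2)$$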
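In a similar fashion, Eq. 2 was modified to classify high and low Fanimal by setting a threshold for animal oral bioavailability (tA), as per Eq. 3.

$$\text{Prediction} = \begin{cases} \text{positive (high } F_{animal}\text{)}, & F_{animal} \ge t_A \\ \text{negative (low } F_{animal}\text{)}, & F_{animal} < t_A \end{cases} \qquad (3)$$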
Thus, for Fanimal-based predictions, false positives (FP) were compounds with high Fanimal and low Fhuman, and false negatives (FN) were compounds with low Fanimal and high Fhuman (Fig. 1). If both Fanimal and Fhuman were high, the compound was classified as a true positive (TP); similarly, if both Fanimal and Fhuman were low, the compound was considered a true negative (TN). The predictions based on the animal bioavailability data were evaluated by calculating their sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) for the determined animal thresholds (tA), as shown in Table I.
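The classification into TP, FP, TN and FN and the four derived measures can be sketched as follows. This is a minimal illustration in plain Python using the standard definitions of sensitivity, specificity, PPV and NPV; the function names and the six bioavailability pairs are hypothetical, and the animal threshold of 50% is chosen only for the example.

```python
def confusion_counts(f_animal, f_human, t_a, t_h=50.0):
    """Count TP, FP, TN, FN for animal threshold t_a and human threshold t_h (in %)."""
    tp = fp = tn = fn = 0
    for fa, fh in zip(f_animal, f_human):
        predicted_high = fa >= t_a   # high F in the animal model
        actual_high = fh >= t_h      # high F in human
        if predicted_high and actual_high:
            tp += 1
        elif predicted_high and not actual_high:
            fp += 1
        elif not predicted_high and not actual_high:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def prediction_metrics(tp, fp, tn, fn):
    """Standard definitions; assumes every denominator is non-zero."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical bioavailability values (%), paired by compound.
f_animal = [80, 60, 30, 10, 55, 20]
f_human = [90, 40, 70, 5, 65, 15]

counts = confusion_counts(f_animal, f_human, t_a=50.0)
metrics = prediction_metrics(*counts)
print(counts, metrics)
```

Sensitivity and specificity are conditioned on the true human class, whereas PPV and NPV are conditioned on the animal-based prediction and therefore depend on the prevalence of high Fhuman in the dataset.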
All statistical analyses and the ROC curve generation were implemented with Matlab® 2012a and its statistical toolbox (The MathWorks Inc., Natick, MA, USA). The ROC curves, for the overall dataset and by species, were constructed by varying tA from 0 to 100 and recording the error rates (Table I) for each threshold. The overall (all species combined) ROC curve was constructed by considering all the data points within the dataset, including the compounds with Fanimal values in more than one preclinical species. In addition, bootstrap random resampling (n = 10000), with replacement, was performed to determine the average ROC curves and the bias-corrected confidence intervals for the resulting AUC, sensitivity and specificity. The curves were averaged with the threshold averaging function within Matlab®. The significance of the differences between the resulting AUCs (for each species and the overall dataset) and the random classification (AUC = 0.5) was calculated by the non-parametric Mann-Whitney U test. The optimal thresholds for the averaged ROC curves were determined as the intersection of a line of slope S (Eq. 4) with the averaged ROC curve.

$$S = \frac{cost(FP) - cost(TN)}{cost(FN) - cost(TP)} \cdot \frac{N}{P} \qquad (4)$$
where cost(FP) and cost(FN) are the costs of FP and FN, respectively; cost(TN) and cost(TP) are the costs of TN and TP, respectively; and P and N are the numbers of positive and negative values, respectively, based on Fhuman data (32). The net costs for TN and TP were assumed to be 0, and initially the net costs for FP and FN were assumed to be equal. Subsequently, the cost ratio between FP and FN was varied in order to evaluate the impact of the cost assumptions on the determination of the optimal threshold points.
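The cost-based selection can be sketched in plain Python as shown below. The sketch assumes the slope takes the form S = [cost(FP) - cost(TN)] / [cost(FN) - cost(TP)] x N/P, consistent with the definitions above, and uses the fact that the first ROC point touched by a line of slope S lowered from the upper-left corner is the point with the largest intercept TPR - S x FPR. The function names and the eight data points are hypothetical; ties are resolved in favour of the lowest threshold, which is a choice made for this example only.

```python
def cost_slope(n_pos, n_neg, cost_fp=1.0, cost_fn=1.0, cost_tn=0.0, cost_tp=0.0):
    """Slope S of the iso-cost line in ROC space."""
    return (cost_fp - cost_tn) / (cost_fn - cost_tp) * n_neg / n_pos


def optimal_threshold(f_animal, high_human, thresholds, cost_fp=1.0, cost_fn=1.0):
    """Threshold whose ROC point maximises the intercept TPR - S * FPR."""
    n_pos = sum(1 for y in high_human if y)
    n_neg = len(high_human) - n_pos
    s = cost_slope(n_pos, n_neg, cost_fp, cost_fn)
    best_t, best_intercept = None, None
    for t in thresholds:
        tpr = sum(1 for f, y in zip(f_animal, high_human) if f >= t and y) / n_pos
        fpr = sum(1 for f, y in zip(f_animal, high_human) if f >= t and not y) / n_neg
        intercept = tpr - s * fpr
        if best_intercept is None or intercept > best_intercept:
            best_t, best_intercept = t, intercept
    return best_t


# Hypothetical data: animal F (%) and whether human F is high (1) or low (0).
f_animal = [5, 15, 25, 35, 45, 55, 65, 75]
high_human = [0, 0, 0, 1, 0, 1, 1, 1]
grid = range(0, 101)

t_fp_costly = optimal_threshold(f_animal, high_human, grid, cost_fp=2.0, cost_fn=1.0)
t_fn_costly = optimal_threshold(f_animal, high_human, grid, cost_fp=1.0, cost_fn=2.0)
print(t_fp_costly, t_fn_costly)
```

In this toy dataset, doubling the cost of a false positive moves the selected threshold up (to 46), while doubling the cost of a false negative moves it down (to 26), which is the qualitative behaviour expected when the FP/FN cost ratio is varied.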
For rat, dog and NHP, alternative thresholds were determined by visually comparing the relationship between the animal threshold (tA) and its resulting sensitivity and specificity; the points closest to the intersection of the sensitivity and specificity lines were considered as alternative thresholds. The new thresholds were compared against the optimal thresholds derived from cost analysis by employing Youden's index (J) (Eq. 5), where the maximum value of J is 1 for a perfect classification, whereas the minimum value is 0 for a threshold with no predictive power (33).

$$J = \text{sensitivity} + \text{specificity} - 1 \qquad (5)$$
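A numerical stand-in for this comparison is sketched below in plain Python. The study identified the alternative thresholds by visual inspection; here, as an assumption made for illustration, the alternative threshold is taken as the lowest grid value at which the absolute difference between sensitivity and specificity is smallest. Youden's index uses its standard definition, J = sensitivity + specificity - 1. The function names and data are hypothetical.

```python
def sens_spec(f_animal, high_human, t_a):
    """Sensitivity and specificity when F_animal >= t_a is called 'high'."""
    n_pos = sum(1 for y in high_human if y)
    n_neg = len(high_human) - n_pos
    tp = sum(1 for f, y in zip(f_animal, high_human) if f >= t_a and y)
    tn = sum(1 for f, y in zip(f_animal, high_human) if f < t_a and not y)
    return tp / n_pos, tn / n_neg


def youden_j(sensitivity, specificity):
    """Youden's index: 1 for a perfect classifier, 0 for no predictive power."""
    return sensitivity + specificity - 1.0


def balanced_threshold(f_animal, high_human, thresholds):
    """Lowest threshold at which |sensitivity - specificity| is smallest."""
    def gap(t):
        sens, spec = sens_spec(f_animal, high_human, t)
        return abs(sens - spec)
    return min(thresholds, key=gap)


# Hypothetical data: animal F (%) and whether human F is high (1) or low (0).
f_animal = [5, 15, 25, 35, 45, 55, 65, 75]
high_human = [0, 0, 0, 1, 0, 1, 1, 1]

t_alt = balanced_threshold(f_animal, high_human, range(0, 101))
j_alt = youden_j(*sens_spec(f_animal, high_human, t_alt))
j_other = youden_j(*sens_spec(f_animal, high_human, 46))
print(t_alt, j_alt, j_other)
```

In this toy example the balanced threshold (36) gives equal sensitivity and specificity of 0.75 but a lower J (0.5) than a threshold of 46 (J = 0.75), showing that balancing the two rates does not necessarily maximise overall predictive power.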
An additional ROC analysis was performed for the compounds in the overall dataset whose Fhuman values were ≤20% (very low oral bioavailability); for these compounds the human threshold for high/low oral bioavailability was set at 10%. The analysis was conducted using rat, dog and NHP data, and the significance of the AUC and the optimal thresholds were determined as described above (Eq. 4).
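The bootstrap estimation of a confidence interval for the AUC, as used in this section, can be sketched in plain Python as follows. Note the simplifications: the sketch uses a simple percentile interval rather than the bias-corrected interval employed in the study, it computes the AUC from its Mann-Whitney interpretation (the fraction of positive/negative pairs ranked correctly, with ties counted as one half), and it assumes the data contain both classes. The function names, the seed, the number of resamples and the data are hypothetical.

```python
import random


def auc_mann_whitney(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (resampling with replacement)."""
    rng = random.Random(seed)
    n = len(scores)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(1 for y in ys if y) < n:  # skip resamples lacking one class
            aucs.append(auc_mann_whitney([scores[i] for i in idx], ys))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical data: animal F (%) and whether human F is high (1) or low (0).
f_animal = [5, 15, 25, 35, 45, 55, 65, 75]
high_human = [0, 0, 0, 1, 0, 1, 1, 1]

auc = auc_mann_whitney(f_animal, high_human)
ci_low, ci_high = bootstrap_auc_ci(f_animal, high_human)
print(auc, ci_low, ci_high)
```

With only eight compounds the interval is wide, which mirrors the caution needed when interpreting ROC curves built from small species-specific subsets.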

BDDCS Classification
The BDDCS classification for 155 of the compounds in the current study was obtained from the dataset published by Benet and co-workers (34), while for the remaining compounds a provisional BDDCS classification was assigned. All individual drugs used for the analysis and their allocated BDDCS classification are listed in Tables SI and SII of the Supplementary Material. For the classification, the extent of metabolism and/or the percentage of the dose excreted unchanged in the urine and/or bile was collated from the literature. With regard to the solubility determination, for US-FDA approved drugs the maximum dose strength was taken from the label when possible; otherwise the maximum dose strength was extracted from published data. Aqueous solubility values were taken from published data when possible; otherwise the "Mass Solubility" value within the SciFinder® chemical properties was employed (35). High/low solubility was assigned using Eq. 6, where Dn is the dose number, Dmax is the maximum dose strength (mg) and Cs is the aqueous solubility (mg/mL). High solubility was assigned to compounds with Dn equal to or greater than 1, while low solubility was assigned to compounds with a Dn less than 1.
The class distribution was then compared within each of the threshold-based outcome groups (i.e. true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN)) for rat, dog and NHP. The significance of the differences between the BDDCS class distributions within each of the outcome groups (i.e. TP, TN, FP, FN) was evaluated by visual inspection of the plots and by employing Fisher's exact test as implemented in the R statistical package (http://www.r-project.org/).

RESULTS
The analysed dataset comprised a total of 318 data points for animal and human oral bioavailability, divided into 30, 122, 125 and 41 animal-human pairs for mouse, rat, dog and NHP, respectively (Fig. 2a). For some of the compounds in the dataset, oral bioavailability values were available for more than one species, as shown in Fig. 2b for rat, dog and NHP. A summary of the resulting ROC curves for the total dataset of 318 Fanimal values and for each individual species is shown in Table II. Combination of all Fanimal values resulted in a smooth ROC curve with an AUC of around 0.8 (Fig. 3). A similar outcome was observed employing individual dog and rat data (Fig. 4b, c), with the AUC for dog data slightly higher than for the rat (0.8 and 0.7, respectively). In contrast, the ROC curves generated for mouse and NHP were step-like rather than smooth (Fig. 4a, d), which was primarily attributed to the limited number of data points employed in their construction. However, the AUC values of the latter ROC curves were higher than for the overall dataset and the rest of the species. Interestingly, NHP showed an AUC value close to unity (0.96). The significance tests showed that the above AUCs, for all species combined and for each individual species, were significantly different from the random classification (AUC = 0.5), although the p-value for mouse data was slightly higher than for the rest of the species.
The 'cost analysis' determinations of the optimal Fanimal thresholds are summarized in Tables III and IV. For the overall dataset, the threshold was 47%, in agreement with the 50% value employed for the human bioavailability threshold, with resulting specificity and sensitivity of 0.82 and 0.66, respectively. A similar situation occurred for dog data, though the threshold was slightly higher than for the overall dataset, whereas specificity and sensitivity were balanced with values close to 0.75. The highest threshold was found when employing mouse data (tA = 67%), which gave rise to perfect specificity, but this should be viewed with caution given the limited data available (n = 30). Rat and NHP showed thresholds <50%, whilst the corresponding specificity for the rat was lower than for any other species investigated. In agreement with the AUC values, NHP showed the highest specificity and sensitivity for its optimal threshold. Details of the drugs classified as

The NPV and PPV values showed an opposite relationship to their corresponding sensitivities and specificities; this trend was evident for all of the species and the overall dataset. All of the corresponding PPV values were above 0.8, with the exception of the rat (0.72). Interestingly, NPV values were similar for almost all of the species and the overall dataset; however, the corresponding values were higher for NHP. The analysis of the impact of the ratio between the cost of FP and FN on the resulting thresholds showed a sigmoid-like increase in the threshold value when varying the ratio from 0.0025 to 20 (Fig. 5). However, in some regions the thresholds remained invariant to changes in the FP/FN ratio.

Fig. 2 (a) Pie chart of the distribution of the oral bioavailability data points employed for the analysis by species: mouse (n = 30), rat (n = 122), dog (n = 125) and non-human primates (NHP) (n = 41). (b) Venn diagram of the relationship between oral bioavailability data points for rat, dog and NHP. The area of each circle represents the number of compounds with oral bioavailability data for both the animal species and humans; the areas of the intersections represent the number of compounds with bioavailability data for more than one species.

Visual inspection of the plots in Fig. 6 led to alternative animal thresholds at 28%, 54% and 31% for rat, dog and NHP, respectively (Table III). Even though specificity and sensitivity were balanced for all the species, NPV and PPV showed similar relationships to the previous analysis. Yet, the alternative thresholds showed no improvement in the overall predictability of animal data for the particular species as compared with the thresholds derived from the cost analysis. Youden's index (J) values for the alternative thresholds were lower than the values achieved by the cost analysis thresholds (Tables III and V).

ROC analysis of the reduced dataset of compounds with very low Fhuman was not possible for NHP due to the lack of compounds. Analysis of dog data (n = 11) showed that this preclinical animal model cannot be applied to the categorical prediction of very low bioavailability compounds, since the AUC was not significantly different from the random classification (data not shown). Nevertheless, for rat data (n = 34) ROC analysis was possible (Fig. 7) and the results showed that the rat model provided a significant improvement in the predictions as compared to the random classification (AUC = 0.77; 0.52-0.91, 95% confidence interval (CI)). In this case, the threshold for high/low Fanimal was 7.4%, with a corresponding sensitivity of 0.95 (0.75-1.00, 95% CI) and specificity of 0.67 (0.40-0.88, 95% CI), indicating that rat can be used for predictions of very low values of human bioavailability. Provisional BDDCS classification was assigned to 29 of the compounds in the dataset (Table SI in

DISCUSSION
Several groups have suggested that pointwise correlation of oral bioavailability between preclinical species and human is almost non-existent and, therefore, data from such studies in preclinical species should be treated with caution if the intention is to predict human oral bioavailability quantitatively (4,15,16,36). The current study performed a systematic analysis of Fanimal data from four different species, including 184 compounds across all BDDCS classes. Rather than attempting to predict a particular value for Fhuman employing Fanimal data, the ROC analysis performed here showed that the animal data can be employed for a categorical prediction of high or low human bioavailability. The area under the ROC curve is representative of the probability of correctly classifying Fhuman as high or low when employing Fanimal data: in our analysis the probability was around 80%, which can be considered a high value (30). Even though mouse data showed similar performance, the reduced number of data points employed in the analysis prevents us from drawing any conclusions about the utility of this preclinical model for making this type of prediction. In contrast, the greater abundance of rat and dog data during preclinical development stages was reflected in the smoother ROC curves built for those particular species and the resulting high probabilities for a correct classification employing those species. This finding is in contrast to previous attempts to establish correlations between animal and human oral bioavailability (10,13,15,16,37).
Albeit with a much smaller sample size than rat and dog, NHP showed the best performance of all the species, with a probability for a correct classification close to 100% (AUC = 0.96). The results are consistent with previous findings for this particular species from pointwise correlation analyses (13,16). However, the cost and ethical implications of the use of NHP during preclinical development will limit their use to later stages of the drug development process.
The resulting thresholds for the cost analysis in rat, dog and NHP are consistent with the correlation analysis performed by Musther and co-workers (16). Rat and NHP proved to be underpredictive of human oral bioavailability and, in the same fashion, the thresholds for high/low Fanimal were lower than the threshold for high/low Fhuman. In contrast, the threshold for dog (58%) was relatively close to the human value, which is consistent with the observed trend in the dog to neither underpredict nor overpredict human oral bioavailability. The model achieved high TPR (>0.70) and acceptable FPR (≤0.40) employing the aforementioned thresholds for all the preclinical species. This suggests that it is unlikely to have low oral bioavailability values in human when having high oral bioavailability in any of the preclinical species. When we determined the optimal thresholds by an alternative method, the results were similar to the ones based on cost analysis for rat, dog and NHP. In addition, the new thresholds did not provide any significant improvement in the ability to predict high/low human bioavailability from animal data. The optimal thresholds may also be determined by other alternative methods, such as the Euclidean distance between the ROC curve and the hypothetical perfect classification (i.e., the point (0, 1) in the ROC space) (38). However, this particular method does not take into account the nature of the dataset, unlike the 'cost analysis' determination based on Eq. 4.
Determination of the optimal thresholds according to Eq. 4 will depend upon the assigned net costs of FP and FN. In a conservative fashion, a higher cost can be assigned to the FP and the thresholds will move to a higher value, and the opposite will occur when assigning a higher net cost to FN. However, as shown in Fig. 5, an increase in the cost ratio produces a step-like increase in the thresholds, with regions that are insensitive to the assigned net cost ratios. For example, a two-fold increase in the net cost of a FP compared to the net cost of a FN will increase the threshold for both rat and dog, whereas a three-fold increase will have the same impact on the resulting thresholds as a ten-fold increase. However, moving the thresholds will affect the corresponding sensitivity and specificity (Fig. 6), and thus the thresholds represent the best estimator based on a balanced situation.
Even though rat predictions showed a higher FPR than the rest of the preclinical species, its convenience and availability during preclinical development stages make it one of the most common species employed to generate bioavailability data for new drug candidates. Our threshold for rat predictions of high/low Fhuman was 22%, consistent with similar thresholds (F ≥20-30%) for the evaluation of rat Fanimal during drug development published by different groups in the industrial setting (19,20). In contrast to the current analysis, the previously proposed thresholds fail to provide any indication of the expected range for human bioavailability or of the corresponding TPR and FPR for the rat-based predictions. From the analysis of the reduced dataset for very low human oral bioavailability (≤20%), rat data showed the best performance across species for prediction of Fhuman in this range, with a probability for a correct prediction of high/low Fhuman of 77%. The threshold (7.4%) in this case was close to the human threshold (10%), with a high TPR of 0.95 and a small decrease in the FPR to 0.33, further confirming the suitability of rat data for qualitative predictions of Fhuman during drug development. Nevertheless, the shape of the ROC curve (Fig. 7) suggests that more data are needed to validate this conclusion. The dataset employed for the generation of this predictive model was based upon successful candidates and hence a relatively high prevalence (58%) of compounds with high Fhuman values was observed. It would be of interest to test this model with a dataset reflective of the "true" drug development process. However, we do not expect a large variation in either the rat threshold or the resulting TNR and TPR, as our threshold was in line with the ones employed in the development setting and the latter are insensitive to the prevalence of high Fhuman values (33).

Fig. 6 Sensitivity and specificity as a function of Fanimal thresholds for rat (a), dog (b) and NHP (c). An increase in the threshold will increase the specificity but at the same time will decrease the sensitivity; the thresholds closest to the intersection of the two lines were chosen as alternative thresholds. Red line and cross (-+-), specificity; blue line and asterisk (-*-), sensitivity.
ROC analysis for the evaluation of a predictive model of human bioavailability has been applied before. Langdon and Barret (39) developed a model for the prediction of oral bioavailability in both rat and human based on QSAR and genetic programming (GP). The evaluation of that model was based on its ability to predict high (F ≥33%) and low (F <33%) bioavailability in both human and rat. Interestingly, TNR and FNR were in agreement with the results of our study.
The BDDCS distribution of the drugs investigated in our analysis was in agreement with the observed distribution for marketed drugs, where Classes 1 and 3 are the predominant classes. Nevertheless, a different distribution can be expected for new drug candidates, with a higher tendency for Class 2 and 3 drugs (23,25). However, our analysis of the outcome of the classification model showed no clear tendency for any particular BDDCS class. One reason for this observation may be that the BDDCS is a categorical tool with no clear distinction between overall metabolism and first-pass metabolism. In addition, the binary nature of both the BCS and the BDDCS does not account for the continuity of the properties employed within them (e.g., solubility, intestinal permeability and/or extent of metabolism), which might be an issue for the middle range of those properties and may therefore affect the actual class distribution. Finally, interspecies differences in metabolic activity are not accounted for by the existing BDDCS, and therefore the class assigned to a particular drug could be different in a particular preclinical species.
Even though our methodology was applied to animal oral bioavailability data, a similar approach could be applied to its constituents (fa, FG and FH), which might be of more interest for some groups involved in preclinical development. However, we suggest that a more mechanistic approach should be attempted if the intention is to quantitatively predict a drug's behaviour in man, taking into consideration the physiological differences between the preclinical species and humans in addition to the drug's intrinsic characteristics that can impact on its oral bioavailability and its individual components.

CONCLUSION
A new method for the categorical prediction of high and low human bioavailability from animal bioavailability data was developed by employing ROC analysis. Oral bioavailability of more than 50% in animals can successfully predict high oral bioavailability in human, with high TPR and low FPR. A similar scenario can be expected for bioavailability values equal to or greater than 22%, 58% and 35% in rat, dog and NHP, respectively. Even though NHP was the best predictor of Fhuman, rat was shown to be the best predictor of low human oral bioavailability, supporting the use of this animal model for Fhuman predictions in the early to mid stages of drug development. The thresholds proposed in the current study can be employed in the pharmaceutical industry as part of the tool box of methods for making decisions related to planning necessary studies during the development of new drug candidates and lead selection.