Data Curation
Data from Obach et al. (11) were used for modeling. This is a collection of human in vivo fu data (670 compounds), mostly for drugs, retrieved from the literature. Compounds without experimental values, or with values expressed as a range, were eliminated. SMILES were retrieved automatically using the chemical name and the Chemical Abstracts Service (CAS) number as identifiers. JChem and Chemcell (12) were used for retrieving SMILES. Compounds with missing SMILES, or with incongruences between the two sources, were discarded.
Chemicals were neutralized and counter-ions were eliminated. Substances with ambiguous information, metal complexes and inorganic compounds were also removed. After this cleaning process, the final dataset comprised 512 compounds.
The first issue to face was the skewness (γ1) of the dataset: the distribution of experimental values was shifted toward low values. A significant part of the dataset consisted of compounds highly bound to proteins, with fu values between 0 and 0.1 (see Fig. 1). The first bar of the histogram in the upper part of Fig. 1 is much higher than the others, and compounds in this activity range are usually those with a narrower therapeutic index. In order to derive a model able to discriminate small differences in activity, and to obtain a distribution more suitable for modeling, we applied two different endpoint transformations.
The first transformation is a pseudo equilibrium constant (3,5,6,14) expressed as in Eq. 1:
$$ \mathrm{logK}=\log \left(\frac{1- fu}{fu}\right) $$
(1)
Since Eq. 1 is undefined when fu is equal to 100%, logK is arbitrarily set at 2 in that case.
The second transformation is the square-root of fu (\( \sqrt{fu} \)).
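The two transformations can be written as a minimal sketch (assuming fu is expressed as a fraction between 0 and 1, and using the logK = 2 convention described above; function names are illustrative):

```python
import math

def log_k(fu):
    """Pseudo equilibrium constant logK = log10((1 - fu) / fu)  (Eq. 1).

    When fu = 1 (100% unbound) the numerator is zero and the logarithm
    is undefined, so logK is arbitrarily set at 2 as described in the text.
    """
    if fu >= 1.0:
        return 2.0
    return math.log10((1.0 - fu) / fu)

def sqrt_fu(fu):
    """Square-root transformation of the fraction unbound."""
    return math.sqrt(fu)
```

Both transformations stretch the crowded low-fu region of the distribution, which is what reduces the skewness reported in Fig. 1.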
Figure 1 shows the distributions of values before and after the transformations, with the corresponding γ1 values. As expected, logK and √fu had less skewed distributions, making them more suitable for modeling than the original fu data.
Model Derivation
We used two approaches to obtain QSAR models for PPB. The first applies machine learning algorithms to molecular descriptors based on the chemical features of the compounds. The second uses the CORAL (IRFMN, 2017) software, which implements an algorithm that extracts descriptors from the SMILES string.
Calculation of Molecular Descriptors
The main (de)protonated form of each molecule in the dataset at physiological blood pH (7.4) was determined with JChem (15), and the SMILES were modified accordingly. Dragon 7.0 (16) was used to calculate 2D molecular descriptors. Dragon was unable to calculate several descriptors for 23 compounds. Given the importance of some of these descriptors (for instance AlogP), we decided to exclude these compounds rather than reduce the number of predictors of the model.
Many of the Dragon descriptors are likely to be redundant or uninformative, adding uncertainty to the model and lowering its effectiveness (17), besides increasing the computational time needed. Although some models are naturally resistant to non-informative predictors, reducing the input space is an important step in model derivation. For this reason, descriptors with constant values (standard deviation of 0) and descriptors correlating above 95% (Pearson correlation coefficient) with another descriptor were rejected. Variable selection was then applied using a random-forest-based approach as implemented in the VSURF (18) package for R. It proceeds in three steps. The first iterates a series of random forests, calculates variable importance (based on a permutation score), and eliminates the variables that fall below a user-defined threshold. The second step (interpretation) finds important descriptors closely related to the response variable, and the third step (prediction) identifies the smallest model leading to a good prediction of the response variable.
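The pre-filtering step (constant and highly correlated descriptors) can be sketched as follows; this is an illustrative implementation with hypothetical data, not the actual Dragon/VSURF pipeline:

```python
from statistics import pstdev

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def filter_descriptors(table, threshold=0.95):
    """Drop constant columns, then columns correlating above `threshold`
    (in absolute value) with an already retained column.

    `table` maps descriptor name -> list of values per compound."""
    kept = {}
    for name, col in table.items():
        if pstdev(col) == 0:  # constant descriptor, no information
            continue
        if any(abs(pearson(col, kcol)) > threshold for kcol in kept.values()):
            continue          # redundant with a retained descriptor
        kept[name] = col
    return kept
```

Only after this coarse reduction would the more expensive random-forest-based selection be run on the surviving descriptors.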
As the ionization state is important in determining PPB, local models for specific protonation states (acids, bases, neutral chemicals and zwitterions) were also derived. We used ACD/labs 12.0 to calculate the concentration of the (de)protonated forms at pH 7.4. If a molecule is more than 10% in the acid or basic state, it is flagged as an acid or a base; if it is more than 10% in both the acid and base ionization states, it is considered a zwitterion. Neutral substances have more than 90% of their concentration in the neutral state. The number of chemicals in each dataset is shown in Table I.
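The flagging rules above can be expressed as a small helper (a sketch; in the actual workflow the percentages came from ACD/labs):

```python
def ionization_class(acid_pct, base_pct):
    """Classify a molecule from the percentage of its concentration in
    the acid and base ionization states at pH 7.4.

    >10% in both states -> zwitterion; >10% in one state -> acid or
    base; otherwise (>90% neutral) -> neutral.
    """
    if acid_pct > 10 and base_pct > 10:
        return "zwitterion"
    if acid_pct > 10:
        return "acid"
    if base_pct > 10:
        return "base"
    return "neutral"
```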
Table I Compounds in Each Dataset for Specific Ionization States

When addressing the four subsets with specified ionization states, the neutral form of the molecule was used to calculate the Dragon descriptors (since the ionization state was homogeneous within each subset). For this reason, all compounds could be retained for the local models.
When modeling the sub-datasets, the square root of the fraction unbound gave better performance, so only these results are shown.
Data Splitting
For model derivation, the dataset was divided into a Training Set (TS) and an External Validation Set (EVS) with a ratio of 80:20. The number of compounds in each set is shown in Table II. In order to ensure a uniform distribution of the endpoint values in the two subsets, we applied an activity-sampling method: the dataset was binned into five equal-sized portions based on fixed ranges of experimental values, and each bin was then split 80:20 between the TS and the EVS.
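The activity-sampling split can be sketched as follows (illustrative only; the bin edges follow the fixed-range description in the text, while the random seed is an arbitrary assumption):

```python
import random

def stratified_split(values, n_bins=5, train_frac=0.8, seed=0):
    """Split compound indices 80:20 into TS and EVS, binning first on
    fixed ranges of the endpoint so both sets span the whole range.

    `values` is the list of experimental endpoint values; returns
    (train_indices, test_indices)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    bins = [[] for _ in range(n_bins)]
    for i, v in enumerate(values):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum
        bins[b].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for b in bins:
        rng.shuffle(b)
        cut = round(len(b) * train_frac)
        train += b[:cut]
        test += b[cut:]
    return train, test
```

Splitting bin by bin, rather than the whole dataset at once, is what guarantees that the EVS contains compounds from every activity range.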
Table II Numerosity of the Splits for Each Dataset and Number of Descriptors Selected

Model Training
After VSURF variable selection, a Random Forest (19) algorithm, as implemented in KNIME (20), was applied for model derivation. Data sampling for each tree was done with replacement, and the default number of randomly chosen descriptors at each split was set to the square root of the initial number of descriptors; the descriptors therefore differ from tree to tree.
Applicability Domain
The AD of a QSAR model is defined as “the physico-chemical, structural, or biological space, knowledge or information on which the TS of the model has been developed, and for which it is applicable to make predictions for new compounds […]. Ideally, the QSAR should only be used to make predictions within that domain by interpolation not extrapolation” (21).
Since there is no universally accepted method to define the AD (21–23), a series of approaches was applied. Results were evaluated in terms of the gain in performance resulting from the removal of predictions outside the AD, and of the coverage (percentage of chemicals retained after the application of a given AD method) (Table III).
Table III Methods Chosen for Defining the AD, Brief Description and Reference

SMILES-Based Descriptors Model Derivation (CORAL)
The optimal descriptors calculated with the CORAL (http://www.insilico.eu/coral/) software are attributes extracted by parsing the molecule's SMILES notation. The most important preprocessing step in this case is the correct normalization of the SMILES notation, because the algorithm works by recognizing recurrent patterns (particular characters or combinations of them) in the SMILES (32–34). To obtain a good standardization of patterns, the SMILES notation was canonicalized with ACD/labs (35). The possible SMILES attributes are listed in Table IV.
Table IV SMILES Attributes and their Description

The TS used for modeling with the Dragon approach was further divided into three sets: a TS of 108 compounds, an Invisible Training Set (ITS) of 140 compounds, and a Calibration Set (CS) of 143 compounds. The validation set is identical to the EVS used with the Dragon descriptor-based models.
The endpoint is calculated as in Eq. 2:
$$ \mathrm{Endpoint}={\mathrm{C}}_0+{\mathrm{C}}_1\ \mathrm{DCW}\left(\mathrm{T},\mathrm{N}\right) $$
(2)
C0 and C1 are the intercept and slope of Eq. 2, and DCW(T, N) is the combination of SMILES-based attributes, each associated with a correlation weight (CW), as described in Eq. 3. The correlation weights are optimized with the Monte Carlo method for a given number of iterations (N), providing the CWs that maximize the target function of Eq. 4, i.e. the correlation between the descriptor and the selected endpoint.
$$ \mathrm{DCW}\left(\mathrm{T},\mathrm{N}\right)= CW(HARD)+\sum CW\left({S}_k\right)+\sum CW\left({SS}_k\right)+\sum CW\left({SSS}_k\right) $$
(3)
CW(HARD) is the correlation weight of the global attribute HARD (see Table IV).
Sk is a SMILES atom (i.e. a single symbol, or two symbols which cannot be examined separately, e.g. ‘Cl’, ‘Br’, etc.); SSk is a combination of two SMILES atoms; SSSk is a combination of three SMILES atoms. CW(Sk), CW(SSk) and CW(SSSk) are the correlation weights of these SMILES attributes, calculated by the Monte Carlo method. The optimization maximizes the target function (TF), calculated as in Eq. 4:
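A simplified sketch of the DCW computation of Eq. 3: the correlation weights here are hypothetical, and the tokenizer only treats two-letter halogens as single SMILES atoms; the real CORAL implementation handles more attribute types and is far more elaborate:

```python
def tokenize(smiles):
    """Split a SMILES string into 'SMILES atoms' (Sk): single symbols,
    except two-character tokens such as 'Cl' and 'Br'."""
    atoms, i = [], 0
    while i < len(smiles):
        if smiles[i:i + 2] in ("Cl", "Br"):
            atoms.append(smiles[i:i + 2])
            i += 2
        else:
            atoms.append(smiles[i])
            i += 1
    return atoms

def dcw(smiles, cw, cw_hard=0.0):
    """DCW = CW(HARD) + sum CW(Sk) + sum CW(SSk) + sum CW(SSSk)  (Eq. 3).

    `cw` maps an attribute (tuple of 1, 2 or 3 consecutive SMILES atoms)
    to its correlation weight; attributes without a weight contribute 0."""
    atoms = tokenize(smiles)
    total = cw_hard
    for size in (1, 2, 3):  # Sk, SSk, SSSk
        for j in range(len(atoms) - size + 1):
            total += cw.get(tuple(atoms[j:j + size]), 0.0)
    return total
```

In CORAL the table of weights (`cw` here) is what the Monte Carlo optimization adjusts; the descriptor itself is just this weighted sum over the attributes present in the SMILES.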
$$ TF=R+{R}^{\prime }-\left|R-{R}^{\prime}\right|+ IIC\ast 1.1 $$
(4)
R and R’ are the correlation coefficients between experimental and predicted values of the endpoint for the TS and the ITS, respectively. IIC is the Index of Ideality of Correlation described in the literature (37,38). Attributes with a positive CW are considered promoters of an increase of the endpoint value, and those with a negative CW promoters of a decrease. CORAL has an in-house AD evaluation: only compounds whose SMILES attributes were selected for model derivation are considered inside the AD. Predictions for chemicals outside the model AD are considered unreliable, carry greater uncertainty, and are excluded from the evaluation of the performance (39).
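Eq. 4 in code form (a direct transcription; R, R’ and IIC would come from the TS, the ITS and the IIC calculation, respectively):

```python
def target_function(r_ts, r_its, iic):
    """TF = R + R' - |R - R'| + IIC * 1.1  (Eq. 4).

    Rewards high correlation on both the TS and the ITS while
    penalizing the gap between them; the IIC term adds a weighted bonus."""
    return r_ts + r_its - abs(r_ts - r_its) + iic * 1.1
```

Note that R + R’ − |R − R’| equals 2·min(R, R’), so the optimization cannot inflate TF by overfitting the TS at the expense of the ITS.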
Statistical Analysis
Performance is evaluated on the basis of the determination coefficient (r2) calculated as shown in Eq. 5.
$$ {\mathrm{r}}^2=1-\frac{\sum {\left({\mathrm{y}}_{\mathrm{i}}-{\hat{\mathrm{y}}}_{\mathrm{i}}\right)}^2}{\sum {\left({\mathrm{y}}_{\mathrm{i}}-\overline{\mathrm{y}}\right)}^2} $$
(5)
where yi is the experimental value of the i-th chemical in the dataset; ŷi is its predicted value; \( \overline{\mathrm{y}} \) is the mean of the experimental values of the compounds in the dataset.
Root Mean Square Error (RMSE) is the square root of the average of the squared differences between prediction and actual observation, as represented in Eq. 6:
$$ \mathrm{RMSE}=\sqrt{\frac{\sum {\left({\hat{\mathrm{y}}}_{\mathrm{i}}-{\mathrm{y}}_{\mathrm{i}}\right)}^2}{\mathrm{N}}} $$
(6)
where yi is the experimental value of the i-th chemical in the dataset; ŷi is the predicted value of the i-th chemical and N is the number of chemicals.
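Eqs. 5 and 6 translate directly into code (a straightforward sketch over lists of experimental and predicted values):

```python
def r_squared(y, y_hat):
    """Determination coefficient r2 (Eq. 5)."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def rmse(y, y_hat):
    """Root Mean Square Error (Eq. 6)."""
    n = len(y)
    return (sum((yhi - yi) ** 2 for yi, yhi in zip(y, y_hat)) / n) ** 0.5
```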
The cross-validated determination coefficient (Q2) has been used for the calculation of statistics in cross-validation.
$$ {\mathrm{Q}}^2=1-\frac{\sum {\left({\mathrm{y}}_{\mathrm{i}}-{\overset{\prime }{\mathrm{y}}}_{\mathrm{i}}\right)}^2}{\sum {\left({\mathrm{y}}_{\mathrm{i}}-\overline{\mathrm{y}}\right)}^2} $$
(7)
\( {\overset{\prime }{\mathrm{y}}}_{\mathrm{i}} \) is the predicted value in cross-validation (40).
For the Dragon models a 5-fold internal cross-validation (5-fold CV) was used, while for the CORAL model Q2 was calculated on the aggregation of the TS, ITS and CS.
Outlier Analysis
A statistical analysis was done in order to check for the possible presence of chemical categories with a large error in prediction. Compounds with absolute error in prediction larger than the mean absolute error (MAE) observed for the whole TS were considered badly predicted (outliers); the remaining compounds were considered correctly predicted.
Chemical categories were defined based on the occurrence in the structures of certain “Functional group count” descriptors calculated by Dragon 7.0 (16). The distribution of outliers in each category was then compared with the distribution of outliers in the entire dataset by a significance test (Fisher’s exact test). This statistic tests the null hypothesis that there is no association between the row variable and the column variable; in this case, the null hypothesis is that the distribution of outliers in a category does not differ significantly from the distribution in the whole dataset. The null hypothesis is rejected when the p-value is less than 0.05.
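The per-category test builds a 2×2 table (outliers vs. non-outliers, in-category vs. rest) and applies Fisher’s exact test. A self-contained two-sided implementation based on the hypergeometric distribution (equivalent, for 2×2 tables, to what statistical packages such as SciPy provide) might look like:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table
        [[a, b],
         [c, d]]
    e.g. a = outliers in the category, b = non-outliers in the category,
    c = outliers outside it, d = non-outliers outside it.

    Sums the hypergeometric probabilities of all tables with the same
    margins that are at most as probable as the observed one."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p_table(k):  # probability of the table with k in the top-left cell
        return comb(row1, k) * comb(n - row1, col1 - k) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    return sum(p_table(k) for k in range(lo, hi + 1)
               if p_table(k) <= p_obs * (1 + 1e-9))
```

A category whose outlier proportion matches the whole dataset gives a p-value close to 1, while a category strongly enriched in outliers gives a small p-value, rejected at the 0.05 threshold used in the text.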
To evaluate the strength of the association, the Likelihood Ratio (LR) was adapted from Ferrari et al. (41) to estimate the statistical relevance of the analyses (Eq. 8):
$$ \mathrm{LR}=\frac{\mathrm{TP}}{\mathrm{FP}}\times \frac{\mathrm{negatives}}{\mathrm{positives}} $$
(8)
TP (true positives) are compounds with a given functional group that are badly predicted, while FP (false positives) are compounds with the same functional group that are correctly predicted. Negatives are the total number of correctly predicted compounds, and positives are the total number of badly predicted compounds.
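Eq. 8 as a helper, using the definitions above:

```python
def likelihood_ratio(tp, fp, negatives, positives):
    """LR = (TP / FP) * (negatives / positives)  (Eq. 8).

    TP: badly predicted compounds carrying the functional group;
    FP: correctly predicted compounds carrying it; negatives/positives:
    total correctly/badly predicted compounds. LR > 1 means the group
    is enriched among the badly predicted compounds."""
    return (tp / fp) * (negatives / positives)
```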
The same procedure was also used to evaluate whether some of the models performed better for certain chemical categories.