QEX: target-specific druglikeness filter enhances ligand-based virtual screening

Druglikeness is a useful concept for screening drug candidate compounds. We developed QEX, which is a new druglikeness index specific to individual targets. QEX is an improvement of the quantitative estimate of druglikeness (QED) method, which is a popular quantitative evaluation method of druglikeness proposed by Bickerton et al. QEX models the physicochemical properties of compounds that act on each target protein based on the concept of QED modeling physicochemical properties from information on US Food and Drug Administration-approved drugs. The result of the evaluation of PubChem assay data revealed that QEX showed better performance than the original QED did (the area under the curve value of the receiver operating characteristic curve improved by 0.069-0.236). We also present the c-Src inhibitor filtering results of the QEX constructed using Src family kinase inhibitors as a case study. QEX distinguished the inhibitors and non-inhibitors better than QED did. QEX works efficiently even when datasets of inactive compounds are unavailable. If both active and inactive compounds are present, QEX can be used as an initial filter to enhance the screening ability of conventional ligand-based virtual screenings. Electronic supplementary material The online version of this article (10.1007/s11030-018-9842-3) contains supplementary material, which is available to authorized users.


Introduction
Drug molecules are known to share similar physicochemical properties. Molecules possessing these properties are called druglike. Druglikeness is a useful and simple criterion to screen potential drug molecules. The most popular method for evaluating druglikeness is Lipinski's rule (rule of five, RO5) [1], a rule of thumb focusing on orally administered drugs. The rule consists of criteria related to the following four properties: • The number of hydrogen bond acceptors (HBA) is no more than 5. • The number of hydrogen bond donors (HBD) is no more than 5. • The molecular weight (MW) is less than 500.
• The calculated value of the logarithm of the octanol-water partition coefficient (CLogP) is less than 5.
Molecules that fulfill all the criteria are determined to possess druglikeness. The outcome of Lipinski's rule is basically binary, i.e., whether druglike or not, while the number of fulfilled criteria can be used as a multistage evaluation of druglikeness. In contrast, the quantitative estimate of druglikeness (QED) [2] proposed by Bickerton et al. [3] provides continuous scores of druglikeness. QED is based on eight properties: HBA, HBD, MW (which also appear in RO5), LogP value estimated using the Ghose-Crippen method [3] (ALogP), molecular polar surface area (PSA), as well as numbers of rotatable bonds (ROTB), aromatic rings (AROM), and structural alerts [4] (ALERTS). A QED score is calculated using the geometric mean of desirability functions [5], each of which corresponds to an individual property. The functions are modeled as asymmetric sigmoidal functions and fitted to the histogram of a corresponding physicochemical property of the oral drug. Since each function is adjusted to a maximum value of 1, the QED score is also between 0 and 1. Consequently, a higher QED score indicates the compounds are more favorable as a drug.
After QED, Yusof and Segal [6] developed another quantitative estimation method called relative drug likelihood (RDL). While the QED method is based on similarity to known drugs, RDL focuses on differences between drug and non-drug compounds. The RDL score is calculated as the geometric mean of relative likelihoods instead of the simple desirability functions. Therefore, the relative likelihood is a ratio of the posterior probability of a compound being a drug to that of not being one, which is derived from Bayes' theorem.
The QED approach assesses the similarity of a substance to known US Food and Drug Administration (FDA)approved drugs, i.e., 771 drugs curated by its authors, which constitute a heterogeneous mixture of drugs. Favorable properties for drugs depend on the characteristics of the target protein. For instance, the MW of a drug is thought to be affected not only by the constraint to maintain permeability but also by the volume of the binding pocket. Indeed, distributions of QED scores of drugs vary depending on their targets [2], suggesting that QED is not an optimal method for every target and there is room for improvement of at least some targets. Therefore, we proposed a target-specific QED, named QEX, which specifically screens drug candidates directed at particular targets. Although RDL has been shown to apply to specific objectives such as screening of orally administrated G protein-coupled receptor (GPCR) inhibitors, a dataset of inactive compounds required by modeling of RDL is not necessarily available in other cases. In contrast, since QEX and the original QED can be modeled with only the active compound, it can be used even in cases lacking inactive compounds. In this study, the effectiveness of QEX was examined using several targets, in comparison with the original QED.

QEX outperforms the original QED for individual targets
The screening ability of QEX was examined by crossvalidating five targets. The benchmark screening scores of QEX, the original QED, and Lipinski's RO5 are shown in Table 1a, b, and Table S1 in Supplementary Material 1, respectively. The QEX performed better than the original QED and RO5 did in every case shown in Table 1 and Table  S1. As shown in Table S2 and Figure S1, RO5 passed most of the compounds, thus showing poor screening ability. The result indicates that QEX has an advantage in being able to screen active compounds for a specific target.
The properties that showed peaks of distribution are shown in Table 2. Theoretically, they indicate the ideal values of the ability of each physicochemical property to inhibit a corresponding target, because compounds possessing that property are the most frequent in datasets of known inhibitors. Thus, these properties are assumed to reflect the nature of the target protein, especially the inhibitor-binding pocket. In addition, peak values of the original QED but not  QEX are also supposed to reflect absorption, distribution, metabolism, excretion, and toxicity (ADMET) because the original QED is trained with FDA-approved oral drugs. For instance, the peak value of LogP of the original QED model is lower than that of any QEX model in Table 2, suggesting that low lipophilicity and high hydrophilicity are important for orally absorbed drugs. Then, it can be assumed that the original QED and QEX have different roles in the process of drug discovery. An advantage of QEX is that its model is only trained with a dataset of active compounds. In other words, QEX does not require a dataset of inactive compounds, which are often difficult to obtain in large numbers from public databases [7,8]. If the examples of inactive compounds provided are insufficient, the performance of the machine learning classifier would be worse. In that situation, QEX could be more effective than the machine learning method is. Compared to the original QED oriented to oral drugs, QEX is suitable for screening lead compounds acting on a specific target.

Application to c-Src inhibitor screening
To further assess the QEX, we used it to screen for c-Src inhibitors. c-Src is a tyrosine kinase, and many cellular processes are driven by the activation and inactivation of protein tyrosine kinases through phosphorylation. The interplay of c-Src and other proteins has been widely studied, and its role in pluripotent embryonic stem cells has also been reported [9]. Thus, inhibitors of c-Src have been identified using computational and experimental techniques [10].
For the compound screening, a known inhibitor library was obtained from Chiba et al. [11], and a QEX model for c-Src was built using these inhibitors. Then, we applied our QEX and the QED models to three popular c-Src inhibitors, PP2, gefitinib, and sunitinib, and three noninhibitors, oseltamivir, aspirin, and arginine. The resulting QEX and QED scores (Table 3) show that the QEX model distinguished the inhibitors and non-inhibitors better than the QED model did. In particular, the QEX model showed a low score not only for arginine with its low druglikeness, but also for oseltamivir and aspirin, which are not c-Src inhibitors.

Conclusions
QEX was better suited for screening inhibitors of specific targets than the original QED. QEX is easy to use when datasets of inactive compounds are not available. If both active and inactive compounds are available, QEX can be used as an initial filter to enhance conventional ligand-based virtual screenings.
The concept of QEX could be expanded beyond determination of druglikeness. Indeed, Limmer and Burken [12] developed desirability functions to describe chemical transport across plant root-soil boundaries based on the concept of QED, which are called the quantitative estimates of plant translocation (QEPT) [12]. Thus, it is possible to targetspecific QEPT using numerous data entries, which could contribute to phytoremediation efforts and herbicide design. Finally, this topic is one of our proposed future studies.

Calculation of QEX and original QED values
A QEX was calculated using a procedure that was basically identical to that of the original QED, except that the QEX was modeled using compounds targeting a particular protein or protein family. The algorithm used is briefly described below. In the initial modeling steps, each of the eight physicochemical properties (MW, ALogP, HBD, HBA, PSA, ROTB, AROM, and ALERTS) was computed from a dataset of known active compounds using RDKit [13] version 2015.03.1. Then, a histogram of each property was constructed and was fitted to the asymmetric double sigmoidal function Q(x) shown in Eq. (1) by implementing the Levenberg-Marquardt algorithm in SciPy [14].
All the fitted functions (Q MW (x), Q ALogP (x), . . .) were divided by the maximum values so as to adjust the maximum divided function to 1. The divided functionQ i (x) (i ∈ C, C {MW, ALogP, HBD, HBA, ROTB, AROM, ALERTS}) was used as the desirability function, and a QEX score was assigned as the weighted geometric mean of all desirability functions as shown in Eq. (2).
All the eight weights were exhaustively tried from 0 to 1 in increments of 0.25, and the mean of the 1000 weight combinations which provided the highest Shannon entropy was adopted here. A Shannon entropy of a model was calculated as shown in Eq. (3).
where n is the number of compounds used for modeling. The original QED values in this study were also calculated using the same implementation used for the QEX but were modeled using 771 FDA-approved drugs curated by Bickerton et al. [2] (Supplementary Material 2).

Dataset
All assayed compound data for the five target proteins were obtained from PubChem [15]. Table 4 shows each target as well as the numbers of active (positive) and inactive (negative) compounds. All compound structure data can be downloaded in SDF (structure data file) format in Supplementary material 3, 5, 7, 9, and 11. Their label information is in Supplementary material 4, 6, 8, 10, and 12. Building the QEX model only requires active compounds while inactive compounds were used only for evaluating the prediction performance of RO5, QED, and QEX.

Validation of druglikeness screening performance
The AUC of the ROC [16] and early EF [17] are considered in evaluating screening performances, which are generally used in virtual screening studies. In this study, we used a  list of experimentally verified active and inactive compounds (positive and negative samples, respectively). These positives and negatives were further categorized as true or false according to their rank above or below a certain threshold of the QEX and QED filtering results. Therefore, the actives ranked above a chosen threshold were considered true positives. In contrast, RO5 rankings are based on the number of rules passed. To generate the ROC curve, the true positive ratio (TPR TP/(TP + FN)) and false positive ratio (FPR FP/(TN + FP)) were calculated, where TP, FP, FN, and TN are the number of true positives, false positives, false negatives, and true negatives, respectively. In the ROC curve, the TPR was plotted as a function of the FPR. The AUC was then calculated to assess the quantitative performance of different QEX and QED models. An AUC of 0.5 corresponded to a random selection of the compounds using the target. The EF (x%) value indicates how much more often an active compound is ranked in the top x% of a screening result than it is randomly selected, i.e., the times the dataset is enriched. Specifically, the EF was calculated using Eq. (4): where n x% exp is the number of experimentally verified actives in the top x% of the database and N is the total number of actives in the database. In this study, EF (1%), EF (2%), EF (5%), EF (10%), EF (20%), and EF (50%) were calculated from the top 1, 2, 5, 10, 20, and 50% of the screening results, respectively.
Learning and evaluation of the QEX model function were performed using 5-fold cross-validation. Specifically, the active compounds were divided into five subsets, and the parameters of the fitting functions were determined using four of the five subsets, and the AUC and EF of the remaining subset were obtained. In addition, the QED model, which was constructed in advance using 771 FDA-approved drugs, was also applied to the same subset. The AUC and EF values shown in Table 1 were the average of five validations obtained from five subsets. An overview of the dataset and the validation method is shown in Fig. 1.

Application to c-Src inhibitor screening
Experimentally determined inhibitors of Src family kinases were obtained to construct a Src-specific QEX model for major c-Src inhibitors and irrelevant compounds, which was then compared with the QED model. Inhibitors of Src family kinases were published by Chiba et al. [11,18] through the second computer-aided drug discovery contest of the Initiative for Parallel Bioinformatics (IPAB) [19]. The target Src family consists of ten proteins shown in Table 5. They were extracted using ChEMBL version 19 [21] and BindingDB [22]. The extraction criteria were as follows: half-maximal inhibitory concentration (IC 50 ) < 10 µmol L −1 , K i < 10 µmol L −1 , K d < 10 µmol L −1 , and inhibition rates > 30%, whereas the experimental conditions were not considered. Finally, 3528 unique compounds were identified. They are available in Supplementary material 13 (Src_inhibitors.sdf) and can be obtained from the IPAB Web site [19].