Abstract
Estimating protein targets of compounds based on the similarity principle (similar molecules are likely to show comparable bioactivity) is a long-standing strategy in drug research. Having previously quantified this principle, we present here a large-scale evaluation of its predictive power for inferring macromolecular targets by reverse screening an unprecedentedly large external test set of more than 300,000 active small molecules against another bioactivity set of more than 500,000 compounds. We show that machine learning can predict the correct targets, with the highest probability among 2069 proteins, for more than 51% of the external molecules. The strong enrichment thus obtained demonstrates the usefulness of the approach in supporting phenotypic screens, polypharmacology, or repurposing. Moreover, we quantified the impact of the bioactivity knowledge available for proteins in terms of number and diversity of actives. Finally, we advise that developers of such approaches follow an application-oriented benchmarking strategy and use large, high-quality, non-overlapping datasets as provided here.
Introduction
The importance of predicting primary and secondary macromolecular targets of therapeutic compounds was clearly demonstrated by retrospective analyses defining the number of known protein targets for drugs1,2. Underlying concepts, such as polypharmacology, specificity or repurposing, are considered throughout any modern drug R&D project. This also concerns the initial stages of discovery where the number of molecules to evaluate is massive, but the physical samples are scarce, prompting the use of fast yet robust bioinformatic models.
Whereas earlier studies about target and bioactivity prediction had been conducted3,4,5,6, the game-changing work of Shoichet and colleagues7,8,9,10 on ligand-based reverse screening was accompanied by a remarkable experimental effort to confirm about half of the predicted off-target effects of 656 drugs among 73 possible proteins. Since then, a plethora of computational tools, followed by expert opinions, have been released11,12,13. This research area is no exception to the growing penchant for machine-learning methods, which calls for due warnings about black-box and overfitting pitfalls. The scientific output has focused on meticulously comparing machine-learning algorithms with sophisticated stratification of the bioactivity knowledge1,14,15,16,17,18. The actual predictive ability has been strikingly overlooked, probably due to the difficulty of constructing appropriate external test sets.
To address this methodological shortcoming, we propose the first assessment of the predictive power of ligand-based reverse screening for the estimation of small molecule drug targets with a vast, diverse, curated, external bioactivity dataset.
Results and discussion
Training
The target prediction engine evaluated here is a logistic model combining shape and chemical similarity19 and trained on data curated from the ChEMBL database20. This method relies on the Similarity Principle, which was quantified by us previously for several molecular descriptors (including ES5D vectors and FP2 fingerprints, see below)21. The robustness of this machine-learning model was scrupulously confirmed by cross-validation several times19 and was recognized by peers as one of the most carefully statistically validated methods in the field15.
Here, ChEMBL was mined to obtain the training data comprising 501,959 compounds showing experimental bioactivity against 3669 protein targets (see Methods, Data extraction, Supplementary Fig. 1a).
For each of the 501,959 compounds, the three-dimensional shape and projection of physicochemical properties were translated into twenty 18-dimension float vectors following the ElectroShape approach (ES5D vectors)22. In addition, the chemical structure of each compound was encoded as one 1024-bit binary vector (FP2 fingerprints) (Supplementary Fig. 1b)23. Pair-wise comparisons between all compounds produced the 3D-Score matrix with Manhattan-based similarity values of ES5D vectors (for the closest of 20 conformations), and the 2D-Score matrix with Tanimoto coefficients of FP2 fingerprints (see Methods, Chemoinformatics, Supplementary Fig. 1c).
To address the variation in contributions of these descriptors in the regression with molecular size19, 51 subsets were created, each corresponding to a given number of heavy atoms in the first (“query”) molecule of every pair (see Methods, training methodology). For each subset, a binary logistic model was trained to find the best constant C and coefficients (c1, c2) for the regression features (3D-Score and 2D-Score) (Supplementary Fig. 1d). To reduce noise from training, the final coefficients for calculating the probability of predictions were obtained by fitting the C, c1 and c2 curves via a third-degree polynomial function (Supplementary Fig. 2a).
The high internal classification ability, measured by the 10-fold cross-validated Matthews correlation coefficient for each of the 51 size-related subsets (MCCcv, see Methods, training methodology, Supplementary Fig. 2b, Supplementary Table 1), confirmed the robustness of the approach, as established several times19,24,25. The lower MCC, precision and recall for the classes with fewer heavy atoms have already been observed and related to the poorer protein specificity of very small ligands26 and partly to less populated classes (Supplementary Fig. 2c).
External validation
The output of the regression model is an unbiased computed probability. Accordingly, we established a strategy to assess the predictive ability that reflects the applicative scope, i.e. reverse screening to predict the most probable protein targets for as many active compounds as possible. The rank of experimental targets in a list of predicted proteins ordered by calculated probability was recorded. Notably, for this reverse screening evaluation exercise, the calculated probability values are used only as a scoring scale to rank the predicted targets.
The mining of Reaxys® enabled the construction of an unprecedentedly vast test set27. Applying filters comparable to those used for the training set, we retrieved high-quality data for 364,201 small molecules, not included in the ChEMBL training set, yet active on 1180 human proteins shared with ChEMBL (see Methods, Data extraction).
This external test set was reverse-screened against the fraction of the ChEMBL set active on human proteins (i.e. the screening set, see Methods, Testing strategy). The ES5D vectors and FP2 fingerprints of each of the 364,201 test compounds (Fig. 1a) were compared to all 405,544 compounds of the screening set to find the most similar known actives on each of the 2069 ChEMBL human targets in terms of shape and chemical structure (Fig. 1b). For each protein target, the highest Manhattan-based similarity value and Tanimoto coefficient were entered into the logistic equation as 3D-Score and 2D-Score features, respectively (Fig. 1c). By using the coefficients (c1, c2 and C) corresponding to the number of heavy atoms in the test compound, the probability was calculated for all 2069 proteins to rank them from most probable to least probable targets (Fig. 1d).
Physicochemical and chemical spaces
To ensure that the applicability domain of the model is respected, and that the validation exercise matches with the “real-life” application—finding probable targets for medicinal-chemistry-oriented bioactive small molecular compounds—the respective physicochemical spaces covered by the training set and the test set were compared.
The distributions of seven molecular and physicochemical descriptors for both sets are depicted in Supplementary Fig. 3, as a function of the number of molecules, and as per percentage of total set (see Physicochemical description in the Methods section). These indicate the clear overlap of the two molecular sets in every descriptor dimension. More precisely and as shown in Table 1, the distributions of lipophilicity, saturation, flexibility, apparent polarity, hydrogen-bonding capacity and size are very similar between the training set and the test set. This is quantified by very negative Z-factors28 for n-octanol/water partition coefficient (WLOGP), the fraction of sp3 carbon (fCsp3), the number of rotatable bonds, the polar surface area (TPSA), and the number of hydrogen-bond acceptors (HBA) and donors (HBD), as well as for molecular weight (MW). This confirms that the test set falls in the applicability domain of the predictive model with very comparable physicochemical spaces covered by both extensive molecular sets.
Besides physicochemical space, the chemical diversity of the two sets was measured by two different types of molecular scaffolds (see Scaffold computation in the Methods section). According to the Murcko wire-like frameworks29, the 501,959 training molecules are described by 25,046 scaffolds, and the 364,201 test molecules by 21,820 scaffolds. As per the more abstract Oprea approach30, the training molecules are described by 38,896 scaffolds, and the test molecules by 33,754 scaffolds. In relative terms and for both definitions, the test set is more chemically diverse, with an average of 16.7 and 10.8 molecules per Murcko and Oprea scaffold, respectively, compared to 20.0 and 12.9 molecules per Murcko and Oprea scaffold for the training set.
Moreover, the common scaffolds between both sets are 10,317 Murcko scaffolds (41.2% of the training set and 47.3% of the test set, Supplementary Fig. 4a) and 15,004 Oprea scaffolds (38.6% of the training set and 44.5% for the test set, Supplementary Fig. 4b). With less than half of both sets overlapping according to two different molecular scaffold definitions, the training and the test sets can be considered as chemically distinct from each other.
Furthermore, by considering the scaffolds from the test set that do not describe any training compound, 11,503 unique distinct Murcko scaffolds can be extracted from 48,001 test molecules (13.2%), and 18,750 unique distinct Oprea scaffolds from 67,554 test molecules (18.5%). Finally, 32,748 test molecules (termed as the Distinct test set and representing 9.0% of the entire test set, see Fig. 2a) can be considered strictly chemically distinct from the training set, since they are described by Murcko and Oprea scaffolds, which do not describe any training compound. This indicates further the relevance of building a vast external test set from a distinct source. Even when applying strict criteria for molecular diversity (unrelated to the model itself, as here, two orthogonal definitions of scaffolds), the predictive ability assessment can be also performed on numerous external compounds objectively defined as chemically distinct from the training data (see section Predictive ability).
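The scaffold statistics quoted in this section follow directly from the reported counts. As a quick arithmetic check, here is a minimal Python sketch; the helper name `summarize` and the dictionary layout are illustrative only and are not part of the original workflow.

```python
# Counts reported in the text for the training (ChEMBL) and test (Reaxys) sets.
TRAIN_MOLS, TEST_MOLS = 501_959, 364_201
MURCKO = {"train": 25_046, "test": 21_820, "shared": 10_317}
OPREA = {"train": 38_896, "test": 33_754, "shared": 15_004}


def summarize(scaffolds):
    """Derive diversity and overlap figures from raw scaffold counts."""
    return {
        "mols_per_scaffold_train": TRAIN_MOLS / scaffolds["train"],
        "mols_per_scaffold_test": TEST_MOLS / scaffolds["test"],
        "shared_pct_of_train": 100 * scaffolds["shared"] / scaffolds["train"],
        "shared_pct_of_test": 100 * scaffolds["shared"] / scaffolds["test"],
        # Scaffolds found in the test set but in no training compound.
        "test_only_scaffolds": scaffolds["test"] - scaffolds["shared"],
    }


for name, counts in (("Murcko", MURCKO), ("Oprea", OPREA)):
    stats = summarize(counts)
    print(name, {key: round(value, 1) for key, value in stats.items()})
```

A lower number of molecules per scaffold indicates a more diverse set, which is why the test set is described as the more diverse of the two.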
Notably, the bioactivity data at the root of the external test set were severely constrained in many dimensions. The most drastic reductions took place in the bioactivity spectrum and the chemical space. The former concerns the selection of targets shared with the training set only (1180 among the more than 14,000 human proteins targeted by at least one small molecular compound available in Reaxys version 03.2019) and the latter the selection of external actives not part of the training set (364,201 compounds among the almost 7 million small molecules with bioactivity data on a well-defined target). Despite these radical reduction measures, the external test set remains large as well as chemically diverse and distinct from the training molecules.
These chemical and physicochemical examinations demonstrate the relevance of the evaluation exercise, core of this study, with an external test set that is not only large, chemically diverse and distinct from the training set but also that falls in the applicability domain of the logistic model according to seven molecular and physicochemical properties. The predictive ability assessment strategy proposed reproduces the real objective of the reverse screening methodology, i.e. finding the probable protein targets of bioactive small molecular compounds, in the context of drug discovery and medicinal chemistry.
Predictive ability
The global predictive ability can be quantified by the success in retrieving, by reverse-screening, one of the experimental targets of bioactive query molecules among the predicted most probable ones. Remarkably, for 51.2% of the test compounds, the predicted protein with the highest probability was indeed a validated target (Fig. 3a). This predictive capacity is considerably higher than the 0.1% expected from a random ranking. The success becomes 72.9% within the predicted 15 most probable proteins, as typically displayed in Web interfaces24,25. The success rate versus rank plateaus, reaching 85.9% at rank #100 (Supplementary Fig. 5).
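The success rate at a given rank can be computed from the best rank reached by any experimental target of each test compound. The sketch below illustrates that bookkeeping on invented toy data; the function name `success_rate` and the list `toy_best_ranks` are hypothetical and do not come from the study.

```python
def success_rate(best_ranks, k):
    """Percentage of compounds with at least one known target in the top k.

    best_ranks holds, for each query compound, the best (lowest) rank
    obtained by any of its experimental targets in the predicted list.
    """
    if not best_ranks:
        raise ValueError("best_ranks must not be empty")
    hits = sum(1 for rank in best_ranks if rank <= k)
    return 100.0 * hits / len(best_ranks)


# Toy data for ten hypothetical query compounds (not results from the study).
toy_best_ranks = [1, 1, 3, 12, 40, 250, 1, 7, 15, 1900]

print(success_rate(toy_best_ranks, 1))    # 30.0
print(success_rate(toy_best_ranks, 15))   # 70.0
print(success_rate(toy_best_ranks, 100))  # 80.0
```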
The predictive ability is essentially constant along the classes of heavy atoms in the test compounds (Fig. 3b), with a noisier signal for smaller molecules and a slight increase for larger molecules. The most stable signal is obtained for molecules containing between 20 and 40 heavy atoms, which are the most populated classes (see Supplementary Fig. 2c and Supplementary Fig. 6 for training set and test set, respectively) and those corresponding to drugs and druglike molecules31.
The relationship between the performance of ligand-based reverse screening and the chemical novelty of submitted compounds was reported previously19. However, the size, diversity, and chemical nature of the test set built here justify a renewed evaluation with many more test molecules objectively chemically distinct from the training set. According to the chemical diversity study described in section Physicochemical and chemical spaces, 9% of the entire test set involves compounds described by molecular scaffolds that cannot be extracted from any molecule in the training set, with respect to both the Murcko and the Oprea definitions (Fig. 2a). The large size of the whole test set made it possible to build a so-called “distinct test set” containing as many as 32,748 molecules. Predicting protein targets of these compounds, chemically strictly unrelated to the training set (and to the screening set, which is a subset of the training set), increases a priori the difficulty of the exercise. As expected and as shown in Fig. 2b, the success is lower for the “distinct test set” than for the entire test set. Nevertheless, the success rate is still very acceptable, with a correct target predicted for more than one third (36.2%) of the distinct molecules at highest probability, for more than half of the distinct molecules (51.8%) at rank #5, for more than two-thirds of the distinct molecules (67.0%) at rank #15, finally reaching 80.0% at rank #100. Gratifyingly, this further confirms the robustness of the predictions generated by the method, even outside of the chemical space covered by the training data. This demonstrates the usefulness and relevance of target prediction by ligand-based reverse screening for new compounds close to the state of the art in me-too projects, but also for novel chemotypes in more prospective drug discovery.
Considering only test compounds with exactly one experimental target reported in Reaxys narrows the scope of analysis while retaining a vast external test set (300,978 compounds) (Fig. 3c). Retrieving the correct single target is a more difficult exercise than finding one among several; however, the predictive ability remains very high, with a success rate of 49.7% at highest probability. Moreover, the 70.5% and 83.4% success rates at rank #15 and rank #100 correspond to a list of estimated protein targets enriched 55-fold and 10-fold, respectively. This level of enrichment demonstrates the practicality of reverse virtual screening to provide useful guidance and focus on relevant proteins in experiments like the deconvolution of phenotypic screens, the setup of polypharmacology panels, or the selection of repurposing targets (remarkable applicative examples32,33).
A central question for any ligand-based approach is how the predictive performance varies with the amount of knowledge available. The capacity of reverse screening to predict the correct target improves dramatically as the pool of known ligands expands (Fig. 3d). For targets having 11 to 100 known active compounds (36% of the proteins in the screening set), the success is 40.0% at highest probability and 72.5% at rank #100 (an 8.5-fold enrichment). The success is much higher when the proteins to predict have abundant bioactivity data available, as for those with more than a thousand actives (13% of the proteins in the screening set), with 56.3% at highest ranking and 93.9% at rank #100 (an 11-fold enrichment). Conversely, proteins with only a few known ligands, i.e. 10 or fewer actives (11% in the screening set), show substantially lower success rates, with 11.4% at highest ranking and 19.8% at rank #100 (a 2.3-fold enrichment). This is a measure of the difficulty of finding a very similar active molecule by screening on targets for which only a few are known.
The chemical diversity of actives for a given target was also investigated through two distinct molecular scaffold definitions, the Murcko wire-like frameworks29 and the more abstract Oprea approach30 (Fig. 3e and f). Both analyses lead to the same conclusion: the more diverse the actives, the higher the predictive ability. In both analyses, at rank #15, one can expect a success of about 65% if the target has between 11 and 50 scaffolds among its actives (~36% of the proteins). The success rate drops to about 55% when the number of scaffolds is 10 or fewer, but increases up to about 80% when the number of scaffolds is larger than 500 (only 1–4% in the ChEMBL screening set).
For the first time, the relationship between the performance of ligand-based target prediction and the size and diversity of the bioactivity space is quantified. This is a strong incentive to populate specialized databases with bioactive chemicals and targets as diverse as possible, rather than focusing only on the number of molecules or proteins. Notably, though, the probability calculated through the logistic equation is not affected per se by the number of experimentally active compounds. It is enough to find one active molecule by reverse screening to calculate a prediction, whose relevance relies on the molecular similarity. However, in applicative target prediction tools, it is common practice to impose a limit of similarity below which a known active of the screening set is considered dissimilar and does not enter the calculation of probability, mainly to reduce the computation time25.
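The random baseline behind the fold-enrichment values is not spelled out in the text. The figures quoted are, however, numerically consistent with picking k proteins at random among the 1180 targets of the test set. The sketch below makes that assumption explicit; both the baseline and the names `random_success_pct` and `fold_enrichment` are illustrative inferences, not the documented procedure.

```python
# Assumption: the random baseline draws k proteins among the 1180 targets
# shared by the test and screening sets (inferred from the reported numbers).
N_TARGETS = 1180


def random_success_pct(k, n_targets=N_TARGETS):
    """Expected success (%) when k targets are picked at random."""
    return 100.0 * k / n_targets


def fold_enrichment(success_pct, k, n_targets=N_TARGETS):
    """Observed success divided by the random expectation at the same rank."""
    return success_pct / random_success_pct(k, n_targets)


print(round(random_success_pct(1), 1))      # 0.1
print(round(fold_enrichment(70.5, 15)))     # 55
print(round(fold_enrichment(83.4, 100)))    # 10
```

Under this assumed baseline, the 0.1% random expectation at rank #1 and the 55-fold and 10-fold enrichments at ranks #15 and #100 are all recovered.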
The extent of bioactivity knowledge useful to be reverse screened must be balanced with data quality. The demonstrated excellent capacity to predict targets of bioactive molecules is bound to improve further with continued efforts made on the quantity, the diversity and the quality inside specialized resources. For experts, we strongly suggest extracting open-access data from e.g. ChEMBL20 or PubChem34, or broadly distributed data like in Reaxys, applying strict filtering criteria (see Methods, Data extraction).
Besides having validated and quantified the predictive ability of ligand-based reverse-screening methods at large scale for the first time, we recommend that developers of machine-learning target prediction approaches follow the application-oriented validation strategy (see Methods, Testing strategy) and use large, high-quality, diverse and non-overlapping bioactivity datasets (e.g. both provided datasets from ChEMBL and Reaxys, used here for training and testing, respectively) for future development, validation, and benchmarking studies.
The results and material provided here call for consolidating this bioinformatic method as a valid and mature machine-learning approach in drug research but also in the many applications in biology and chemistry where the protein targets of small molecules need to be estimated. Finally, this supervised machine-learning technology has proven simple and fast enough for implementation behind websites. Simple Web interfaces, like the pioneering Similar Ensemble Approach (SEA, https://sea.bkslab.org)8 or the extensively used and referenced SwissTargetPrediction (http://www.swisstargetprediction.ch)24,25, can quickly provide trustworthy predictions for routine work or for non-experts in the field. Of note, the SwissTargetPrediction webtool has been cited 2260 times (according to Clarivate®, accessed March 19, 2024); 94% were research articles, 78% of them in the categories “Pharmacology, Medicinal Chemistry, Chemistry (Multidisciplinary), Biochemistry and Molecular Biology”, suggesting experimental studies including validation of the computational predictions, as in refs. 32,33. Publications in other categories comprise reviews by medical experts explaining the use of SwissTargetPrediction for drug repurposing in their branch, for instance in cardiology35, or by engineers supporting the choice and underlining the performance of SwissTargetPrediction’s unique logistic model among other ML algorithms in their Experimentalist’s Guide to Machine Learning for Small Molecule Design36.
Methods
Data extraction
The ChEMBL database20 version 25 was chosen as the data source for training the machine-learning model and for building the screening library, for three main reasons: (i) its open-access nature enables unrestricted availability for anyone; (ii) various previous versions of the predictive engine evaluated in this work have been built on ChEMBL data, some of which are in the backend of the renowned and much used SwissTargetPrediction webtool24,25; (iii) the content is contemporaneous with the granted access to the Reaxys database (version 03.2019), which has been an opportunity to shape an unprecedentedly large external set from high-quality data from a different source yet of similar origin, i.e. medicinal chemistry-related bioactivity knowledge.
ChEMBL and Reaxys raw content is annotated and organized differently; however, it has been possible to homogenize them by applying filtering criteria. We used MySQL queries to extract bioactivity data from a local copy of ChEMBL 25 for compounds with 5 to 80 heavy atoms tested in a binding assay (tagged “B” and confidence score >3) on a human, rat or mouse macromolecular target (single protein or protein complex). Bioactivity information for 501,959 unique small molecule compounds was thus extracted: 452,656 actives with IC50, EC50, Ki or KD ≤ 10 µM, and 46,165 considered inactives with IC50, EC50, Ki or KD ≥ 100 µM. In between is a “gray area” of 3138 compounds that were considered neither active nor inactive. Comparable filters were applied to Reaxys 03.2019. Only active compounds were retrieved with IC50, EC50, Ki, KD, Kic or Km ≤ 10 µM, tagged with type binding, enzymatic, generic, second messenger, electrophysiology or transactivation, and category in vitro. In accordance with the validation objective of this work, molecules present in the ChEMBL set were removed from the Reaxys set using the Obgrep program (OpenBabel version 2.4.1)23 and the JChem Search utility (version 21.3, www.chemaxon.com). Moreover, only data points involving targets shared with the ChEMBL training set were retained for the test set. This was achieved by human curation and mapping on UniProt identifiers37. Information about 364,201 compounds active on 1180 human proteins was thus gathered. Of note, the massive reduction of the Reaxys data in response to the needs of this study resulted in using only 5% of the compounds having bioactivity data recorded in version 03.2019, and 8% of the human proteins targeted by bioactive small molecules as included in version 03.2019.
Importantly, all 1180 targets of the Reaxys test set are findable, since they are part of the ChEMBL screening data, whereas each of the 364,201 test compounds was confirmed to be external to the training and screening sets (see Chemoinformatics section).
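The activity thresholds applied to the ChEMBL data can be summarized as a simple three-way labelling rule. The following sketch is only an illustration of those thresholds; the function name `label_activity` is hypothetical, and the actual extraction was done with database queries rather than with this code.

```python
ACTIVE_MAX_UM = 10.0      # IC50, EC50, Ki or KD <= 10 µM: active
INACTIVE_MIN_UM = 100.0   # IC50, EC50, Ki or KD >= 100 µM: inactive


def label_activity(value_um):
    """Classify one measured activity value expressed in µM."""
    if value_um <= ACTIVE_MAX_UM:
        return "active"
    if value_um >= INACTIVE_MIN_UM:
        return "inactive"
    return "gray"  # neither active nor inactive


# The three reported categories add up to the full training set.
counts = {"active": 452_656, "inactive": 46_165, "gray": 3_138}
print(sum(counts.values()))  # 501959
```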
Chemoinformatics
The molecular information included in the bioactivity data extracted as detailed above was submitted to further standardization treatments, identical for both sources. The isomeric SMILES obtained from ChEMBL and the two-dimensional SDF from Reaxys were unsalted, desolvated, neutralized, kekulized with JChem Microservices Structure manipulation tools (version 21.3, www.chemaxon.com) and stored as two separate flat files including all extracted values together with IDs. The training set includes all information from ChEMBL describing the bioactivity of 501,959 unique compounds on 3669 proteins. The test set includes the SMILES, ReaxysID, the number of targets and their UniProt identifiers derived from Reaxys content for 364,201 compounds active on 1180 proteins.
To describe the chemical structure of the compounds, each standardized SMILES was then converted into a molecular fingerprint by the path-based FP2 method implemented in OpenBabel (version 2.4.1), which encodes the presence or absence of linear fragments from 1 to 7 atoms23. These FP2 fingerprints were stored as individual 1024-bit binary strings (Fig. 1a and Supplementary Fig. 1b).
Using JChem Microservices Structure manipulation and Chemical calculations tools (version 21.3, www.chemaxon.com), each standardized SMILES was then protonated as at pH 7.4 before generating the 20 lowest energy conformations, which were stored as multi-MOL2 files. To describe the shape and the spatial projection of physicochemical properties, every conformer of each compound was encoded into a float vector according to the ElectroShape 5D procedure22 as detailed several times19,24,25. In brief, distances are computed between each atom and six centroids encompassing the structure in a 5-dimensional space (three Cartesian coordinates, as well as atomic charge38 and lipophilic contribution39). The average, the standard deviation and the third moment of all distances for one conformation are stored in an ES5D vector of 18 dimensions (\({x}_{n,p}\)), where \({x}_{n,p}\) is the moment of order n of the distances between all atoms and the pth centroid (Fig. 1a and Supplementary Fig. 1b).
Notably, the completeness of the extraction and standardization procedure was verified a posteriori by analyzing the pairs of molecules for which the Tanimoto coefficient (\({T}_{c}\)) on FP2 fingerprints equals 1.000. In these cases, either: i) one molecule is a large substructure of the other (a known limitation of path-based fingerprints); or ii) the compounds are different salts or solvation forms of the same parent molecule; or iii) the compounds differ by stereochemistry. In all cases, both compounds were kept, since they are not linked to the same bioactivity, in both ChEMBL and Reaxys entries. The same molecule was never found twice.
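For readers unfamiliar with fingerprint comparison, the Tanimoto coefficient used as 2D-Score can be sketched in a few lines. In this illustration a fingerprint is held as a Python integer whose set bits mark the fragments present; this representation, the function name `tanimoto`, and the convention of returning 0.0 for two empty fingerprints are assumptions made for the example, and the short bit patterns stand in for real 1024-bit FP2 fingerprints.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient between two bit-vector fingerprints.

    Tc = (bits set in both) / (bits set in either).
    """
    common = bin(fp_a & fp_b).count("1")
    either = bin(fp_a | fp_b).count("1")
    # Convention chosen here: two empty fingerprints are scored 0.0.
    return common / either if either else 0.0


# Toy 4-bit fingerprints.
fp_query = 0b1011
fp_known = 0b0011
print(tanimoto(fp_query, fp_known))   # 2 shared bits out of 3 set overall
print(tanimoto(fp_query, fp_query))   # 1.0 for identical fingerprints
```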
Physicochemical description
The physicochemical spaces covered by the training set and the test set were measured by seven descriptors (See Supplementary Fig. 3). The SwissADME web tool40 was used to calculate the molecular weight (MW), the n-octanol/water partition coefficient (WLOGP)39, the topological polar surface area (TPSA)41, the number of rotatable bonds, the fraction of sp3 carbon (fCsp3), the number of H-bond acceptors (HBA) and the number of H-bond donors (HBD), for the 501,959 training compounds and the 364,201 test compounds.
The overlap of each descriptor distribution between the training set and the test set was quantified by the Z-factor28, calculated according to Eq. (1) (see Table 1):
\[Z=1-\frac{3({\sigma }_{tr}+{\sigma }_{ts})}{|{\mu }_{tr}-{\mu }_{ts}|}\tag{1}\]
where σtr is the standard deviation of the descriptor values in the training set; σts is the standard deviation of the descriptor values in the test set; µtr is the average of the descriptor values in the training set; µts is the average of the descriptor values in the test set.
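As an illustration, the Z-factor can be computed from two lists of descriptor values as sketched below, assuming the standard definition Z = 1 − 3(σtr + σts)/|µtr − µts|. The use of the population standard deviation and the handling of identical means are choices made for this example, and the molecular-weight values are invented.

```python
from statistics import mean, pstdev


def z_factor(train_values, test_values):
    """Z-factor between two distributions of one descriptor.

    Strongly negative values indicate heavily overlapping distributions;
    values close to 1 indicate well-separated ones.
    """
    gap = abs(mean(train_values) - mean(test_values))
    spread = pstdev(train_values) + pstdev(test_values)
    if gap == 0:
        # Identical means: treated here as maximal overlap.
        return float("-inf")
    return 1 - 3 * spread / gap


# Invented molecular-weight samples (g/mol) for two overlapping sets.
mw_train = [300, 350, 400, 450, 500]
mw_test = [320, 370, 420, 470, 520]
print(z_factor(mw_train, mw_test) < 0)  # True: the distributions overlap
```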
Training methodology
Two similarity matrices were computed by pair-wise comparisons between all 501,959 compounds of the ChEMBL training set described by shape (ES5D vectors) and chemical structure (FP2 fingerprints) (Supplementary Fig. 1c). For shape comparison, the 3D-Score similarity matrix is built with the highest Manhattan-based similarity values (\({M}_{{s}_{i,j}}=1/(1+\frac{1}{18}{d}_{i,j})\)), where \({d}_{i,j}\) is the smallest Manhattan distance between all 20 × 20 pairs of ES5D vectors, each encoding a different conformation for molecules i and j. For chemical structure comparison, the 2D-Score matrix contains the Tanimoto coefficients (\({T}_{{c}_{i,j}}\)) between all FP2 fingerprints of pairs for molecules i and j.
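The Manhattan-based shape similarity defined above can be sketched as follows. Each molecule is represented by a list of conformer vectors, and the smallest distance over all conformer pairs is converted into a similarity. The function names are illustrative, and the toy vectors are not real ES5D descriptors.

```python
VECTOR_LENGTH = 18  # dimensions of one ES5D vector


def manhattan(u, v):
    """Manhattan (city-block) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def shape_similarity(conformers_i, conformers_j):
    """Ms = 1 / (1 + d/18), with d the smallest distance over all pairs."""
    d_min = min(manhattan(u, v) for u in conformers_i for v in conformers_j)
    return 1.0 / (1.0 + d_min / VECTOR_LENGTH)


# Toy example: molecule i has two conformers, molecule j has one.
mol_i = [[0.0] * 18, [1.0] * 18]
mol_j = [[1.0] * 18]
print(shape_similarity(mol_i, mol_j))            # 1.0, one pair is identical
print(shape_similarity([[0.0] * 18], mol_j))     # 0.5, distance of 18
```

Taking the minimum distance means that a single well-matched pair of conformations is enough for two flexible molecules to be scored as similar in shape.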
The ChEMBL training set was split into subsets, each one corresponding to a given number of heavy atoms in the first (“query”) molecule of every pair. Subsets were thus prepared from 11 to 59 heavy atoms; smaller molecules were grouped in one class (≤ 10 heavy atoms) and larger molecules in another one (≥ 60 heavy atoms) to finally define 51 size-dependent training subsets (Supplementary Fig. 1d). Each compound in the training subset (active or inactive) was compared to all known actives of each target. Inactive compounds were defined as having an experimental activity higher than or equal to 100 µM (see Methods, Data extraction), or as not being reported active by ChEMBL in any binding assay on the protein under consideration (i.e. alleged inactives). The ratio of 10 inactives for 1 active, previously defined empirically as most suited19, was applied. For every comparison, \({M}_{{s}_{i,j}}\) and \({T}_{{c}_{i,j}}\) were retrieved, and the highest values for each similarity metric were considered as the features of the model, i.e. 3D-Score and 2D-Score, respectively. In practice, each line of a subset training file concerns one training compound and one target, and reports the 3D-Score, 2D-Score, and “1” or “0” to indicate whether this query molecule is active or inactive on that target.
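The assignment of a query molecule to one of the 51 size-dependent subsets can be sketched as a simple binning function. Labelling the two pooled classes as 10 and 60 is a convention chosen for this example only.

```python
def size_class(n_heavy_atoms):
    """Map a heavy-atom count to one of the 51 size classes.

    Molecules with 10 or fewer heavy atoms share one class (labelled 10),
    those with 60 or more share another (labelled 60), and every count
    from 11 to 59 has its own class.
    """
    if n_heavy_atoms <= 10:
        return 10
    if n_heavy_atoms >= 60:
        return 60
    return n_heavy_atoms


# Compounds in the training data have 5 to 80 heavy atoms.
classes = {size_class(n) for n in range(5, 81)}
print(len(classes))  # 51
```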
For each of the 51 subsets, a binary logistic model was trained to find the best regression constant C and coefficients (c1, c2) for both features (3D-Score and 2D-Score), according to Eq. (2), where the Probability of being active on a given target is “1” or “0” for all training compounds.
\[\mathrm{Probability}=\frac{1}{1+{e}^{-(C+{c}_{1}\cdot \text{3D-Score}+{c}_{2}\cdot \text{2D-Score})}}\tag{2}\]
The default parameters of the LogisticRegression function of the scikit-learn program (version 0.23.2) were used.
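Once C, c1 and c2 are known, applying the model amounts to evaluating a two-feature logistic function. The sketch below assumes the usual logistic form, Probability = 1/(1 + exp(−(C + c1·3D-Score + c2·2D-Score))); the coefficient values are hypothetical placeholders and not the trained values of the study.

```python
import math


def probability(score_3d, score_2d, C, c1, c2):
    """Two-feature logistic function, written to avoid overflow."""
    z = C + c1 * score_3d + c2 * score_2d
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical coefficients, for illustration only.
C, c1, c2 = -5.0, 4.0, 4.0
print(probability(0.95, 0.90, C, c1, c2))  # high similarity, probability > 0.5
print(probability(0.40, 0.30, C, c1, c2))  # low similarity, probability < 0.5
```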
The internal robustness was monitored by 10-fold cross-validation. Matthews correlation coefficients (MCC) were calculated with Eq. (3) and were averaged over the 10 cross-validation sets (MCCcv, Supplementary Fig. 2b). Averaged precision and recall were likewise calculated with Eq. (4) and Eq. (5), respectively. Supplementary Table 1 provides, for each size-related training subset, the MCCcv with standard deviation, together with precision and recall. It should be noted that the 10 folds are random, given the construction of the subsets, which are shuffled and contain no information about molecule or target.
\[\mathrm{MCC}=\frac{TA\times TI-FA\times FI}{\sqrt{(TA+FA)(TA+FI)(TI+FA)(TI+FI)}}\tag{3}\]
\[\mathrm{Precision}=\frac{TA}{TA+FA}\tag{4}\]
\[\mathrm{Recall}=\frac{TA}{TA+FI}\tag{5}\]
where TA is the number of known actives returning Probability > 0.5; TI is the number of inactives returning Probability ≤ 0.5; FA is the number of inactives returning Probability > 0.5; FI is the number of known actives returning Probability ≤ 0.5.
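The three metrics can be computed from the four counts defined above, as in the following sketch, which assumes the standard definitions of MCC, precision and recall. Returning 0.0 when a denominator is zero is a convention chosen for this example, and the counts shown are invented.

```python
import math


def mcc(ta, ti, fa, fi):
    """Matthews correlation coefficient from the four confusion counts."""
    denom = math.sqrt((ta + fa) * (ta + fi) * (ti + fa) * (ti + fi))
    return (ta * ti - fa * fi) / denom if denom else 0.0


def precision(ta, fa):
    """Fraction of predicted actives that are known actives."""
    return ta / (ta + fa) if (ta + fa) else 0.0


def recall(ta, fi):
    """Fraction of known actives that are predicted active."""
    return ta / (ta + fi) if (ta + fi) else 0.0


# Invented counts for one cross-validation fold (1 active per 10 inactives).
ta, ti, fa, fi = 80, 980, 20, 20
print(round(mcc(ta, ti, fa, fi), 3), precision(ta, fa), recall(ta, fi))
```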
To reduce the noise from the training, the final coefficients to be employed for calculating probability of predictions were obtained by fitting the C, c1 and c2 curves via a third-degree polynomial function. This way a set of final coefficients for predicting targets are obtained for each of the 51 subsets (Supplementary Fig. 2a).
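The smoothing step can be illustrated with an ordinary least-squares cubic fit of one coefficient against the size class. The sketch below is a dependency-free stand-in (a library routine such as `numpy.polyfit` would serve the same purpose); it is not necessarily how the fit was performed in the study, and the input values are synthetic. At least four distinct x values are required.

```python
def fit_cubic(xs, ys):
    """Least-squares cubic fit; returns a function evaluating the curve."""
    lo, hi = min(xs), max(xs)
    mid = (lo + hi) / 2.0
    half = (hi - lo) / 2.0 or 1.0
    ts = [(x - mid) / half for x in xs]  # rescale x to [-1, 1]
    n = 4  # number of polynomial coefficients
    # Augmented normal equations [A^T A | A^T y] for the basis 1, t, t^2, t^3.
    m = [[sum(t ** (i + j) for t in ts) for j in range(n)]
         + [sum(y * t ** i for t, y in zip(ts, ys))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    coef = [m[i][n] / m[i][i] for i in range(n)]
    return lambda x: sum(c * ((x - mid) / half) ** i
                         for i, c in enumerate(coef))


# Synthetic "trained" values of one coefficient for the 51 size classes,
# built as a smooth trend plus alternating noise.
sizes = list(range(10, 61))
trend = [0.5 + 0.01 * x - 2e-4 * x ** 2 + 1e-6 * x ** 3 for x in sizes]
noisy = [v + (0.01 if i % 2 else -0.01) for i, v in enumerate(trend)]
smooth = fit_cubic(sizes, noisy)
final_coefficients = {x: smooth(x) for x in sizes}  # one value per subset
```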
Testing strategy
All 364,201 active compounds of the Reaxys external test set were reverse screened against the screening set, which corresponds to the active part of the ChEMBL training set organized by known actives per human target (in total 405,544 molecules active on 2069 proteins, Fig. 1b). For this, the 20 ES5D vectors that encode the shape of the query molecule were compared to the 20 ES5D vectors of all ChEMBL active compounds on one target. The highest \({M}_{s}\), corresponding to the active most similar in shape, is considered as the 3D-Score parameter. Similarly, the FP2 fingerprint describing the chemical structure of the query compound is compared to the FP2 fingerprints of all ChEMBL active compounds on one of the 2069 targets of interest. The highest \({T}_{c}\) value corresponds to the active most similar in structure and is considered as the 2D-Score parameter. The probability for the protein to be targeted by the query molecule is obtained by entering both parameters (3D-Score and 2D-Score) in the logistic Eq. (2) together with the final coefficients and constant (c1, c2 and C) obtained by training and curve-smoothing from the subset corresponding to the number of heavy atoms in the query molecule (Fig. 1c).
The search for the most similar actives according to shape or to chemical structure (which can be the same compound or two different compounds) and the calculation of the probability were repeated independently for all 2069 protein targets of the screening set. The final output of this testing workflow is a list of the 2069 possible targets ranked from the most probable to the least probable. The predictive ability of the ligand-based reverse screening is quantified from the ranks of the known experimental targets for all 364,201 external test compounds (Fig. 3). It is important to note that in the context of this reverse screening, and for the core of the study presented here, the calculated probability values are considered as scores whose sole purpose is to rank the predicted targets.
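The testing workflow described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the similarities of the query to each active (shape \({M}_{s}\) and structural \({T}_{c}\)) are taken as precomputed inputs, the logistic Eq. (2) is assumed to have the usual form 1 / (1 + exp(-(C + c1·3D-Score + c2·2D-Score))), and all function names, target identifiers and coefficient values are hypothetical.

```python
import math


def target_probability(score_3d, score_2d, const, c1, c2):
    """Logistic combination of both scores (assumed form of Eq. 2)."""
    return 1.0 / (1.0 + math.exp(-(const + c1 * score_3d + c2 * score_2d)))


def rank_targets(similarities_by_target, const, c1, c2):
    """Rank targets from most to least probable for one query molecule.

    similarities_by_target maps a target identifier to a list of (Ms, Tc)
    pairs, one pair per known active of that target (precomputed here).
    """
    scored = []
    for target, pairs in similarities_by_target.items():
        score_3d = max(ms for ms, _ in pairs)  # most similar active in shape
        score_2d = max(tc for _, tc in pairs)  # most similar active in structure
        # The two maxima may come from the same active or from different ones.
        prob = target_probability(score_3d, score_2d, const, c1, c2)
        scored.append((target, prob))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


def rank_of_known_target(ranking, known_target):
    """1-based rank of an experimentally known target in the ranked list."""
    for position, (target, _) in enumerate(ranking, start=1):
        if target == known_target:
            return position
    return None


# Hypothetical query against three targets with made-up coefficients.
example = {
    "P1": [(0.9, 0.2), (0.3, 0.8)],
    "P2": [(0.4, 0.4)],
    "P3": [(0.95, 0.95)],
}
ranking = rank_targets(example, const=-5.0, c1=3.0, c2=4.0)
```

Because the probabilities are used only as scores, any monotonic transformation of them would produce the same ranked list; what matters for the evaluation is the rank returned by `rank_of_known_target` for each known experimental target.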
Scaffold computation
Two different scaffold definitions were applied to the molecules of both the training set and the test set. All standardized SMILES (see Chemoinformatics section) were submitted to the strip-it program (version 1.0.2, www.silicos-it.be) to extract the wire-like frameworks proposed by Bemis and Murcko29 (MURCKO_2 definition) and the more abstract Oprea scaffolds30 (OPREA_2 definition). This enabled the description of the chemical space covered by both molecular sets (Supplementary Fig. 3) and the creation of the so-called “Distinct test set” of 32,748 external test compounds whose scaffolds were not extracted from any training molecule, according to both the Murcko and the Oprea definitions (Fig. 2a). This subset (9% of the entire test set) allowed the success rate of the ligand-based reverse screening to be evaluated for molecules that are objectively chemically different from the training set (Fig. 2b). Likewise, grouping the scaffolds of the known actives per protein made it possible to evaluate the predictive capacity of reverse screening with respect to the chemical diversity of actives for a given target (Fig. 3e and Fig. 3f).
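The selection of the Distinct test set amounts to a set-membership filter, sketched below. The sketch assumes that the Murcko and Oprea scaffolds have already been extracted (for instance with strip-it) and are available as strings, and it reads "according to both definitions" as requiring that neither scaffold of a test compound occurs in the training set. The function name, identifiers and scaffold strings are hypothetical.

```python
def distinct_test_set(test_scaffolds, train_scaffolds):
    """Select test compounds whose scaffolds are absent from the training set.

    test_scaffolds: dict mapping a compound identifier to a
        (murcko_scaffold, oprea_scaffold) pair of strings.
    train_scaffolds: iterable of (murcko_scaffold, oprea_scaffold) pairs,
        one per training molecule.
    """
    train_murcko = set()
    train_oprea = set()
    for murcko, oprea in train_scaffolds:
        train_murcko.add(murcko)
        train_oprea.add(oprea)

    # Keep a compound only if it is new under BOTH scaffold definitions.
    return [
        compound_id
        for compound_id, (murcko, oprea) in test_scaffolds.items()
        if murcko not in train_murcko and oprea not in train_oprea
    ]


# Hypothetical scaffold labels, for illustration only.
training = [("M_a", "O_x"), ("M_b", "O_x"), ("M_c", "O_y")]
testing = {
    "T1": ("M_a", "O_x"),  # both scaffolds known from training
    "T2": ("M_d", "O_x"),  # new Murcko scaffold, known Oprea scaffold
    "T3": ("M_e", "O_z"),  # new under both definitions
}
distinct = distinct_test_set(testing, training)
```

Since Oprea scaffolds are more abstract than Murcko frameworks, a compound with a new Murcko framework can still share its Oprea scaffold with a training molecule, as for "T2" above; requiring novelty under both definitions therefore gives the stricter subset.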
Data availability
Bioactivity data were obtained from the ChEMBL (version 25) and the Reaxys (version 03.2019) databases for training/screening and testing, respectively. A short extract of the raw ChEMBL data for training is given in Supplementary Table 2, showing three lines that correspond to an active, an inactive and a “gray area” datapoint, respectively. Processed data have been deposited in a Zenodo repository (https://doi.org/10.5281/zenodo.7534175). The screening set file contains, for each active compound, the standardized SMILES, the ChEMBLID, the number of experimental target(s) and their UniProt identifier(s). Similarly, the test set file contains, for each active compound, the ReaxysID, the number of experimental target(s) and the UniProt identifier(s). The first 300 entries of the test set file also display the standardized SMILES so that every reader can reproduce the results obtained by the reverse screening exercise. For Reaxys users, the chemical structures can be obtained through a bulk request on the corresponding website. Access to www.reaxys.com and to Reaxys data can be obtained by contacting Elsevier directly. The construction of logistic models was performed on the data described here by strictly following the steps detailed in the methodological articles19,24,25,42 and their supplementary materials.
References
Peón, A., Dang, C. C. & Ballester, P. J. How reliable are ligand-centric methods for target fishing? Front. Chem. 4, 492 (2016).
Mestres, J., Gregori-Puigjané, E., Valverde, S. & Solé, R. V. The topology of drug-target interaction networks: implicit dependence on drug properties and target families. Mol. Biosyst. 5, 1051–1057 (2009).
Schuffenhauer, A., Floersheim, P., Acklin, P. & Jacoby, E. Similarity metrics for ligands reflecting the similarity of the target proteins. J. Chem. Inf. Comput. Sci. 43, 391–405 (2003).
Horvath, D. & Jeandenans, C. Neighborhood behavior of in silico structural spaces with respect to in vitro activity spaces−a novel understanding of the molecular similarity principle in the context of multiple receptor binding profiles. J. Chem. Inf. Comput. Sci. 43, 680–690 (2003).
Paolini, G. V., Shapland, R. H. B., van Hoorn, W. P., Mason, J. S. & Hopkins, A. L. Global mapping of pharmacological space. Nat. Biotechnol. 24, 805–815 (2006).
Oprea, T. I., Tropsha, A., Faulon, J.-L. & Rintoul, M. D. Systems chemical biology. Nat. Chem. Biol. 3, 447–450 (2007).
Hert, J., Keiser, M. J., Irwin, J. J., Oprea, T. I. & Shoichet, B. K. Quantifying the Relationships among Drug Classes. J. Chem. Inf. Model. 48, 755–765 (2008).
Keiser, M. J. et al. Relating protein pharmacology by ligand chemistry. Nat. Biotechnol. 25, 197–206 (2007).
Keiser, M. J. et al. Predicting new molecular targets for known drugs. Nature 462, 175–182 (2009).
Lounkine, E. et al. Large-scale prediction and testing of drug activity on side-effect targets. Nature 486, 361–367 (2012).
Byrne, R. & Schneider, G. In silico target prediction for small molecules. Methods Mol. Biol. 1888, 273–309 (2019).
Comess, K. M. et al. Emerging approaches for the identification of protein targets of small molecules - a practitioners’ perspective. J. Med. Chem. 61, 8504–8535 (2018).
Sydow, D. et al. Advances and challenges in computational target prediction. J. Chem. Inf. Model. 59, 1728–1742 (2019).
Sturm, N. et al. Industry-scale application and evaluation of deep learning for drug target prediction. J. Cheminform. 12, 26 (2020).
Mathai, N., Chen, Y. & Kirchmair, J. Validation strategies for target prediction methods. Brief. Bioinforma. 21, 791–802 (2020).
Mervin, L. H., Afzal, A. M., Engkvist, O. & Bender, A. Comparison of scaling methods to obtain calibrated probabilities of activity for protein-ligand predictions. J. Chem. Inf. Model. 60, 4546–4559 (2020).
Ye, Q., Zhang, X. & Lin, X. Intelligent Computing Theories and Application, 17th International Conference, ICIC 2021, Lecture Notes in Computer Science, Vol. 12838, (eds Huang, D. S., Jo, K. H., Li, J., Gribova, V. & Premaratne, P.) 87–99 (Springer, Cham, 2021).
Yang, S.-Q. et al. Current advances in ligand-based target prediction. WIREs Comput Mol. Sci. 11, e1504 (2021).
Gfeller, D., Michielin, O. & Zoete, V. Shaping the interaction landscape of bioactive molecules. Bioinformatics 29, 3073–3079 (2013).
Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 47, D930–D940 (2019).
Bragina, M. E., Daina, A., Perez, M. A. S., Michielin, O. & Zoete, V. The SwissSimilarity 2021 web tool: novel chemical libraries and additional methods for an enhanced ligand-based virtual screening experience. Int J. Mol. Sci. 23, 811 (2022).
Armstrong, M. S., Finn, P. W., Morris, G. M. & Richards, W. G. Improving the accuracy of ultrafast ligand-based screening: incorporating lipophilicity into ElectroShape as an extra dimension. J. Comput Aided Mol. Des. 25, 785–790 (2011).
O’Boyle, N. M. et al. OpenBabel: An open chemical toolbox. J. Cheminform. 3, 33 (2011).
Gfeller, D. et al. SwissTargetPrediction: a web server for target prediction of bioactive small molecules. Nucleic Acids Res. 42, W32–W38 (2014).
Daina, A., Michielin, O. & Zoete, V. SwissTargetPrediction: updated data and new features for efficient prediction of protein targets of small molecules. Nucleic Acids Res. 47, W357–W364 (2019).
Nobeli, I., Favia, A. D. & Thornton, J. M. Protein promiscuity and its implications for biotechnology. Nat. Biotechnol. 27, 157–167 (2009).
Reaxys. (Copyright © 2023 Elsevier Limited except certain content provided by third parties. Reaxys® is a trademark of Elsevier Limited.).
Zhang, J.-H., Chung, T. D. Y. & Oldenburg, K. R. A Simple Statistical Parameter for Use in Evaluation and Validation of High Throughput Screening Assays. SLAS Discov. 4, 67–73 (1999).
Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39, 2887–2893 (1996).
Pollock, S. N., Coutsias, E. A., Wester, M. J. & Oprea, T. I. Scaffold topologies. 1. Exhaustive enumeration up to eight rings. J. Chem. Inf. Model. 48, 1304–1310 (2008).
Leeson, P. D. et al. Target-Based Evaluation of “Drug-Like” Properties and Ligand Efficiencies. J. Med Chem. 64, 7210–7230 (2021).
Carotenuto, P. et al. Targeting the MITF/APAF-1 axis as salvage therapy for MAPK inhibitors in resistant melanoma. Cell Rep. 41, 111601 (2022).
Bhunia, D. et al. Spatial position regulates power of tryptophan: discovery of a major-groove-specific nuclear-localizing, cell-penetrating tetrapeptide. J. Am. Chem. Soc. 140, 1697–1714 (2018).
Kim, S. et al. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res 49, gkaa971 (2020).
Abdelsayed, M., Kort, E. J., Jovinge, S. & Mercola, M. Repurposing drugs to treat cardiovascular disease in the era of precision medicine. Nat. Rev. Cardiol. 19, 751–764 (2022).
Lindley, S. E., Lu, Y. & Shukla, D. The experimentalist’s guide to machine learning for small molecule design. ACS Appl. Bio Mater. 7, 657–684 (2023).
Consortium, U. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47, D506–D515 (2019).
Halgren, T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem. 17, 490–519 (1996).
Wildman, S. A. & Crippen, G. M. Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Comput. Sci. 39, 868–873 (1999).
Daina, A., Michielin, O. & Zoete, V. SwissADME: a free web tool to evaluate pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small molecules. Sci. Rep. 7, 42717 (2017).
Ertl, P., Rohde, B. & Selzer, P. Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J. Med. Chem. 43, 3714–3717 (2000).
Gfeller, D. & Zoete, V. Protein homology reveals new targets for bioactive small molecules. Bioinformatics 31, 2721–2727 (2015).
Acknowledgements
The authors are grateful to Elsevier Ltd for having provided data from Reaxys (www.reaxys.com), which also covers synthesis and diverse other topic areas, as well as additional bioactivity data that were not used in this study. In particular, we would like to thank Pieder Caduff, Olivier Barberan and Rosalind Sankey for the great discussions and support. The data set derived from Reaxys and used in this study is provided by Elsevier Ltd under license. We acknowledge ChemAxon (www.chemaxon.com) for the licensing agreement.
Author information
Contributions
V.Z. conceptualized and supervised the study. A.D. performed the analyses. V.Z. and A.D. scripted, and wrote, reviewed and approved the manuscript.
Ethics declarations
Competing interests
V.Z. is a consultant for Cellestia Biotech. V.Z. and A.D. performed a research contract for Elsevier. However, the study described in this manuscript is independent of this contract.
Peer review
Peer review information
Communications Chemistry thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Daina, A., Zoete, V. Testing the predictive power of reverse screening to infer drug targets, with the help of machine learning. Commun Chem 7, 105 (2024). https://doi.org/10.1038/s42004-024-01179-2