Journal of Computer-Aided Molecular Design, Volume 31, Issue 4, pp 335–347

Ligand-based virtual screening under partial shape constraints


Abstract

Ligand-based virtual screening has proven to be a viable technology during the search for new lead structures in drug discovery. Despite the rapidly increasing number of published methods, meaningful shape matching as well as ligand and target flexibility still remain open challenges. In this work, we analyze the influence of knowledge-based sterical constraints on the performance of the recently published ligand-based virtual screening method mRAISE. We introduce the concept of partial shape matching, enabling a more differentiated view on chemical structure. The new method is integrated into the LBVS tool mRAISE, providing multiple options for such constraints. The applied constraints can be derived either automatically from a protein–ligand complex structure or by manual selection of ligand atoms. In this way, the descriptor directly encodes the fit of a ligand into the binding site. Furthermore, the conservation of close contacts between the binding site surface and the query ligand can be enforced. We validated our new method on the DUD and DUD-E datasets. Although the statistical performance remains on the same level, a detailed analysis reveals that for certain targets, especially very flexible ones, a significant improvement can be achieved. This is further highlighted by the quality of calculated molecular alignments on the recently introduced mRAISE dataset. The new partial shape constraints improved the overall quality of molecular alignments, especially for difficult targets with highly flexible or differently sized molecules. The software tool mRAISE is freely available on Linux operating systems for evaluation purposes and academic use (see http://www.zbh.uni-hamburg.de/raise).

Keywords

Ligand-based virtual screening; Molecular similarity; Structural alignment; 3D similarity searching; Lead discovery; Partial shape; User-defined constraints

Introduction

Ligand-based virtual screening (LBVS) has traditionally been among the most widely applied tools in computer-aided drug design [1]. Recently, we introduced a new method based on the RApid Index-based Screening Engine (RAISE) [2, 3] named mRAISE [4] and evaluated its capability for virtual screening. In that study, we compared its ability to separate active ligands from a set of decoys for certain targets to a variety of other state-of-the-art methods such as ROCS [5], ShaEP [6], MolShaCS [7], LIGSIFT [8] and Align–It [9]. The overall performance, measured by the area under the ROC curve (AUC) as well as the enrichment of actives at early percentages of top-ranking hits, placed mRAISE at the top ranks of all compared methods.

In a second experiment the quality of the three-dimensional alignments of molecules has been validated on the new mRAISE dataset specially created for this purpose. This dataset consists of 182 prealigned ligands for eleven diverse targets extracted from high-resolution PDB structures [10]. mRAISE is able to calculate alignments with an RMSD of less than 2 Å among the top ten ranked hits for 80.8% of all pairwise alignments. Furthermore, it achieves median RMSD values of less than 2 Å for eight of the eleven target groups.

In its previous release, mRAISE already tried to address some of the ongoing challenges in LBVS, such as the dependency on high-quality molecule conformations and the lack of methods performing partial shape matching. This was done using the so-called partial bulk method [11], which considers only certain parts of the steric description of the surrounding structure when comparing two RAISE descriptors. Nevertheless, the alignment validation in particular showed remaining difficulties in the comparison of molecules that are highly flexible or diverge significantly in size.

In the following, we introduce and analyze new methods for partial shape matching in mRAISE. Recently, there has been an increased interest in the combination of LBVS and SBVS methods in order to exploit all the available information and enhance the potential screening success [12, 13, 14, 15, 16, 17, 18]. Following this trend, the new partial shape constraints in mRAISE can be either manually constructed by an experienced user or automatically derived from protein–ligand complexes. Constraints derived from protein–ligand complexes are used either to search for similar small molecules with a higher chance to fit into the binding site when aligned to the query structure or to maintain close contacts to the protein surface. Validation on the Directory of Useful Decoys (DUD) [19] and the Directory of Useful Decoys Enhanced (DUD-E) [20] datasets shows that, while not superior on average, the new constraints can significantly improve the screening performance on certain targets. We further evaluate the quality of the calculated molecular alignments using the mRAISE dataset [4]. The introduced knowledge-based shape constraints demonstrate their benefits especially during the comparison of highly flexible molecules or molecules that differ greatly in size.

Methods

We recently introduced mRAISE [4], a new method for LBVS based on the RAISE approach representing molecules as three-point pharmacophore descriptors with a canonically oriented shape description. mRAISE achieves efficiency by employing an index-based database based on compressed bitmap indices [21]. Such a prepared index can be repetitively screened using descriptors calculated for any query molecule.

During the screening process, all attributes of a query descriptor are compared with those of the descriptors stored in the index using SQL-like database queries. Each pair of matching descriptors represents a possible structural alignment of the respective molecules, which is scored in mRAISE using atom centered Gaussian functions combined with weights based on matching and mismatching physicochemical properties.

Descriptor

Fig. 1

Example of a RAISE descriptor as used in mRAISE. Only every second ray of the shape descriptor is displayed here. Red spheres hydrogen bond donor interaction point, yellow sphere hydrophobic interaction points

A detailed description of the RAISE descriptor and its adaptations for LBVS can be found in the previous publications of mRAISE [4] and cRAISE [3]. In short, a RAISE descriptor is a three-point pharmacophore triangle with an additional description of the surrounding molecular surface (see Fig. 1).

Each descriptor includes coordinates of the three triangle corners, which can be of type hydrogen bond donor, hydrogen bond acceptor or hydrophobic, as well as the possible interaction directions in case of polar interactions. Additionally, the descriptor contains a rough description of the surrounding shape in form of the lengths of 80 equally distributed rays measuring the distance from the triangle center to the molecular surface. All rays are defined with respect to a canonical local coordinate system such that a ray-by-ray comparison is made possible. These rays are of great importance for the definition of partial shape constraints on query descriptors and will be utilized in the following. For the progression of this work, this part of the descriptor will be referred to as bulk rays. As in other RAISE applications, tolerances of ±1.0 Å are allowed during the comparison of triangle side lengths as well as bulk ray lengths.
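The ray-by-ray comparison with a fixed tolerance can be illustrated with a minimal sketch. It is not the mRAISE implementation, which evaluates such conditions through bitmap index queries; the function names are hypothetical, and interaction types and directions are ignored here. Only the number of rays (80) and the tolerance (±1.0 Å) are taken from the text.

```python
from typing import Sequence

N_BULK_RAYS = 80   # number of bulk rays per descriptor (from the text)
TOLERANCE = 1.0    # allowed deviation in Å for side and ray lengths (from the text)


def within_tolerance(a: float, b: float, tol: float = TOLERANCE) -> bool:
    """Return True if two lengths differ by at most tol."""
    return abs(a - b) <= tol


def descriptors_match(
    query_sides: Sequence[float],
    query_rays: Sequence[float],
    cand_sides: Sequence[float],
    cand_rays: Sequence[float],
    tol: float = TOLERANCE,
) -> bool:
    """Full similarity match: every triangle side and every bulk ray agrees.

    Both descriptors are assumed to be expressed in the same canonical
    local coordinate system, so ray i of the query corresponds to ray i
    of the candidate.
    """
    if len(query_sides) != len(cand_sides) or len(query_rays) != len(cand_rays):
        raise ValueError("descriptors must have the same number of sides and rays")
    sides_ok = all(within_tolerance(q, c, tol) for q, c in zip(query_sides, cand_sides))
    rays_ok = all(within_tolerance(q, c, tol) for q, c in zip(query_rays, cand_rays))
    return sides_ok and rays_ok
```

Because the rays share a canonical ordering, the comparison reduces to an element-wise test, which is what makes an index-based lookup feasible.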

Individual shape queries

An important difference between methods based on RAISE is the comparison of the bulk rays during the screening procedure. On the one hand, mRAISE [4] and TrixP [11] search for similarity in the screened structures and therefore match rays of equal lengths. On the other hand, iRAISE [2] and cRAISE [3] try to fit ligands into protein binding sites, which means that ligand descriptor rays need to be shorter than those of a binding site descriptor.

Furthermore, in the case of similarity matching, the so-called partial bulk matching has been used to incorporate structural flexibility; it only requires a certain percentage of adjacent rays to match. In the following, we introduce new methods to incorporate additional information derived from protein–ligand complexes or manual selection into the sampling and comparison of specific fractions of the bulk rays in mRAISE. While queries derived from complexes make use of more information than just the ligand, if available, the manual selection of important regions of the molecule enables chemists to guide the virtual screening process towards their individual preferences. This new concept of manually selected constraints allows a unique incorporation of expert knowledge into shape-based virtual screening.

To avoid misconceptions in the following, the previously published version of mRAISE will be referred to as mRAISE_classic.

Contact queries

Fig. 2

Simplified depiction of the selection of bulk rays for contact queries. Rays are selected by the distance between the ligand surface (black) and the protein surface (blue): Rays are selected if the surfaces overlap (a) or are in close contact (b) (green rays). Not selected are rays pointing towards bulk (c) and rays where ligand and protein surface are too far away from each other (d) (red rays)

Molecular interactions between a protein and its bound ligand require geometrically close contacts. Regions where such contacts occur are therefore of great importance for the activity of a ligand and of special interest during virtual screening.

In case a protein–ligand complex is available, contact information can be incorporated into the screening procedure. A new mode was implemented in mRAISE, which only uses bulk rays intersecting with the protein surface at a distance of up to 0.5 Å after leaving the surface of the molecule (see Fig. 2). In some cases, this method leads to triangles with very few used bulk rays. To prevent insignificant matches, descriptors with fewer than five used rays are excluded during screening. In the following, this mode will be referred to as mRAISE_contact.
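The selection rule for contact queries can be sketched as follows. This is only an illustration under stated assumptions: each ray is represented by the distance at which it leaves the ligand surface and the distance at which it reaches the protein surface (None if it never does), both measured from the triangle center. These inputs and all function names are hypothetical. The 0.5 Å cutoff and the minimum of five rays are taken from the text.

```python
from typing import List, Optional, Sequence

CONTACT_CUTOFF = 0.5   # Å between ligand and protein surface (from the text)
MIN_CONTACT_RAYS = 5   # descriptors with fewer selected rays are excluded (from the text)


def select_contact_rays(
    ligand_exit: Sequence[float],
    protein_hit: Sequence[Optional[float]],
    cutoff: float = CONTACT_CUTOFF,
) -> List[int]:
    """Return indices of rays whose ligand and protein surfaces are in contact.

    A ray is selected if the protein surface is reached no more than
    `cutoff` Å after the ray leaves the ligand surface. Overlapping
    surfaces give a negative gap and are therefore selected as well.
    """
    selected = []
    for i, (lig, prot) in enumerate(zip(ligand_exit, protein_hit)):
        if prot is None:
            continue  # ray points towards bulk and never reaches the protein
        if prot - lig <= cutoff:
            selected.append(i)
    return selected


def is_usable_contact_descriptor(selected: Sequence[int],
                                 min_rays: int = MIN_CONTACT_RAYS) -> bool:
    """Exclude descriptors with too few contact rays to be significant."""
    return len(selected) >= min_rays
```

Only the selected rays would then take part in the similarity comparison of bulk rays.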

Inclusion queries

Fig. 3

Simplified depiction of the selection of bulk rays for inclusion queries. Rays are adapted and selected depending on the distance to the protein surface (blue): Rays are selected if they reach the protein surface. The length of the ray is set either to the distance of the ligand surface (a) or of the protein surface (b), depending on which one is further (green rays). Not selected are rays which never reach the binding site (c) or are at maximum length before doing so (d) (red rays)

If the target of a query structure of an LBVS campaign is known, the steric constraints implied by the binding site can in some cases be of more interest than the actual shape of just one binding ligand. Therefore, an inclusion query aims at finding matches that would fit into the same area of the binding site rather than being of roughly the same shape as the query ligand. Starting from a protein–ligand complex, the lengths of all rays of each query descriptor are extended to the distance where they would enter the binding site molecular surface, if this is within the maximum ray length (see Fig. 3). An exception is made if the van der Waals radii of the atoms overlap and the ray would therefore enter the binding site surface before leaving the molecular surface. In those cases, the ray length remains the distance where the ray leaves the molecular surface. Subsequently, the query to the index is changed to match descriptors with shorter rays than the query descriptor. Again, insignificant descriptors are excluded during screening. Since the binding site generally surrounds most parts of a molecule, only descriptors with 40 or more remaining rays are used. This mode will be referred to as mRAISE_inclusion in the following.
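The construction of an inclusion query can be sketched in the same simplified ray representation. This is a hedged illustration, not the mRAISE code: the per-ray inputs, the function names, and the `max_length` and `tol` parameters are hypothetical, since the text gives neither the maximum ray length nor a tolerance for this comparison. The threshold of 40 remaining rays is taken from the text.

```python
from typing import Dict, Optional, Sequence

MIN_INCLUSION_RAYS = 40  # descriptors with fewer remaining rays are excluded (from the text)


def build_inclusion_rays(
    ligand_exit: Sequence[float],
    protein_hit: Sequence[Optional[float]],
    max_length: float,
) -> Dict[int, float]:
    """Map ray index to its adapted length for an inclusion query.

    Rays that never reach the binding site surface, or reach it only
    beyond `max_length`, are dropped. Otherwise the ray is extended to
    the protein surface, unless the surfaces overlap, in which case the
    ligand surface distance is kept (the larger of the two values).
    """
    rays = {}
    for i, (lig, prot) in enumerate(zip(ligand_exit, protein_hit)):
        if prot is None or prot > max_length:
            continue
        rays[i] = max(lig, prot)
    return rays


def is_usable_inclusion_descriptor(rays: Dict[int, float],
                                   min_rays: int = MIN_INCLUSION_RAYS) -> bool:
    """Keep only descriptors with enough remaining rays."""
    return len(rays) >= min_rays


def candidate_fits(query_rays: Dict[int, float],
                   cand_rays: Sequence[float],
                   tol: float = 0.0) -> bool:
    """A candidate fits if none of its rays exceeds the query ray (plus tol)."""
    return all(cand_rays[i] <= limit + tol for i, limit in query_rays.items())
```

In contrast to the similarity modes, the comparison is one-sided: a candidate that is much smaller than the query still satisfies the constraint.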

Manual atom selection

Fig. 4

Simplified depiction of the selection of bulk rays for manual selection queries. Rays are selected if they end within a selected atom sphere (a) (green rays). Not selected are all other rays (b) (red rays)

An important benefit of LBVS is its usability in cases without available high quality protein structures. Nevertheless, even in those cases a user might have experience-based perceptions concerning the regions of a molecule that are important for its activity. These might result from, for example, a structural overlay of multiple compounds active against the same target. Therefore, a fourth established method to define queries with mRAISE is the manual selection of ligand atoms via a graphical user interface. In this case, only those bulk rays that would pierce through the molecular surface corresponding to those atoms, will be matched during the screening procedure. In other words, a ray is only used if it ends within the van der Waals radius of a selected atom and is shorter than the defined maximal length (see Fig. 4). For the purpose of exhaustive screening runs, the constraints derived from a manual selection can be saved to SDF files alongside the query molecule and then be used in the command-line version of mRAISE.

This will not guarantee hits containing the selected substructure, but it will more likely provide structural alignments showing roughly the same shape as the query molecule in the selected areas. As in the contact queries, descriptors with fewer than five selected rays are excluded during screening. For the remainder of this work, this mode will be referred to as mRAISE_selection.
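The geometric test behind the manual selection can be sketched as follows. It is a simplified illustration assuming that each ray is given by a unit direction and its length from the triangle center, and that each selected atom is given by its center and van der Waals radius. The data layout, the function name, and the `max_length` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def select_rays_by_atoms(
    origin: Point,
    directions: Sequence[Point],
    lengths: Sequence[float],
    selected_atoms: Sequence[Tuple[Point, float]],
    max_length: float,
) -> List[int]:
    """Return indices of rays that end within a selected atom sphere.

    A ray is used only if it is shorter than `max_length` and its end
    point lies within the van der Waals radius of at least one selected
    atom. In practice a small numerical margin may be needed, because a
    ray ending exactly on the atom sphere sits at the boundary.
    """
    selected = []
    for i, (direction, length) in enumerate(zip(directions, lengths)):
        if length >= max_length:
            continue
        end = tuple(o + length * d for o, d in zip(origin, direction))
        if any(math.dist(end, center) <= radius for center, radius in selected_atoms):
            selected.append(i)
    return selected
```

As with contact queries, the resulting index list would be checked against the minimum of five rays before the descriptor is used.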

Validation

To analyze the influence of the introduced methods on the results of virtual screening with mRAISE [4], the experiments of the previous publication were repeated with mRAISE_contact and mRAISE_inclusion. Screening runs were performed on the DUD as well as on its extended version (DUD-E). Additionally, a dataset to validate the quality of structural alignments introduced in the previous publication (mRAISE dataset) has been used.

Since the manual selection of atoms is highly dependent on the knowledge of the user and there is no obvious correct selection, the performance of mRAISE_selection is only shown for a few special targets as a proof of concept.

Datasets

DUD: The DUD dataset [19] was originally developed for the validation of docking methods and consists of 2,950 actives and specifically selected decoys for 40 different targets. For each active molecule in the dataset, approximately 36 decoys have been selected based on similar physicochemical properties like molecular weight, the number of hydrogen-bond acceptors and donors, the logP value, and the number of rotatable bonds, but with dissimilar topology. The data has been downloaded from http://dud.docking.org/.

DUD-E: The DUD-E dataset [20] is the enhanced version of the original DUD, containing 22,886 active compounds for 102 different targets with an average of 224 ligands per target and 50 decoys for each active. The data has been downloaded from http://dude.docking.org/.

mRAISE dataset: The mRAISE dataset [4] has been introduced to validate the quality of calculated molecular alignments. It contains 182 ligands for 11 diverse targets, which have been prealigned based on their mutual binding to identical binding sites using SIENA [22] with a maximal allowed backbone RMSD of 0.5 Å. All structures were taken from a high resolution subset of the PDB [10] and filtered to match certain molecular properties using MONA [23]. Those properties included the number of heavy atoms and rotational bonds of the molecule as well as the number of actually overlapping atoms in the alignment. The data has been downloaded from http://www.zbh.uni-hamburg.de/mraise-dataset.html.

Enrichment validation

Retrospective studies on the DUD and DUD-E datasets rate the ability to separate active molecules from a set of decoys. Common evaluation and comparison metrics used for this purpose are the area under the ROC curve (AUC), as well as the enrichment factor (EF) and the hitrate (HR) at a certain percentage of ranked hits.
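For reference, the AUC of a ROC curve equals the probability that a randomly chosen active is scored higher than a randomly chosen decoy, with ties counted as one half. The following minimal sketch computes it by direct pairwise comparison. It is a generic illustration rather than the evaluation code used in this work, and the function name is hypothetical.

```python
from typing import Sequence


def roc_auc(active_scores: Sequence[float], decoy_scores: Sequence[float]) -> float:
    """Area under the ROC curve from similarity scores (higher is better).

    Counts, over all active/decoy pairs, how often the active is ranked
    above the decoy; ties contribute 0.5. This is quadratic in the number
    of molecules and intended for illustration only.
    """
    if not active_scores or not decoy_scores:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))
```

An AUC of 0.5 corresponds to random selection, which is the baseline used when individual targets are discussed below.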

The enrichment factor is the ratio of true positives among a certain percentage of ranked hits divided by the rate of actives present in the whole dataset (see Eq. 1).
$$\begin{aligned} EF_{X\%} = \frac{TP_{X\%} / Hits_{X\%}}{N_{Actives} / N_{Total}} \end{aligned}$$
(1)
The hitrate makes the EF easier to interpret by dividing the EF at a certain percentage by the best possible value (idealEF) at this amount of observed data. The idealEF assumes that as many actives as possible have been discovered at the respective percentage of screened data (see Eq. 2).
$$\begin{aligned} HR_{X\%} = \frac{actual EF_{X\%}}{ideal EF_{X\%}} \end{aligned}$$
(2)
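Equations 1 and 2 can be illustrated with a short sketch operating on a ranked list of labels (1 for an active, 0 for a decoy, best-scored molecule first). The function names are hypothetical, and rounding the number of considered hits up to the next integer is an assumption, since the text does not specify it. The hitrate is returned as a fraction here, whereas the tables appear to report it in percent.

```python
import math
from typing import Sequence


def top_count(n_total: int, percent: float) -> int:
    """Number of ranked hits considered at X percent (rounded up, at least one)."""
    return max(1, math.ceil(n_total * percent / 100.0))


def enrichment_factor(labels_ranked: Sequence[int], percent: float) -> float:
    """EF_X% = (TP_X% / Hits_X%) / (N_Actives / N_Total), see Eq. 1."""
    n_total = len(labels_ranked)
    n_actives = sum(labels_ranked)
    if n_actives == 0:
        raise ValueError("the ranked list contains no actives")
    hits = top_count(n_total, percent)
    true_positives = sum(labels_ranked[:hits])
    return (true_positives / hits) / (n_actives / n_total)


def ideal_enrichment_factor(n_actives: int, n_total: int, percent: float) -> float:
    """Best possible EF: all available actives are ranked first."""
    hits = top_count(n_total, percent)
    return (min(n_actives, hits) / hits) / (n_actives / n_total)


def hitrate(labels_ranked: Sequence[int], percent: float) -> float:
    """HR_X% = actual EF_X% / ideal EF_X%, see Eq. 2 (as a fraction)."""
    ideal = ideal_enrichment_factor(sum(labels_ranked), len(labels_ranked), percent)
    return enrichment_factor(labels_ranked, percent) / ideal
```

For example, with 10 actives among 100 molecules and three actives within the ten best ranked hits, the EF at ten percent is 3, the idealEF is 10, and the hitrate is 0.3.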
In preparation for the screening runs, indices have been created for each target with up to 250 calculated conformations for each molecule. As in the previous work, the included duplicates of the same molecule like protomers and tautomers in the DUD and DUD-E datasets have been removed for the evaluation and molecules providing no matching descriptor are rated with a similarity score of zero.

Alignment validation

For the validation of the alignment quality, all pairs of molecules of the same ensemble in the mRAISE dataset are compared with each other. After the comparison of two molecules M and N, the calculated pose \(p'\) of M is compared to the pose \(p''\) of M taken from the reference alignment by calculating the RMSD between both poses. As in the previous publication, the calculation of the RMSD is restricted to atoms of molecule M showing a maximum distance of 2.0 Å to any atom of N in the reference alignment. This excludes flexible regions of the molecules, which are not aligned in the binding site and therefore are not part of a conserved binding mode. In the following, this special RMSD is referred to as RMSD-O. During the screening runs, the query conformation is taken from the input file and is then compared to up to 250 generated conformations of a target molecule. To sum up the performance of each method, average and median RMSD-O values for each ensemble are used. It should be noted that comparisons with no matching descriptor yield no RMSD-O value and therefore cannot be represented in the average value. For the median value, on the other hand, those pairs are handled as if they had an infinite RMSD-O.
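The RMSD-O and the handling of pairs without any match can be sketched as follows. This is a minimal illustration assuming that both poses of M list their atoms in the same order and that coordinates are plain tuples; the function names are hypothetical. The 2.0 Å cutoff and the treatment of missing values follow the description above.

```python
import math
import statistics
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd_o(calc_m: Sequence[Point], ref_m: Sequence[Point],
           ref_n: Sequence[Point], cutoff: float = 2.0) -> float:
    """RMSD between calculated and reference pose of M over overlapping atoms.

    Only atoms of M that lie within `cutoff` Å of any atom of N in the
    reference alignment contribute to the value.
    """
    overlap = [i for i, atom in enumerate(ref_m)
               if any(math.dist(atom, other) <= cutoff for other in ref_n)]
    if not overlap:
        raise ValueError("no atom of M overlaps with N in the reference alignment")
    squared = sum(math.dist(calc_m[i], ref_m[i]) ** 2 for i in overlap)
    return math.sqrt(squared / len(overlap))


def summarize(values: Sequence[Optional[float]]) -> Tuple[float, float]:
    """Average and median RMSD-O of an ensemble.

    None marks a pair without any matching descriptor. Such pairs are
    left out of the average but count as infinite in the median.
    """
    found: List[float] = [v for v in values if v is not None]
    average = sum(found) / len(found) if found else math.nan
    median = statistics.median([math.inf if v is None else v for v in values])
    return average, median
```

Treating missing values as infinite means that the median degrades when many pairs find no match, whereas the average only reflects the pairs that were actually aligned.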

Results and discussion

Enrichment

Table 1

Average and median AUC values of the ROC curves on the DUD dataset

Mode | Avg. AUC | Median AUC
mRAISE_classic | 0.76 ± 0.19 | 0.84
mRAISE_contact | 0.70 ± 0.15 | 0.73
mRAISE_inclusion | 0.76 ± 0.19 | 0.78

Values with standard deviation

All targets of the DUD and the DUD-E datasets have been screened using mRAISE_inclusion and mRAISE_contact for automated partial shape constraints as described above and are compared to the performance of mRAISE_classic.

Table 1 shows a comparison of the average and median AUC values of all 40 DUD targets for each method together with the respective standard deviations. The inclusion mode of mRAISE shows the same average performance as the classic mode with an AUC of 0.76 ± 0.19, with a lower median AUC of 0.78 compared to the 0.84 of mRAISE_classic. The contact mode, however, shows a slightly decreased overall performance with an average AUC of 0.70 ± 0.15 and a median AUC of 0.73. Looking further into the detailed performances for each target of the DUD (see Fig. 5 and supporting information Table SI.1) shows that, although the classic version of mRAISE has the highest value either alone or shared with another mode for 23 of the 40 targets, there are nevertheless a couple of cases where the partial shape constraints show a notable improvement.
Fig. 5

Performance of the different methods on the DUD dataset

The most apparent improvement is for the Progesterone receptor, which shows an increase of the AUC by 0.19 using mRAISE_inclusion. Further targets showing an improvement using this method are Factor Xa (+0.07), Thrombin (+0.07), Cyclooxygenase 1 (+0.06), Glucocorticoid receptor (+0.05), and Hydroxymethylglutaryl-CoA reductase (+0.05). Furthermore, the mRAISE_contact method shows superior performance on some targets with an improved AUC for HIV protease (+0.1), PDGF receptor (+0.08), and Tyrosine kinase SRC (+0.07).

The four targets showing results worse than random selection using mRAISE_classic (P38 MAP kinase, PDGF receptor, Tyrosine kinase SRC and VEGF receptor) show a similar performance using complex-derived partial shape constraints. Nevertheless, as shown in the previous publication [4], these targets show an equally low performance in other LBVS methods, indicating that molecular similarity is not suited to separate actives from decoys in these cases.
Table 2

Average enrichment factor on the DUD dataset at one, five, and ten percent of ranked hits

Mode | \(EF_{1\%}\) | \(EF_{5\%}\) | \(EF_{10\%}\)
mRAISE_classic | 20.2 ± 12.1 | 9.4 ± 6.0 | 5.4 ± 3.0
mRAISE_contact | 19.3 ± 12.1 | 8.5 ± 5.4 | 4.8 ± 2.7
mRAISE_inclusion | 19.9 ± 12.3 | 9.5 ± 6.0 | 5.5 ± 3.1

Values with standard deviation

Table 3

Average hitrate on the DUD dataset at one, five, and ten percent of ranked hits

Mode | \(HR_{1\%}\) | \(HR_{5\%}\) | \(HR_{10\%}\)
mRAISE_classic | 55.5 ± 33.3 | 46.7 ± 30.0 | 53.9 ± 30.6
mRAISE_contact | 53.0 ± 33.2 | 42.1 ± 26.5 | 48.0 ± 27.2
mRAISE_inclusion | 54.6 ± 33.9 | 47.3 ± 30.0 | 54.6 ± 30.5

Values with standard deviation

Similar observations can be made looking at the average EF (see Table 2) and the average HR (see Table 3) for the DUD dataset at one, five and ten percent of ranked hits. For the early enrichment at one percent, mRAISE_classic performs best with an average EF of 20.2 ± 12.1, followed by mRAISE_inclusion (19.9 ± 12.3) and mRAISE_contact (19.3 ± 12.1). Interestingly, however, the inclusion mode performs slightly better than the classic mode at the later enrichment stages.

The HR further illustrates this difference: while the classic method achieves an average HR of 46.7 ± 30.0 and 53.9 ± 30.6 at five and ten percent of ranked hits, the inclusion method achieves values of 47.3 ± 30.0 and 54.6 ± 30.5, respectively. Again, it can be seen that the partial shape constraints can improve the screening performance for specific targets on the DUD. The early enrichment at one percent (see supporting information Table SI.2), which is usually of the highest interest, shows that for ten of 40 targets either mRAISE_contact or mRAISE_inclusion outperforms the classic version. In five of these cases the HR increases by more than ten percent.
Table 4

Average and median AUC values of the ROC curves on the DUD-E dataset

Mode | Avg. AUC | Median AUC
mRAISE_classic | 0.74 ± 0.15 | 0.73
mRAISE_contact | 0.72 ± 0.16 | 0.76
mRAISE_inclusion | 0.72 ± 0.16 | 0.75

Values with standard deviation

In the previous publication it has already been discussed that despite the comparably rare usage, the DUD-E dataset is actually better suited for validation studies on VS methods than the older DUD. This is due to the fact that the DUD-E covers more targets and at the same time includes more molecules per target and an even higher ratio of included decoys for each active. Furthermore the DUD-E got rid of some property biases that made the separation between actives and decoys easier if exploited in some cases. Detailed performances on all 102 targets using mRAISE_contact and mRAISE_inclusion can be found in the supporting information in Tables SI.3 to SI.5 and Tables SI.6 to SI.8.

A comparison to mRAISE_classic using the average results can be found in Table 4 for the AUC, Table 5 for the EF and Table 6 for the HR.
Table 5

Average enrichment factor on the DUD-E dataset at one, five, and ten percent of ranked hits

Mode | \(EF_{1\%}\) | \(EF_{5\%}\) | \(EF_{10\%}\)
mRAISE_classic | 23.45 ± 17.00 | 7.78 ± 4.92 | 4.69 ± 2.50
mRAISE_contact | 22.67 ± 17.10 | 7.37 ± 4.96 | 4.46 ± 2.53
mRAISE_inclusion | 22.76 ± 17.04 | 7.45 ± 4.99 | 4.45 ± 2.51

Values with standard deviation

Table 6

Average hitrate on the DUD-E dataset at one, five, and ten percent of ranked hits

Mode | \(HR_{1\%}\) | \(HR_{5\%}\) | \(HR_{10\%}\)
mRAISE_classic | 37.95 ± 26.36 | 38.94 ± 24.44 | 46.98 ± 24.78
mRAISE_contact | 36.79 ± 27.04 | 36.96 ± 24.60 | 44.66 ± 25.09
mRAISE_inclusion | 37.12 ± 27.14 | 37.37 ± 24.73 | 44.60 ± 24.93

Values with standard deviation

As can be seen, both new modes achieve an average AUC of 0.72 ± 0.16 which is almost the same as the AUC of 0.74 ± 0.15 of mRAISE_classic. However, the median AUC values of mRAISE_contact (0.76) and mRAISE_inclusion (0.75) are slightly higher than the value of mRAISE_classic (0.73).

The values for the average EF and HR at one, five, and ten percent of ranked hits show the same trend as the average AUC with mRAISE_classic slightly exceeding the performance of mRAISE_contact and mRAISE_inclusion on average. While the overall performance of all modes is therefore comparable, individual cases highlight the strength of each mode and the benefit of the derived constraints for virtual screening.

Table 7 shows a list of targets that had an increased AUC of 0.05 or more using one of the two new modes. If the AUC of a target increases for both modes, only the best performance is listed in the table if the performances differ. In case of the HIV protease, both modes achieve the same AUC value.
Table 7

List of targets of the DUD-E which show an increased AUC using mRAISE_contact or mRAISE_inclusion in comparison to mRAISE_classic

Inclusion: target | Inclusion: increased AUC | Contact: target | Contact: increased AUC
bace1 | +0.16 | pa2ga | +0.16
ppard | +0.12 | fkb1a | +0.15
xiap | +0.10 | mcr | +0.11
fa10 | +0.09 | fak1 | +0.10
pyrd | +0.08 | fnta | +0.09
mapk2 | +0.07 | jak2 | +0.08
mk10 | +0.06 | gcr | +0.06
dhi1 | +0.05 | pde5a | +0.05
dpp4 | +0.05 | pgh1 | +0.05
hdac2 | +0.05 | |
cdk2 | +0.05 | |
hivpr | +0.13 | hivpr | +0.13

The increased AUC is calculated with respect to the results of mRAISE_classic. The AUC of hivpr increases by the same amount for both modes

As can be seen, for the DUD as well as the DUD-E there are many targets which profit from the new complex-derived constraints, while the average and median performance is similar to mRAISE_classic. The comparable performance of the three methods can to some degree be expected, since the diversity of the included compounds per target in DUD and DUD-E is limited. Nevertheless, it is of high interest to investigate the cases in which partial shape could enhance the performance of the LBVS. On the smaller DUD dataset, no clear relation between molecular properties and a change of performance could be found. The 102 targets of the DUD-E, however, showed an interesting connection between the average number of rotatable bonds present in the active molecules and the benefit of partial shape constraints. The most apparent improvements, increasing the AUC by 0.1 or more using both mRAISE_contact and mRAISE_inclusion, all occurred on highly flexible molecule classes which had an average of eight or more rotatable bonds present in the active compounds. The only exception here are the ligands of the Mineralocorticoid receptor (mcr), which show an AUC increased by 0.11 using mRAISE_contact and only have an average of 4.4 rotatable bonds among the actives. To further highlight this observation, the percentages of targets with an improved or equal performance compared to mRAISE_classic with at least eight, nine, or ten average rotatable bonds among their actives can be seen in Table 8.

Alignment

In order to estimate the influence of the structure-driven shape constraints to the alignment quality, all pairs of molecules belonging to the same ensemble have been mutually superimposed. For the validation of molecular alignments calculated with LBVS or SBVS methods, an RMSD of less than 2.0 Å to a biological reference pose is generally considered as a good result. Tables 9 and 10 show a direct comparison of the average and median alignment quality for each mode considering only the top-scored superposition and the best RMSD-O within the top-ten ranked superpositions respectively. Figure 2 in the supporting information shows an example alignment for each ensemble of the mRAISE dataset with a top scored conformation.
Table 8

Percentage of DUD-E targets with equal or improved performance compared to mRAISE_classic and a certain number of rotatable bonds

Average number of rotational bonds | Number of targets | Inclusion (%) | Contact (%)
≥ 8 | 21 | 52.4 | 52.4
≥ 9 | 11 | 72.7 | 63.6
≥ 10 | 4 | 100.0 | 100.0

Average numbers calculated using the actives for each target present in the DUD dataset

mRAISE_classic achieves an average RMSD-O of less than 2.0 Å for four ensembles and a median RMSD-O of less than 2.0 Å for eight ensembles within the ten best ranked hits. In comparison, the contact and inclusion modes achieve median RMSD-O values of less than 2.0 Å for nine of the eleven ensembles. Looking at the average RMSD-O values, mRAISE_inclusion achieves a value smaller than 2.0 Å for seven cases and mRAISE_contact succeeds in six cases. This alignment quality can, to a certain degree, already be observed only considering the best scored conformation (Top 1). Looking at the average values, mRAISE_classic achieved an RMSD-O of less than 2.0 Å in four cases while the new modes are only able to achieve this in three cases. However, looking at the less outlier-dependent median values, mRAISE_classic shows only five ensembles with a median RMSD-O of less than 2.0 Å while the other modes achieve this in seven cases.

Table 9

Comparison of the different methods on the mRAISE dataset

Target | Classic avg. | Classic median | Contact avg. | Contact median | Inclusion avg. | Inclusion median
Serine protease | 1.64 | 1.06 | 1.80 | 1.06 | 1.63 | 1.06
Thrombin | 2.95 | 2.43 | 2.93 | 1.99 | 3.20 | 2.20
ALPHA-MANNOSIDASE II | 1.96 | 1.77 | 2.08 | 1.40 | 2.06 | 1.45
Matrix metalloproteinase-12 (MMP-12) | 3.17 | 2.12 | 3.04 | 2.06 | 2.88 | 1.89
CDK 2 Kinase | 2.83 | 1.98 | 2.50 | 1.81 | 2.44 | 1.93
Carbonic Anhydrase II | 1.70 | 1.53 | 1.71 | 1.54 | 1.69 | 1.53
Thermolysin | 3.19 | 2.16 | 2.18 | 1.57 | 2.07 | 1.47
CYP121 | 3.94 | 4.87 | 3.51 | 4.38 | 3.55 | 4.27
HIV Protease | 2.93 | 2.56 | 2.26 | 2.16 | 2.51 | 2.44
Bromodomain-containing protein 4 | 3.62 | 4.98 | 4.50 | 5.81 | 3.54 | 3.24
Isopenicillin N Synthase | 1.74 | 1.63 | 1.74 | 1.63 | 1.57 | 1.47

RMSD values smaller than 2.0 Å highlighted in bold. The shown results are the average (left) and median (right) RMSD-O value considering the best ranked conformations

Difficulties of the different ensembles while screening with mRAISE have already been discussed in our previous publication. Therefore, only the targets showing a significantly different performance using the new modes will be discussed in detail in the following section.
Table 10

Comparison of the different methods on the mRAISE dataset

Target | Classic avg. | Classic median | Contact avg. | Contact median | Inclusion avg. | Inclusion median
Serine protease | 1.40 | 0.95 | 1.55 | 0.94 | 1.42 | 0.95
Thrombin | 2.28 | 1.72 | 2.14 | 1.57 | 2.26 | 1.69
ALPHA-MANNOSIDASE II | 1.38 | 0.90 | 1.46 | 0.82 | 1.52 | 0.88
Matrix metalloproteinase-12 (MMP-12) | 2.74 | 1.95 | 2.52 | 1.73 | 2.34 | 1.55
CDK 2 Kinase | 2.26 | 1.28 | 1.87 | 1.26 | 1.91 | 1.45
Carbonic Anhydrase II | 1.23 | 1.19 | 1.29 | 1.21 | 1.27 | 1.19
Thermolysin | 2.67 | 1.74 | 1.83 | 1.52 | 1.76 | 1.42
CYP121 | 2.85 | 3.35 | 2.77 | 3.30 | 2.83 | 3.69
HIV Protease | 2.70 | 2.33 | 1.93 | 1.78 | 2.09 | 1.89
Bromodomain-containing protein 4 (BRD4) | 3.16 | 4.55 | 3.85 | 4.84 | 2.75 | 2.40
Isopenicillin N Synthase | 1.53 | 1.51 | 1.49 | 1.46 | 1.39 | 1.34

RMSD values smaller than 2.0 Å highlighted in bold. The shown results are the average (left) and median (right) RMSD-O value considering the best value of the ten top-ranked conformations only

Screening the ligands of Matrix metalloproteinase-12 with mRAISE_classic shows a good median RMSD-O value of 2.12 Å at first rank and 1.95 Å within the top ten conformations, but at the same time leads to a relatively high average RMSD-O of 3.17 and 2.74 Å, respectively (values as listed in Tables 9 and 10). This seems to be due to the ligands of the protein structures 2HU6, 4GR0, and 4GR8 that lack a sulfonamide group, which is present in all other ligands and is a good anchoring point for the alignments of those molecules (see Fig. 6). mRAISE_inclusion is able to reduce the average RMSD-O within the ten best ranked hits by 0.4 Å and achieves slightly better values for the three problematic ligands.
Fig. 6

Picture of the aligned active ligands of the Matrix metalloproteinase-12 with (a) and without (b) the common sulfonamide group

Looking at the Thermolysin ligands, the main reason for the average RMSD-O of 2.67 Å using mRAISE_classic is the comparably small ligand of the protein structure 3QGO. It consists of only 13 heavy atoms, which is half the number of the next smallest molecule. An overview of all ligands of the ensemble can be seen in the supporting information Figure 1. It has been shown in the previous publication [4] that the phenyl ring of the molecule can be superimposed onto a corresponding ring contained in most of the other molecules. This superposition is highly favorable in terms of shape overlap, but it contradicts the actual binding mode of the molecule and would not place it in the binding site at all. mRAISE_classic therefore tends to misplace the ligand, and the alignments result in RMSD-O values greater than 10 Å at first rank in six of the nine superpositions involving this ligand. However, using the partial shape constraints derived from the binding site greatly improves the placement of this ligand, as can be seen in the lower average RMSD-O values of 1.83 Å using mRAISE_contact and 1.76 Å using mRAISE_inclusion. For all six cases with an RMSD-O greater than 10 Å at first rank for mRAISE_classic, mRAISE_inclusion achieves RMSD-O values of less than 2.71 Å already at first rank. mRAISE_contact likewise achieves values of less than 2.71 Å for five of the six cases, with only one outlier showing an RMSD-O of 4.64 Å.
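The gap between average and median RMSD-O discussed above follows from how the two statistics react to a few misplaced alignments. The following minimal Python sketch illustrates this effect. The ten RMSD-O values are invented for illustration only and are not mRAISE output; the variable names are likewise assumptions.

```python
from statistics import mean, median

# Hypothetical RMSD-O values in Å (illustrative only, not taken from the study):
# eight reasonable alignments and two badly misplaced ligands.
rmsd_o = [1.2, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 10.5, 11.4]

avg_all = mean(rmsd_o)        # pulled upwards by the two outliers
med_all = median(rmsd_o)      # barely affected by the outliers

# The same statistics after removing the two outliers.
inliers = [value for value in rmsd_o if value < 10.0]
avg_inliers = mean(inliers)
med_inliers = median(inliers)

print(f"all values: average = {avg_all:.2f} Å, median = {med_all:.2f} Å")
print(f"inliers:    average = {avg_inliers:.2f} Å, median = {med_inliers:.2f} Å")
```

In this toy example the two outliers more than double the average, while the median moves by only a tenth of an Ångström. This is why both values are reported side by side for each ensemble.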

The ligands of the HIV protease are another example of improved alignment quality using binding site derived partial shape constraints. The difficulty of this ensemble is the high flexibility of its ligands, with nine of ten molecules having 12 or more rotatable bonds. The smaller median and average RMSD-O values of mRAISE_inclusion and especially mRAISE_contact show that the dependency on the conformational quality is reduced using the contact shape constraints. This is because mRAISE_contact only requires shape similarity for special shape features and simultaneously allows more flexibility in other regions. The improved handling of the HIV protease ligands could already be seen in the DUD and DUD-E studies (see supporting information Tables SI.1 and SI.4). The experiments on the DUD-E also suggest a generally improved handling of highly flexible compound classes with partial shape constraints.

While none of the modes achieves an average or median RMSD-O of less than 2.0 Å for the ligands of Bromodomain-containing protein 4, mRAISE_inclusion nevertheless reduces the median value by 2.15 Å down to 2.40 Å. The ligands in this ensemble are only loosely fixed by a single hydrogen bond in a rather narrow binding site, and a substantial part of each ligand points in a different direction (see Fig. 7). Therefore, the overall shape similarity of the molecules is not as high as one would generally expect. For mRAISE_classic this actually led to 16 pairwise comparisons without a single matching descriptor. Using mRAISE_inclusion not only lowers the average RMSD-O for all comparisons but also finds matches for 14 of the 16 problematic cases, which explains the significant change in the median value.
Fig. 7

Picture of the Bromodomain-containing protein 4 binding site together with the ligand ensemble

Table 11

Percentage of pairs with an RMSD-O smaller than a certain threshold

Method             Percentage ≤ 2.5 Å   Percentage ≤ 2.0 Å   Percentage ≤ 1.5 Å
mRAISE_classic     87.5                 80.8                 62.9
mRAISE_contact     86.8                 80.2                 62.6
mRAISE_inclusion   87.8*                81.1*                65.7*

Highest values are marked with an asterisk. Combined results of all ensembles

Fig. 8

Manual selection of atoms for the query ligands of five targets of the DUD dataset. The van der Waals radius of selected atoms is highlighted in yellow

Another way to look at the results is to calculate the overall percentage of pairwise comparisons achieving a certain RMSD-O value. Table 11 shows the percentage of comparisons achieving RMSD-O values of less than 2.5, 2.0 and 1.5 Å for the respective modes. It can be seen that mRAISE_inclusion performs slightly better than all other modes, with 87.8% of comparisons showing an RMSD-O of less than 2.5 Å, 81.1% showing an RMSD-O of less than 2.0 Å and 65.7% showing an RMSD-O of less than 1.5 Å. While mRAISE_contact shows slightly inferior values to the original method here (see Table 11), the higher count of ensembles achieving median and average RMSD-O values of less than 2.0 Å (see Tables 9 and 10) shows that this method produces fewer outliers with high RMSD-O values. The performed experiments highlight the fact that there is no single method providing the best alignments for all ensembles. Most of the time one has to try multiple methods or use a combination of them in order to get the best possible performance. mRAISE now offers multiple approaches that make use of all available information in order to improve the results in a given scenario.
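The ensemble counts referred to above can be recomputed directly from the values in Table 10. The following Python sketch does this for the top-ten results; the dictionary layout and the helper name `count_below` are assumptions made for this illustration and are not part of mRAISE.

```python
# (average, median) RMSD-O in Å per ensemble and mode, copied from Table 10.
TABLE_10 = {
    "Serine protease":          {"classic": (1.40, 0.95), "contact": (1.55, 0.94), "inclusion": (1.42, 0.95)},
    "Thrombin":                 {"classic": (2.28, 1.72), "contact": (2.14, 1.57), "inclusion": (2.26, 1.69)},
    "Alpha-mannosidase II":     {"classic": (1.38, 0.90), "contact": (1.46, 0.82), "inclusion": (1.52, 0.88)},
    "MMP-12":                   {"classic": (2.74, 1.95), "contact": (2.52, 1.73), "inclusion": (2.34, 1.55)},
    "CDK 2 Kinase":             {"classic": (2.26, 1.28), "contact": (1.87, 1.26), "inclusion": (1.91, 1.45)},
    "Carbonic Anhydrase II":    {"classic": (1.23, 1.19), "contact": (1.29, 1.21), "inclusion": (1.27, 1.19)},
    "Thermolysin":              {"classic": (2.67, 1.74), "contact": (1.83, 1.52), "inclusion": (1.76, 1.42)},
    "CYP121":                   {"classic": (2.85, 3.35), "contact": (2.77, 3.30), "inclusion": (2.83, 3.69)},
    "HIV Protease":             {"classic": (2.70, 2.33), "contact": (1.93, 1.78), "inclusion": (2.09, 1.89)},
    "BRD4":                     {"classic": (3.16, 4.55), "contact": (3.85, 4.84), "inclusion": (2.75, 2.40)},
    "Isopenicillin N Synthase": {"classic": (1.53, 1.51), "contact": (1.49, 1.46), "inclusion": (1.39, 1.34)},
}

AVERAGE, MEDIAN = 0, 1  # positions inside each (average, median) tuple


def count_below(mode, statistic, threshold=2.0):
    """Count the ensembles whose chosen statistic is below the threshold (in Å)."""
    return sum(1 for row in TABLE_10.values() if row[mode][statistic] < threshold)


counts = {
    mode: (count_below(mode, AVERAGE), count_below(mode, MEDIAN))
    for mode in ("classic", "contact", "inclusion")
}

for mode, (n_avg, n_med) in counts.items():
    print(f"mRAISE_{mode}: average < 2.0 Å for {n_avg}, median < 2.0 Å for {n_med} of {len(TABLE_10)} ensembles")
```

Counting per ensemble in this way complements the pairwise percentages of Table 11: a mode can have slightly fewer pairs below a threshold overall and still bring more ensembles below 2.0 Å.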

Manual selection

Table 12

Overview of the five selected DUD targets for the mRAISE_selection experiment

Target     Avg number of        Avg number of     AUC (mRAISE_classic)
           rotatable bonds      heavy atoms
hivpr      9.3                  37.6              0.58
thrombin   7.1                  32.7              0.61
fxa        6.8                  32.5              0.64
pde5       5.7                  31.2              0.57
fgfr1      5.4                  29.7              0.53

Average numbers calculated using the actives for each target present in the DUD dataset

The manual selection of atoms to define partial shape constraints is a very promising but also difficult endeavor. Since an ideal selection of important regions of a molecule requires good knowledge of the target or the compound class one is interested in, an objective validation of this method is not trivial. In the following we will show some examples for manual partial shape constraints based on a selection of atoms to guide the screening procedure.

We selected five targets from the DUD dataset which were considered the most difficult and additionally had an AUC of less than 0.7 using mRAISE_classic. We defined difficulty by the average number of heavy atoms and rotatable bonds within the sets of actives. The larger and more flexible the members of a molecule class are, the harder it is to find optimal solutions using LBVS. Sorting the list of DUD targets by the average number of heavy atoms as well as by the number of rotatable bonds and picking the five targets with the respective highest numbers and an AUC of less than 0.7 using mRAISE_classic actually leads to the same five targets. The selected targets can be seen in Table 12. The atom selections for each query molecule used in mRAISE_selection can be seen in Fig. 8. All selections were done using information drawn from the respective PDB entries and visual inspection of the protein–ligand complexes as a guideline but without any automated calculations.
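The selection procedure described above can be sketched in a few lines of Python. The first five rows are taken from Table 12. The three `example_*` rows are hypothetical fillers invented for this illustration, standing in for the remaining DUD targets, whose values are not listed here. The `Target` type and the `hardest` helper are likewise assumptions.

```python
from typing import NamedTuple


class Target(NamedTuple):
    name: str
    rot_bonds: float    # average number of rotatable bonds of the actives
    heavy_atoms: float  # average number of heavy atoms of the actives
    auc: float          # AUC obtained with mRAISE_classic


targets = [
    # Rows from Table 12.
    Target("hivpr", 9.3, 37.6, 0.58),
    Target("thrombin", 7.1, 32.7, 0.61),
    Target("fxa", 6.8, 32.5, 0.64),
    Target("pde5", 5.7, 31.2, 0.57),
    Target("fgfr1", 5.4, 29.7, 0.53),
    # Hypothetical filler rows (invented, NOT from the DUD evaluation).
    Target("example_a", 8.0, 34.0, 0.85),  # large and flexible, but AUC not below 0.7
    Target("example_b", 3.1, 22.4, 0.62),  # low AUC, but small and rigid
    Target("example_c", 2.5, 19.8, 0.91),
]


def hardest(candidates, key, n=5, max_auc=0.7):
    """Keep targets with AUC < max_auc and return the n largest by the given key."""
    below = [t for t in candidates if t.auc < max_auc]
    ranked = sorted(below, key=key, reverse=True)
    return [t.name for t in ranked[:n]]


by_atoms = hardest(targets, key=lambda t: t.heavy_atoms)
by_bonds = hardest(targets, key=lambda t: t.rot_bonds)

print("by heavy atoms:     ", by_atoms)
print("by rotatable bonds: ", by_bonds)
print("same five targets:  ", set(by_atoms) == set(by_bonds))
```

With these inputs both sort orders return the same five targets, mirroring the observation made for the full DUD list.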
Table 13

AUC values of mRAISE with and without aid of manual selection

Target     mRAISE_classic   mRAISE_selection
hivpr      0.58             0.69
thrombin   0.61             0.70
fxa        0.64             0.73
pde5       0.57             0.68
fgfr1      0.53             0.75

Table 13 shows the performance of mRAISE_selection in comparison to mRAISE_classic. As can be seen, the manual selection of steric constraints improved the performance for all five targets.
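The size of the improvement per target can be read off Table 13 with a short calculation. The Python sketch below uses only the AUC values listed in that table; the variable names are chosen for this illustration.

```python
# AUC values from Table 13: (mRAISE_classic, mRAISE_selection).
auc = {
    "hivpr": (0.58, 0.69),
    "thrombin": (0.61, 0.70),
    "fxa": (0.64, 0.73),
    "pde5": (0.57, 0.68),
    "fgfr1": (0.53, 0.75),
}

# Absolute AUC gain per target, rounded to the two decimals given in the table.
gains = {name: round(selection - classic, 2) for name, (classic, selection) in auc.items()}

mean_gain = sum(gains.values()) / len(gains)
largest = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name:9s} +{gain:.2f}")
print(f"mean gain: {mean_gain:.3f}, largest gain: {largest}")
```

The gains are positive for all five targets, and the target with the lowest starting AUC, fgfr1, gains the most.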

These results highlight the power of knowledge-based manually selected constraints on the performance of LBVS, especially on otherwise difficult highly flexible compounds. The quality of the results hereby depends on the actual selection of atoms and therefore on the knowledge of the user. It should also be noted that a better suited selection of atoms could improve the results even further in any given case.

Conclusion

We introduced new methods to create knowledge-based partial shape constraints for virtual screening with mRAISE. These constraints can either be created by a manual selection of atoms in important regions of a molecule or be derived automatically from protein–ligand complexes if the respective information is available. Complex-based constraints either search for hits that might form the same close contacts to the protein as the query molecule, or they aim at finding molecules that have a higher chance of fitting into the binding site when superimposed onto the query molecule.

The influence of these new strategies on the screening performance as well as on the quality of molecular alignments has been evaluated and compared to the original performance of mRAISE. With an average area under the ROC curve (AUC) of 0.76 ± 0.19 for mRAISE_inclusion and 0.70 ± 0.15 for mRAISE_contact, the average performance for all targets of the DUD is comparable to the original version. Similar observations were made using the DUD-E dataset, with an average AUC of 0.72 ± 0.16 for both mRAISE_contact and mRAISE_inclusion in comparison to 0.74 ± 0.15 using mRAISE_classic. Nevertheless, for multiple individual targets, screening with partial shape constraints improved the performance.

Further, looking at the quality of calculated molecular alignments highlights the benefits of the new constraints. Here, the overall performance increased: within the top ten ranked hits, mRAISE_contact achieves an average RMSD-O of less than 2.0 Å for seven and a median RMSD-O of less than 2.0 Å for nine of the 11 ensembles, while mRAISE_inclusion achieves an average RMSD-O of less than 2.0 Å for one ensemble fewer. Furthermore, the complex-based constraints improved the alignment quality especially for difficult cases, such as the highly flexible molecules of the HIV Protease ensemble or an outlier of much smaller size than the rest of the ensemble, as in the Thermolysin ensemble.

For the non-automatic manual selection of partial shape constraints a validation has been performed using five targets of the DUD dataset. An ideal atom selection requires detailed knowledge about the compound class or the respective target. The validation therefore focused on the most flexible and hence difficult targets, which had an inferior performance using mRAISE_classic. It could be shown that a good selection of atoms and the derived partial shape constraints were able to improve the screening quality in all five cases.

Overall the partial shape constraints proved to be a viable tool to assist LBVS especially in difficult cases with highly flexible compounds. This is the case for constraints derived from additionally available data of the protein–ligand complex as well as for the knowledge-based manual constraints defined by a user.

Next steps will focus on a further validation of the manual selection mode and on an application in real screening experiments with experimental validation of the results. The validation of the manual selection mode would require experienced users and challenging scenarios. Nevertheless, there seems to be a high potential in such studies. Furthermore, it would be interesting to sequentially use combinations of the described modes on the same library. Such a combination might lead to improved rankings compared to the performance of just one method alone. Since the dependency on high quality conformations seems to be reduced using the complex-based shape constraints, it would also be interesting to see if the new modes would perform equally well with fewer generated conformations.

Supplementary material

10822_2017_11_MOESM1_ESM.pdf (10.8 mb)
Supplementary material 1 (pdf 11076 KB)

References

  1. Lavecchia A, Di Giovanni C (2013) Virtual screening strategies in drug discovery: a critical review. Curr Med Chem 20(23):2839–2860
  2. Schomburg KT, Bietz S, Briem H, Henzler AM, Urbaczek S, Rarey M (2014) Facing the challenges of structure-based target prediction by inverse virtual screening. J Chem Inf Model 54:1676–1686
  3. Henzler AM, Urbaczek S, Hilbig M, Rarey M (2014) An integrated approach to knowledge-driven structure-based virtual screening. J Comput Aided Mol Des 28:927–939
  4. von Behren MM, Bietz S, Nittinger E, Rarey M (2016) mRAISE: an alternative algorithmic approach to ligand-based virtual screening. J Comput Aided Mol Des 30:583–594
  5. Grant JA, Gallardo MA, Pickup BT (1996) A fast method of molecular shape comparison: a simple application of a Gaussian description of molecular shape. J Comput Chem 17(14):1653–1666
  6. Vainio MJ, Puranen JS, Johnson MS (2009) ShaEP: molecular overlay based on shape and electrostatic potential. J Chem Inf Model 49:492–502
  7. Vaz de Lima LA, Nascimento AS (2013) MolShaCS: a free and open source tool for ligand similarity identification based on Gaussian descriptors. Eur J Med Chem 59:296–303
  8. Roy A, Skolnick J (2015) LIGSIFT: an open-source tool for ligand structural alignment and virtual screening. Bioinformatics 31:539–544
  9. Taminau J, Thijs G, De Winter H (2008) Pharao: pharmacophore alignment and optimization. J Mol Graph Model 27:161–169
  10. Nittinger E, Schneider N, Lange G, Rarey M (2015) Evidence of water molecules - a statistical evaluation of water molecules based on electron density. J Chem Inf Model 55:771–783
  11. von Behren MM, Volkamer A, Henzler AM, Schomburg KT, Urbaczek S, Rarey M (2013) Fast protein binding site comparison via an index-based screening technology. J Chem Inf Model 53(2):411–422
  12. Kitchen DB, Decornez H, Furr JR, Bajorath J (2004) Docking and scoring in virtual screening for drug discovery: methods and applications. Nat Rev Drug Discov 3:935–949
  13. Wei DQ, Zhang R, Du QS, Gao WN, Li Y, Gao H, Wang SQ, Zhang X, Li AX, Sirois S, Chou KC (2006) Anti-SARS drug screening by molecular docking. Amino Acids 31:73–80
  14. Barreiro G, Guimaraes CR, Tubert-Brohman I, Lyons TM, Tirado-Rives J, Jorgensen WL (2007) Search for non-nucleoside inhibitors of HIV-1 reverse transcriptase using chemical similarity, molecular docking, and MM-GB/SA scoring. J Chem Inf Model 47(6):2416–2428
  15. Tikhonova IG, Sum CS, Neumann S, Engel S, Raaka BM, Costanzi S, Gershengorn MC (2008) Discovery of novel agonists and antagonists of the free fatty acid receptor 1 (FFAR1) using virtual screening. J Med Chem 51:625–633
  16. Lin TW, Melgar MM, Kurth D, Swamidass SJ, Purdon J, Tseng T, Gago G, Baldi P, Gramajo H, Tsai SC (2006) Structure-based inhibitor design of AccD5, an essential acyl-CoA carboxylase carboxyltransferase domain of Mycobacterium tuberculosis. Proc Natl Acad Sci USA 103:3072–3077
  17. Vidal D, Thormann M, Pons M (2006) A novel search engine for virtual screening of very large databases. J Chem Inf Model 46(2):836–843
  18. Mestres J, Knegtel RM (2000) Similarity versus docking in 3D virtual screening. Perspect Drug Discov Des 20(1):191–207
  19. Huang N, Shoichet BK, Irwin JJ (2006) Benchmarking sets for molecular docking. J Med Chem 49(23):6789–6801
  20. Mysinger MM, Carchia M, Irwin JJ, Shoichet BK (2012) Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J Med Chem 55:6582–6594
  21. Wu K (2005) Fastbit: an efficient indexing technology for accelerating data-intensive science. J Phys 16(1):556
  22. Bietz S, Rarey M (2016) SIENA: efficient compilation of selective protein binding site ensembles. J Chem Inf Model 56:248–259
  23. Hilbig M, Rarey M (2015) MONA 2: a light cheminformatics platform for interactive compound library processing. J Chem Inf Model 55:2071–2078

Copyright information

© Springer International Publishing Switzerland 2017

Authors and Affiliations

  1. Center for Bioinformatics, University of Hamburg, Hamburg, Germany
