Background

Molecular similarity is a widely studied topic in chemoinformatics [13] and medicinal chemistry [4]. Molecular similarity is used in ligand-based virtual screening [58], SAR by catalog and to predict side effects. When hits are obtained in a drug discovery project, it is interesting to test similar derivatives in the hope that some will be more potent. Various molecular similarity measures have been developed [9]. Those include measures based on molecular descriptors [10], measures based on molecular fragments (such as MACCS) and measures based on graph matching (such as maximum common substructure searches [1113]). For an extensive reference on molecular descriptors, cf. [10]. The DRAGON software [14] can compute thousands of such descriptors.

Electrostatics are one of the main driving forces of molecular recognition, along with steric complementarity, hydrogen bonding and hydrophobic interactions [15]. It is well known that there is complementarity in shape and electrostatics between a ligand molecule and its receptor protein. Molecules sharing similar electrostatics and shape are expected to bind to the same receptor. This principle has been used in ligand-based virtual screening to look for small molecules similar to known inhibitors and natural substrates [1619]. The electrostatic potential of a molecule originates from the atomic partial charges. While the electrostatic reaches into the long range, the partial charges also contribute to the molecular recognition in the short range where they are the driving forces of polar interactions (hydrogen bonding and salt bridges) determining the specificity of recognition as well as the coordination of bridging waters between the receptor and ligand. Due to our recent study on the electrostatic similarity of small molecules with binding proteins [20], we were interested in the challenge of looking for active compounds while only exploiting the information contained in atomic coordinates and partial charges of a molecule.

The autocorrelation function has long been used in chemoinformatics, by several researchers and in various ways [2127]. In 1980, Gilles Moreau and Pierre Broto proposed to use the autocorrelation function on the molecular graph to encode any property associated with atoms [21]. They subsequently used their autocorrelation descriptor in SAR studies [28, 29]. The atom-type autocorrelation (ATAC) descriptors sum property values of atoms with given types [10]. Atom pairs [22] and Chemically Advanced Template Search (CATS) [30] are examples of such descriptors. The CATS2D descriptor [30], later extended to 3D [26], computes the topological cross-correlation of generalized atom types (hydrogen bond donor/acceptor, positively/negatively charged, lipophilic). The autocorrelation of partial charges of a 3D molecule encoded into histograms was studied [25]. Histograms of the autocorrelation of molecular surface properties were also used in partial least squares analysis to generate ligand-based 3D-QSARs [27], in PCA analysis [23], in Kohonen maps and in training neural networks for the prediction of biological activity [23, 24].

A new molecular descriptor based on partial charges is proposed. It uses the autocorrelation function to encode all atoms of a molecule into two rotation-translation invariant vectors. The method is named ACPC (AutoCorrelation of Partial Charges). Compared to previous methods, neither using histograms nor explicitly computing the electrostatic potential field, but splitting the autocorrelation values based on their sign before applying linear binning [31] with a thin discretization step are essential traits of ACPC. Combined with a scoring function, it can rank-order a database of compounds versus a query molecule. The descriptor is solely based on the 3D distribution of partial charges, uses a single discretization parameter and provides good performance and speed. It was tested in a retrospective ligand-based virtual screening setting. At an average speed of 1649 molecules per second, it reached an average median Area Under the ROCa Curve (AUC) of 0.81 on 40 different targets, outperforming several commonly used methods and making it a useful addition to the arsenal of ligand based virtual screening tools.

Method

The point charge model of electrostatics is considered. In this model, each atom is a point in 3D space with an associated partial charge. The ACPC method first computes the autocorrelation of partial charges of all atoms in a molecule. Then, it separates the positive from negative values before applying linear binning. To score molecules, the sum of cross-correlations at lag zero is computed for corresponding vectors of the two molecules under consideration. A mathematical description follows and a graphical overview can be seen in Figure 1.

Figure 1
figure 1

Overview of the ACPC method. It encodes a 3D molecule into a rotation-translation invariant molecular descriptor based on partial charges and inter-atomic distances. The descriptor consists of two vectors separating the positive from the negative autocorrelation values. These vectors are discretized by linear-binning with discretization step dx in order to obtain the final pair of vectors (LBAC+ and LBAC-).

Algorithm

Let m be a molecule with A atoms.

Let i = (x i ,y i ,z i ,q i ) be the atom at index i in m (1 < = i < = A), with 3D coordinates (x i ,y i ,z i ) and partial charge q i .

Let d ij be the Euclidean distance between atoms i and j of m.

Let k be the lag (an inter-atomic distance in fact).

Let δ k d ij be the Kronecker delta equal to one if k = d ij , zero otherwise.

Therefore, the autocorrelation of molecule m can be written as

AC ( m , k ) = i = 1 A j = 1 A q i q j δ k d ij

Since in practice values of the autocorrelation vector for k = 0 are ignored and each atom pair is considered only once (i,j = j,i), the following formula is used:

ACPC ( m , k ) = i = 1 A - 1 j = i + 1 A q i q j δ k d ij

Since the autocorrelation values can be positive or negative, they are split based on their sign before discretization, in order to avoid cancellation of values. This gives the two vectors AC+ and AC-.

( AC + ( m , k ) , AC - ( m , k ) ) = sign_split ( ACPC ( m , k ) )

such that AC+(m,k) is the subset of ACPC(m,k) where q i  ∗ q j  > = 0 and AC-(m,k) is the subset where q i  ∗ q j  < 0.

ACPC ( m , k ) = AC + ( m , k ) AC - ( m , k )

To convert these two sets into vectors, linear binning [31] is used. Linear binning consists in selecting a discretization step (dx) for an axis, then linearly interpolating each raw data point’s contribution to the two neighboring points on the discretized axis. In our case, a raw data point is a pair of (d ij ,q ij ). Assuming that d ij is between x and z (with z = x + d x) on the discretized axis, q ij is separated into a contribution at x (called c x ) and a contribution at z (called c z ) by applying the formulae

c x = q ij z - d ij dx and c z = q ij d ij - x dx .

The result of linear binning is a vector containing the sum of all these contributions. Therefore, by applying linear binning, two rotation-translation invariant vectors encoding a molecule are obtained:

LBA C + ( m , k , dx ) = linbin ( A C + ( m , k ) , dx ) LBA C - ( m , k , dx ) = linbin ( A C - ( m , k ) , dx )

By default, to score the similarity between two linearly-binned sign-split autocorrelation vectors, cross correlation at lag zero is used. Molecules being encoded in a rotation-translation invariant way, there is no need to scan for the lag which would yield the maximum cross-correlation. This maximum would be at lag zero for similar molecules. Since the goal is to assess the similarity of molecules, scanning the lag would be inappropriate.

Let n be another molecule.

Let l1 = length(LBAC+(m,k,dx)) and l2 = length (LBAC+(n,k,dx)) in p = min(l1,l2) and let l3 = length(LBAC-(m,k,dx)) and l4 = length(LBAC-(n,k,dx)) in q = min(l3,l4).

The cross-correlation at lag zero of molecule m with molecule n is CC(m,n,dx)= k = 0 p LBA C + (m,k,dx)LBA C + (n,k,dx)+ k = 0 q LBA C - (m,k,dx)LBA C - (n,k,dx).

The spectrum obtained for one molecule, before and after linear binning can be seen in Figure 2. Differences in peak heights between AC± and LBAC± are due to linear binning. It is noteworthy that such spectra are not completely uninterpretable. Each Kronecker delta could be textually annotated by the names (from the MOL2 file for example) of the two atoms that give birth to it. Of course, such annotations would be crowded near y = 0 but readable for peaks which are standard deviations away from the mean.

Figure 2
figure 2

Autocorrelogram. In blue and red impulses: positive and negative values of the autocorrelation (AC±) of partial charges for the first ligand of the comt target. The ligand is depicted in 2D at the top right. The linearly binned versions (LBAC±) of the positive and negative parts of the autocorrelogram are the cyan and pink lines.

Software features

ACPC allows rank-ordering of a compound database against a query molecule. By default, ACPC filters out multiple conformers of the same moleculeb by keeping only the best scoring one. Molecular descriptors are computed and scored on the fly. Molecule names, scores and ranks are written to disk. A special program (acpc_big) is also provided to score large databases. In contrary to the default program, acpc_big executes in constant memory space but does not rank nor filter out multiple conformers. For tests on datasets with known active compounds (whose molecule names must be prefixed with the word “active”), the AUC is computed by the CROC package [32]. Several scoring functions are available (CC, Tanimoto, Tverskyref, TverskyDB). The following remaining features are optional. The computation can be parallelized on multicore computers using the Parmap library [33] for multiple queries on the same database. The filtering of multiple conformers can be turned off. The ROC and AUC calculations can be turned off. The ROC curve can be displayed on screen with Gnuplot [34]. The discretization step can be changed.

Results and discussion

Datasets

To evaluate ACPC, two different datasets were prepared. All ligands and decoys come from release two of the Directory of Useful Decoys (DUDr2) [35]. The DUD contains 2950 ligands for 40 different targets. Every ligand has 36 decoy molecules that are physically similar but topologically distinct. OMEGA v2.4.6 [36] was used to generate conformers. The first dataset (named “1conf”) contains a single, lowest energy conformer per molecule. It is further filtered to have unique SMILES. If several molecules have the same SMILES string as assigned by Open Babel [37], only the first encountered molecule was kept. This allows to easily detect and filter out multiple conformers of the same molecule later on, when one wants to keep only the highest scoring conformer. Finally, MOE [38] in combination with its default force-field (MMFF94x) was used to assign partial charges to molecules. As “1conf” is of reasonable size for each target, it is affordable to test each active of a given target in turn as a query, especially for the four software that run well in a cluster environment (ACPC, Pharao, Open Babel and Shape-it). The second dataset (named “25conf”) is the expansion of “1conf” by generating a maximum of 25 low energy conformers per molecule. As “25conf” is a much bigger dataset, each software is run only once with a single query per target. On “25conf”, it is feasible to test all the MOE fingerprints since a single query per target is performed. Cf. section “Availability and requirements” to download these datasets.

Parametrization

The discretization parameter (dx) was setup using 20 randomly selected targets (from a total of 40). Then, for each selected target, only the first ligand in its ligands list was used to measure the AUC reached by ACPC. The dx parameter is introduced because of linear binning [31]. dx allows to balance the trade-off between the speed of the algorithm and the approximation error introduced by binning [31]. Small values of dx are important in our method since they counterbalance the information loss incurred by dimensionality reduction.

The default value proposed (d x = 0.005Å) gives a good compromise between speed and average AUC reached. dx values in the range 0.001 < = dx < = 0.009 are acceptable, but 0.001 consumes too much memory and values over 0.009 don’t perform as well in terms of AUC (Table 1).

Table 1 Effect of the dx parameter

Concerning the charge model used to assign partial charges; the default charge model from MOE (MMFF94x) was used in all experiments. For users with no access to MOE, the Gasteiger charge model [39] from Open Babel performs nearly as well. Other charge models available in Open Babel (QTPIE [40], QEq [41], MMFF94 [42, 43]) perform worse and hence are not recommended for use with ACPC (cf. Table 2).

Table 2 Effect of the force field used to assign partial charges (OB stands for Open Babel)

Additional experiments were carried out retrospectively to confirm the choices of parameters. The test protocol was as follows: 20 targets and five queries per target were randomly selected on “1conf”. Then, the effect of the sign_split function, the dx parameter and the force field used to assign partial charges were measured in three distinct experiments. Those results are shown in Tables 1, 2 and 3. As can be seen from Table 3, the use of the sign_split function gave better AUCs. The finer discretization parameter gave the best AUCs compared to coarser ones (Table 1). Using the MMFF94x force field to assign partial charges gave the best AUCs compared to other force fields (Table 2).

Table 3 Effect of the sign_split function

Validation protocol

ACPC was compared against a diverse set of freely available methods and to the molecular fingerprints available in MOE.

The open source software compared against are: 1) the Pharmacophore alignment and optimization tool Pharao [44], from Silicos-it [45] 2) the purely shape-based tool Shape-it, also from Silicos-it [46] and 3) the MACCS fingerprint as implemented in Open Babel [37].

Pharao is an open source software to align and score small molecules using pharmacophores. Pharao uses 3D Gaussians to represent pharmacophore features instead of the more common points and spheres model. Pharao’s performance has been demonstrated in virtual screening experiments and unsupervised clustering of small molecules [44]. Tversky_ref is recommended with Pharao to score compounds in virtual screening experiments [44].

Shape-it is a tool to align a reference molecule against a database of molecules. Shape-it uses 3D Gaussians to describe the molecular shape [47]. Shape-it can find the alignment of molecules which maximizes their volume overlap. Tversky_ref was used to score compounds with Shape-it, since it gave better results than Tanimoto or Tversky_db.

The MACCS fingerprint is a bit string registering the presence or absence of structural features (MACCS stands for Molecular ACCess System, originally developed by Molecular Design Limited, now Accelrys). In Open Babel, Tanimoto is used to score compounds with MACCS.

All ligand fingerprints available in MOE v2013.08 [38] have also been used. Some of the MOE fingerprints (TAD, TAT, TGT, TGD) can be traced back to the literature [48, 49]. All MOE fingerprints use Tanimoto to score compounds, except ESshape3D and ESshape3d_HYD which use an inverse distance. If an abbreviated name is used in tables, it is given between parentheses below.

  •  ESshape3D_HYD (ES3DH) is an eigenvalue spectrum shape fingerprint. It allows for comparison of 3D shapes made by hydrophobic heavy atoms of a molecule.

  •  ESshape3D (ES3D) is similar to ESshape3D_HYD but uses all heavy atoms instead of just hydrophobic ones.

  •  GpiDAPH3 (Gpi3) is a three points pharmacophore fingerprint calculated from the 2D molecular graph. Each atom is given one of eight atom types computed from three atomic properties (in pi system, is donor, is acceptor). Anions and cations are ignored.

  •  piDAPH3 (pi3) is similar to GpiDAPH3 but uses the molecule’s 3D conformation instead of the 2D molecular graph.

  •  piDAPH4 (pi4) is similar to piDAPH3 but considers quadruplets of pharmacophore features instead of triangles.

  •  TAD is a two points pharmacophore fingerprint calculated from the molecule’s 3D conformation. It considers pairs of pharmacophore features (donor, acceptor, polar, anion, cation, hydrophobic).

  •  TAT is similar to TAD but uses triangles instead of point pairs.

  •  TGD is similar to TAD but uses the 2D molecular graph instead of the 3D conformation.

  •  TGT is similar to TGD but uses triangles instead of point pairs.

PAR [50] was used to accelerate some experiments by parallelizing their execution on a multicore computer.

Performance

As a reminder about reading AUC values from ROC curves: AUC = 0.5 is the performance of a random method. An AUC score of less than 0.5 means that a method perform worse than random. All rationally engineered methods are expected to perform significantly above 0.5 in terms of AUC.

Test results using all possible queries on the “1conf” dataset are shown in Figure 3 as a per target quartile plot and in Table 4 as median AUCs, while Figure 4 gives an aggregated overview.

Figure 3
figure 3

1st, 2nd and 3rd quartile plots for ACPC , MACCS, Pharao and Shape-it on all queries of the “1conf” dataset. For clarity, targets are sorted on the x axis by decreasing median value obtained by ACPC. The red dotted horizontal line at AUC = 0.5 indicates random performance.

Table 4 Median AUCs on the “1conf” dataset
Figure 4
figure 4

Cumulative distribution functions for Shape-it, Pharao, MACCS and ACPC on all queries of the “1conf” dataset.

In Figure 3, three classes are distinguishable. Class 1: ACPC is far in front. Class 2: ACPC’s performance ties or significantly overlaps with at least one other method. Class 3: ACPC is outperformed by at least one other method. Class 1 targets (17 cases): gart, dhfr, parp, pnp, sahn, tk, gbp, fxa, cox2, fgfr1, src, cdk2, hivrt, thrombin, pr, vegfr2 and ache. Class 2 targets (9 cases): tie in eragonist, na, egfr, ppargamma, trypsin, ampc, ada, ar and pde5. Class 3 targets (13 cases): hmga, rxralpha, comt, mr, gr, hivpr, hsp90, alr2, inha, erantagonist, p38, pdgfrb, ace. Cox1 is classified as an exception as no method perform well on this target: ACPC, Pharao and Shape-it all have median AUCs below 0.5 and MACCS has a huge standard deviation with a mean not far from 0.5. By looking at the cumulative distribution function (CDF) from experiments on “1conf” in Figure 4, such statements can directly be read: the probability for ACPC to have an AUC less than 0.5 is about 10%. All other software have a higher probability to have an AUC less than 0.5. By looking at the CDF, the previous statement is seen to hold for ACPC for any AUC > = 0.41. The best performance is reached 20 times by ACPC, 13 times by MACCS, seven times by Pharao and three times by Shape-it.

Test results on the “25conf” dataset are shown in Table 5. In terms of average AUC reached, the best method is ACPC (0.78) followed by GpiDAPH3 (0.72) then MACCS (0.7). The best AUC is reached 17 times by ACPC, five times by Pharao and four times by the MOE fingerprints TAD and TAT. On average, the GpiDAPH3 fingerprint from MOE is the second best method on this dataset. In order to confirm the stability of the test results on the “25conf” dataset using a single query, ACPC was run with all the active ligands as queries. The resulting average AUCs are shown in column ACPC25c in Table 5. For comparison, the average AUCs from the “1conf” dataset are also calculated and shown in columns ACPC1c in Table 5. The performance of ACPC on “25conf” using a single query is stable, since the average AUCs are comparable to that using all the actives as queries. Moreover, the performance of ACPC using either a single query or all actives as queries is similar to that from the “1conf” dataset.

Table 5 Average AUCs on the “25conf” dataset

Speeds of the software ran on “1conf” are shown in Figure 5. Speed tests were performed on one core of a 2.4 GHz Intel Xeon workstation with 12 GB of RAM running Ubuntu Linux 12.04. Speed measurements were done on the egfr target, since it has the most ligands and decoys. Reported numbers were averaged over three runs. Only scoring was performed by each method, no ranking or filtering of compounds was done. ACPC and Shape-it screen a database of compounds read from a MOL2 file. Open Babel reads compounds from a SMILES file. Pharao reads compounds from a.phar file (its own text format for pharmacophore features). ACPC processed 1649 molecules per second, Pharao 1416, Open Babel 553 and Shape-it 65.

Figure 5
figure 5

Average processing speed of ACPC , Pharao, Open Babel (OB) and Shape-it on the target with the most ligands and decoys (egfr).

The scaffold diversity among actives in the top 10 to 50 ranked molecules was also investigated for cox2 and egfr, the two targets with the most ligand clusters (44 and 40 respectively). Andrew Good’s clustering analysis of DUD (http://dud.docking.org/clusters/) was used to assign each active to a cluster via reduced graphs [51]. Ten random queries were used on each target then cluster of actives in the top 10 to 50 molecules were analyzed for each method (Table 6). While the number of distinct clusters of actives found in the top 10 molecules is somewhat comparable for ACPC, Pharao, MACCS and Shape-it; the rate at which new clusters of actives are discovered by ACPC doesn’t increase as fast as other methods. But this is to be expected; pharmacophores (Pharao) are a powerful way of generalizing atom types while shape (Shape-it) is an even more permissive representation of molecules.

Table 6 Average number of distinct clusters found for active molecules among the top N ranked molecules for cox2 and egfr in “1conf” using 10 random queries

Early during the development of the method, scoring the similarity of descriptors with Pearson’s r and Spearman’s rank order correlation coefficient [52] was tried. Later on, distance metrics for continuous variables were tried (some equations can be found in [26, 44]): the Tanimoto coefficient, Tversky_ref, Tversky_db and 1 1 + d where d is the Manhattan distance or the Euclidean distance. None of these performed better than cross-correlation in tests on small random partitions of the DUD (20 queries chosen randomly across all ligands and targets).

ACPC’s good performance might be explained by the fact that all atoms of a molecule are considered at the same time and all intra molecular distances are handled in the same way. While atom centers partially encode the shape, partial charges encode some of the recognition features (eg. hydrogen bond acceptors and donors). ACPC measures the global similarity of two molecules in terms of intra molecular vectors and partial charges. The high number of molecules per second the method can process is a direct benefit from its simplicity and reliance on a strong mathematical property: being rotation-translation invariant. In ACPC, molecules do not need to be optimally superposed before scoring and the electrostatic potential field is not computed.

Recommended usage

ACPC was designed to be rotation and translation invariant. However, ACPC is not invariant to the conformer of a molecule, neither to the charge model that was used to assign partial charges (Table 2). ACPC is also sensitive to the choice of the dx parameter (Table 1). Hence, the following protocol has been validated: the query molecule(s) and the database to screen must be prepared in the same way. The same software with same parameters must be used to assign partial charges and generate conformers for all molecules.

Limitations

Our method, by construction, cannot distinguish a molecule m from its enantiomer. If n is a perfect mirror image of m and m is a copy of m with all partial charges reversed (sign flipped) and n is a copy of n with all partial charges reversed, then they cannot be distinguished.

CC ( m , m , dx ) = CC ( m , m , dx ) = CC ( m , n , dx ) = CC ( m , n , dx )

Also, since the method is purely based on partial charges, it should perform poorly if the recognition of a binding site by a query ligand is not mostly driven by electrostatics. This should be the case for binding sites that are predominantly non-polar.

While release two of the DUD [35] was used to prepare datasets, it has known shortcomings. DUD was initially created as a benchmark set for molecular docking but, like some previous authors [19, 53, 54], we use it to test ligand-based virtual screening methods. If some ligands L1 and L2 of the same target are targeting different sub-pockets of the binding site, it is incorrect to use one of them (in a ligand-based approach) to find the other. Another problem previously noticed by other authors [55] is that DUD decoys are only supposed to be inactive. If tested experimentally, some decoys may be found to be active. In future studies, DUD Enhanced, a more recent version of DUD [56] which overcomes some of its previous drawbacks and includes more targets might be used.

Potential users should keep in mind that there is no silver bullet for in-silico drug discovery. ACPC is no exception. From previously shown results, MACCS or Pharao or some of the MOE fingerprints are seen to perform better than ACPC on several targets. Also, Pharao, MACCS and Shape-it seem to promote scaffold diversity of actives earlier in the ranked list of compounds (Table 6).

Upcoming features

The following features are under consideration for future releases of ACPC. On the purely technical side, a GPU-based version of the tool is doable [57] to reach higher throughput. Combined with a metric distance, clustering compounds databases would then become computationally tractable. Other interesting topics would require more research and experiments, such as the automatic creation of a consensus query from a set of known actives, investigating other orthogonal feature spaces such as atomic radii, solvent accessible areas and per atom hydrophobic contribution, to increase the discriminative power of the method. The choice and parametrization of suitable kernel functions for use in Kernel Density Estimates (KDE) is also an interesting direction that could result in reduced sensitivity to the discretization parameter and probably better AUCs (at the cost of heavier computation). With KDE, the method could probably be extended to cluster binding sites based on the partial charges of their surface atoms.

Conclusions

We revisited the family of rotation-translation invariant molecular descriptors and proposed a new, simple but powerful encoding of the autocorrelation of partial charges of a 3D molecule. Our implementation is fast and displays good performance in retrospective ligand-based virtual screening experiments. ACPC should be a useful tool for ligand-based virtual screening. ACPC is open source, freely available and automatically installable as an OPAM [58] package. Requests and contributions from users are welcome.

Availability and requirements

Project Name: ACPC

Project home page: http://www.riken.jp/zhangiru/software.html

Operating system: Linux

Programming language: OCaml http://ocaml.org

Other requirements: the OCaml Package Manager (OPAM) http://opam.ocaml.org/

Dataset (352 MB):http://www.riken.jp/zhangiru/software/DUD_ACPC_1.0_validation.tar.xz The CROC Python package (optional) http://pypi.python.org/pypi/CROC/ Gnuplot [34] (optional) http://www.gnuplot.info/

License: BSD

Any restrictions to use by non-academics: None

Endnotes

aReceiver Operating Characteristic.

bThey should have the same molecule name.