Introduction

Computational methods are an indispensable part of the drug design process, supporting the search for new compounds with desired biological activity. One of the most popular in silico strategies is virtual screening (VS), in which large libraries of compounds (commercially available or generated computationally using various combinatorial approaches) undergo nonexperimental evaluation [1,2,3]. Usually, one of the first filters in VS cascades is connected with the application of ligand-based approaches [4], when only structures of compounds with already determined affinity for a particular receptor are used for making predictions for new compounds. The structures that successfully pass to subsequent filtering steps are typically assessed by the confrontation of the features they possess with pharmacophore models constructed for a particular target [5, 6] and/or are docked to the binding site of the respective protein (structure-based VS [7]).

One of the fundamental issues connected with the application of various computational methods in the search for new drug candidates is the provision of proper conversion of the chemical structure to a form that can be handled by computer methods. The most popular way to capture chemical information is the application of numerical descriptors or fingerprints [8,9,10,11]. The former method usually characterizes the physicochemical properties of compounds, and the examples of such descriptors are as follows: molecular weight, octanol–water partition coefficient (logP), pKa, number of hydrogen bond donors, number of hydrogen bond acceptors, number of atoms of a particular type, number of bonds of a particular type, atomic charges, polarity, molecular volume, etc. On the other hand, fingerprints are a result of the translation of compounds into the form of a bit string. Their two main groups can be distinguished: key-based and of a hashed type. Key-based fingerprinting annotates the presence (1) or absence (0) of particular chemical moieties in the molecule, whereas in the hashed fingerprints, bit strings are formed from the molecular graph, on the basis of which paths up to a fixed length are generated (subsequently, starting from each atom), and a hashing function is applied to encapsulate the structural information in the string. Fingerprints are used not only for chemical structure characterization, but they are also applied for description of ligand–receptor complexes obtained in docking. They provide information about the interaction of a ligand with particular amino acids of a protein, as in the case of interaction fingerprints (IFts) [12] and structural interaction fingerprints (SIFts) [13]. The most important advantages of fingerprints are the relative simplicity, low computational costs connected with their generation, and simplicity of making comparisons between two 0–1 strings. The latter procedure can be carried out with the use of various similarity coefficients or by application of machine learning approaches [14, 15].

The ligand-based approaches usually make also use of various molecular descriptors. They are generated from the two-dimensional (2d) structure of the compound (2d descriptors), or they use their spatial orientation obtained either in minimization of a ligand solely or using the docking poses—three-dimensional (3d) descriptors [16, 17].

Predictive models (constructed most often with the use of machine learning methods of various complexities) based on such representations can be of great help in the search for new active compounds. Such approaches are widely known for the ability to deal with a large amount of information in a fast and efficient way, so the application of machine learning is rapidly growing also in the field of computer-aided drug design, mainly due to a significant increase in amount of data also in the field of cheminformatics and pharmacy. However, despite the great power possessed by various algorithms in the identification of new potentially active compounds, their performance strongly depends on experimental conditions, such as training set compositions, compounds representation, and parameters of particular learning algorithms themselves. Therefore, a multifactor optimization needs to be performed in order to obtain the optimal predictive power of such methods [18, 19].

In this study, we focused on one of the above-mentioned parameters influencing the performance of machine learning methods performance, that is, the compounds representation, 3d descriptors in particular. The variations of their values, depending on the input used for their generation (docking pose), were analyzed. It was examined whether the consistency of 3d descriptor values obtained for different orientations in the binding site of the same compound is correlated with the compound activity. All existing crystal structures of aminergic G protein-coupled receptors (GPCRs) were used for docking of known ligands and compounds with confirmed inactivity toward the respective proteins. The 3d descriptor values calculated for poses obtained from docking for active and inactive compounds were compared with those descriptors that were generated from compounds with minimized energies (prepared in LigPrep for docking) and compared with the analysis of variations in atom positions in the docking poses of a particular compound. The results of such study were also made available online (http://chem.gmum.net/vischem_stability), together with the possibility to manually analyze chemical structures referring to a particular data point. The differences in the distribution of 3d descriptor values indicate that not all crystal structures are effective in this type of experiments (docking followed by automatic analysis of its results), and the prepared tool can be of great help during the design of such studies, especially by assisting in the selection of crystal structures for docking that would provide optimal performance of machine learning methods (when the distribution of descriptor values for active and inactive compounds will be too similar, the machine learning algorithms would also face difficulties in making distinction between these two groups of molecules).

Methods

The compounds with experimentally determined activity/inactivity were fetched from the ChEMBL database [20] according to the previously described protocol [21]: Only data produced on human- and rat-cloned receptors were considered, sets of active compounds included structures with Ki parameter values below 100 nM (also activities expressed as IC50 were taken into account, assuming that Ki = IC50/2) [22], and the sets of inactive compounds included structures with Ki (or equivalent activity parameter) values above 1000 nM. The considered targets were all aminergic GPCRs for which crystal structures are available, including serotonin receptors 5-HT1B, 5-HT2B, 5-HT2C; muscarinic receptors M1, M2, M3, M4; adrenergic receptors beta1, beta2; dopamine receptors D2, D3, D4; and histamine receptor H1. All respective crystal structures were fetched from the PDB database [23], and they are presented in Table 1.

Table 1 Crystal structures used in the study

The compounds were prepared for docking in LigPrep [59] (protonation states generated at pH 7.4 ± 0.0; a maximum of four stereoisomers per compound was allowed), and the number of compounds in each dataset is presented in Table 2.

Table 2 Number of compounds from each dataset used in the study

The compounds were docked in Glide to the respective crystal structures; a maximum of ten docking poses were allowed for each compound. For each compound conformation obtained in docking, 3d descriptors were generated using the recently published package for descriptors calculation—Mordred [60]. It contains 214 3d descriptors grouped into the following categories: CPSA, geometrical index, gravitational index, MoRSE, moment of inertia.

The descriptor-based compound representations were compared in the following way—for each compound, the standard deviation (std) between values of all descriptors obtained for all docking poses of a particular compound was calculated and visualized using LargeVis by embedding into a 2d space. The results of such analysis were prepared in an interactive way and made available online, so chemical structures referring to a particular data point can be manually analyzed (http://chem.gmum.net/vischem_stability). In addition, studies considering the first k conformations (for compound sets sorted by docking score) were performed with k adopting values 3, 5, and 10. Moreover, the stds of coordinates of each atom of a compound were calculated (after the compounds alignment) and compared to determine how differences in descriptor values are related to differences in atom positions for various docking poses (Fig. 1).

Fig. 1
figure 1

Example analysis of consistency in compounds docked poses presented as the average std of all descriptors used for compounds representation

The scheme of the whole project is presented in Fig. 2.

Fig. 2
figure 2

Scheme of the study presented in the paper

Results and discussion

The averaged std values of generated descriptors between various compound conformations, depending on the number of compound orientations taken into account, are gathered in Table 3 with the stds of atom positions for docking poses obtained for a particular compound.

Table 3 Comparison of average std of between analyzed 3d descriptors and atom coordinates

The analysis clearly shows that the variation of compound orientations in the binding site depends on the crystal structure used for docking rather than on the compound affinity for the receptor. Moreover, the rate of variation in the atom positions is not necessarily correlated with the rate of variation in the 3d descriptor values, although it is the most common situation. Additionally, active compounds were not more consistently docked (in terms of both 3d descriptor values and atom positions) in comparison with inactive molecules, although intuitively and theoretically that should be the case.

The number of docking poses taken into account in the performed analyses has only a slight impact on the observed dependencies that were rather consistent for a particular crystal structure, in terms of both similarity between 3d descriptor values and consistency of compound orientations described as the std of their atom coordinates. For the 5-HT1B crystal structures, the std of both 3d descriptor values and atom coordinates was higher for active than inactive compounds for all cases. On the other hand, 5-HT2B ligands were, in general, more consistent in terms of the analyzed properties than compounds that were inactive for this receptor.

Interestingly, 5-HT2C and ACM1 ligands were more consistent compared with inactive compounds in terms of atom coordinates but less consistent when 3d descriptors were considered. However, for the rest of the muscarinic receptors, the tendencies were again shifted toward higher inconsistency for active compounds.

Adrenergic receptors were, in general, characterized by higher consistency for active compounds in terms of both parameters considered; however, there are examples of crystal structures for which there were variations in the consistency of 3d descriptor values and atom coordinates—such as 2R4R, 2R4S, 3KJ6, 3P0G, and 5D5A. The above-mentioned differences occurred also for the crystal structure of histamine receptor H1.

Example differences in docking poses for various crystal structures for two pairs of structurally similar active and inactive compounds (CHEMBL428892, CHEMBL66310, and CHEMBL387545, CHEMBL225364 pairs) toward 5-HT1B are presented in Fig. 3.

Fig. 3
figure 3

Analysis of docking results of selected compounds to crystal structures of serotonin receptor 5-HT1B; green: active compound, red: inactive compound; D3.32 residue of the protein is visualized as sticks. (Color figure online)

The docking results presented in Fig. 3 confirm the strong dependence of the consistency of a compound docking pose on the crystal structure used for docking. The active compound, CHEMBL428892, was docked in varying poses for the 4IAQ crystal structure, whereas when 4IAR and 5V54 were used, it similarly fit in the respective binding site. On the other hand, its inactive analog, CHEMBL66310, was docked less consistently to all crystals used; the highest variations occurred for 5V54, as one of the poses was flipped. CHEMBL387545 adopted two different orientations in the binding site of 4IAQ, and two in 4IAR, whereas for 5V54, all of the obtained docking poses were similar. CHEMBL225364, despite being inactive toward 5-HT1B, was docked very similarly to its 4IAQ and 5V54 crystal structures with conformation variations occurring for 4IAR.

Example visualization of the 3d descriptors space is presented in Fig. 4. It can be observed that for the outcome of this docking studies, representation by 3d descriptors will not necessarily provide good performance of predictive model constructed with the aim of making distinction between active and inactive compounds toward ACM3.

Fig. 4
figure 4

Example analysis of stability of docking poses for ligands of the ACM3 receptor. Gray area refers to compounds active or inactive toward other targets considered in the study. In this case, active compounds are relatively well demarcated from the inactive ones

Visualizations for all crystal structures are available online at http://chem.gmum.net/vischem_stability, together with the provision of compound structures referring to a particular data point (Fig. 5). The possibility to manually analyze the results enables the explanation and optimization of machine learning methods performance and other automatic and semiautomatic analyses performed on 3d descriptors as compounds representation. Such visualizations can be of great help at the stage of designing such experiments, e.g., in terms of choosing the crystal structure that provides the best discrimination between active and inactive compounds, especially considering that the consistency in 3d descriptor space is not necessarily correlated with the std of atom coordinates in various docking poses, or in terms of proper evaluation of the applicability of a model, by the analysis of chemical structures of compounds that occupy similar regions of 3d descriptors space, despite possessing different activities toward considered receptor.

Fig. 5
figure 5

Example visualization of the prepared online tool. Red points refer to active compounds and black to the inactive ones. Chemical structure corresponding to each data point can be manually analyzed. (Color figure online)

To explain the causes of such dependencies of the compound pose variations, the analysis of the relationship between the number of rotatable bonds and the std in 3d descriptor values, expressed as the Pearson correlation coefficient, was carried out (Table 4; for simplification, the results were averaged for all crystal structures for a particular protein). According to intuition, the performed analysis indicated a strong dependency of the consistency in descriptor values on the number of rotatable bonds (the higher the number of rotatable bonds, the lower the consistency of 3d descriptor values) with Pearson correlation values approaching 0.718 for 5-HT2B. Another observation is that in general, the correlation between the number of rotatable bonds and the variations in compound orientations in the binding site (described by 3d descriptors) was stronger when a higher number of docked poses was taken into account. Example visualizations of such dependencies are presented in Fig. 6, and respective figures for the remaining data are present in the Supplementary Material.

Table 4 Comparison of values of the Pearson correlation coefficient between the number of rotatable bonds and the average std of the 3d descriptor values for particular compound conformations
Fig. 6
figure 6

Analysis of average std of the 3d descriptor values depending on the number of rotatable bonds in a particular compound for 5-HT2BR ligands, when ten docked poses were taken into account

Moreover, an analysis of std of each descriptor values separately was performed (Supplementary Material; for simplification, the obtained std values were averaged for all crystal structures available for a particular protein). It was found out that the lowest variation occurred for 3D-MoRSE descriptors determined for a distance equal to 1 (std of 1–2 × 10−18), whereas the highest std values were on average obtained for descriptors from the group of geometrical indexes, such as the geometric Petitjean index and the geometrical shape index (std of 0.5–0.7 on average). Relatively high fluctuations in descriptor values (of similar std range as in the case of the geometrical shape index descriptors) were also observed for 3D-MoRSE descriptors calculated for a distance equal to 10.

Concluding remarks

In summary, a deep analysis of 3d descriptors generated for compounds docked to crystal structures of aminergic GPCRs, with the aim of their application in machine learning experiments, supplemented with the tool for manual inspection of structures referring to a particular data point was prepared. A strong dependence of the obtained results on the crystal structure used for docking was proved. Moreover, although the variation of 3d descriptor values is typically correlated with the variation of compound conformations, there were several cases where these dependencies were not preserved, revealing the limitations of the applied depiction of docking poses (for the ideal representation, higher variation in docking poses should led to higher variation in descriptor values and vice versa). Additionally, the ability to manually analyze all the information, including the analysis of chemical structures referring to particular data points, enables a better design of machine learning experiments conducted on such type of data, allowing for the maximization of the power of predictive models, for example by proper selection of crystal structures for studies (those that provide the best discrimination between active and inactive compounds in machine learning-based experiments) or the proper evaluation of the model applicability via the analysis of chemical structures with overlapping fragments of descriptors space.