Background

The concept of molecular similarity is commonly used in different areas of chemistry including drug discovery. This is because one of the core paradigms in drug design is that similar compounds share similar properties. A number of molecular representations and similarity coefficients have been proposed [1] to quantify the molecular similarity between single molecular structures and compound libraries.

In chemoinformatics, molecular fingerprints are one of the most common representations of chemical structures. Representations of this type are simplifications of the chemical information contained in any chemical entity through binary vectors. Figure 1a illustrates a schematic representation of a binary fingerprint representation of a chemical structure. Each position in the vector indicates the absence (0) or presence (1) of features predetermined in the design of the fingerprint. For instance, binary vectors developed thus far are the Molecular ACCess System (MACCS) keys [2] and PubChem fingerprints. Despite the fact binary fingerprints lacks of accuracy, they have the advantage of increasing calculation speed and reducing storage space. These features, combined with broad applicability for several years have made molecular fingerprints one of the standard representations to measure molecular diversity among several other applications. However, since the amount of information stored in molecular databases is increasing constantly, there is a need to generate simplifications of the molecular representation of compound databases to open new approaches to studies of the chemical space, optimize the storage and enhance the speed of computations.

Fig. 1
figure 1

a Schematic representation of a binary and dictionary-based molecular fingerprint. b Schematic representation of a database fingerprint (DFP)

The goal of this work was to introduce a new binary fingerprint that encodes the main features of a compound data set. The herein called database fingerprint (DFP) is schematically illustrated in Fig. 1b and further explained throughout this Preliminary Communication. The DFP is inspired on the concept of Shannon entropy (SE) [3] and is based on redundancies present in binary representations. It is well known that the redundancies present in a given signal are the responsible of the information content and therefore of the indirect relation with noise and SE. DFP take advantage of these facts to extract the general pattern of molecular information contained in chemical compound sets represented with any binary fingerprint. As case of study, a DFP was generated for ten data sets of general interest in chemistry with particular emphasis on drug discovery. The basic concept of DFP is illustrated first with a small fingerprint (MACCS keys 166-bits) for relative small data sets (up to 1500 molecules). Then, the application of DFP is shown for a newer and more complex molecular representation (PubChem fingerprints) for larger databases up to 25,000 molecules. Related molecular representation methods like bit fingerprints and different informational content metrics can be complementary to DFP in studies of consensus chemical space characterization [47]. One of such approaches is the modal fingerprint. This fingerprint is based on common molecular paths found in chemical sets to determine a unique representation of 2048 bits long that depends in a preset percentage of the database used. This representation can contain, for example, carbonyl or amide functional groups, but also molecular fragments or complete molecular structures [8].

Methods

DFP concept and construction

The main steps to construct the DFP are shown in Fig. 2. To illustrate the concept of DFP, MACCS keys (166-bits) [2] were calculated for the ten compound data sets in Table 1 using MayaChemTools [9]. As a reference, 1500 binary vectors 166-bit long were generated randomly with the server www.random.org (that uses atmospheric noise to generate random numbers). Since the focus of this work was the generation of a novel fingerprint representation that includes the main features (bit positions) of the compounds in a molecular library, the following approach, inspired on the concept and applications of SE [10, 11] was followed: Firstly, for each binary digit position of the features encoded in the MACCS keys the frequencies and probabilities were recorded. Then, the total SE of the distribution of the 166-bits in the MACCS keys was computed (as a metric of molecular diversity).

Fig. 2
figure 2

Overview of the approach implemented in this work

Table 1 Compound databases used to illustrate the concept of DFP

To generate the DFP a threshold for the bit probability was established. If the probability for a given bit was greater than the threshold, the bit position was assigned with a number 1. If the probability was equal or lower than the threshold, the bit position was assigned with a number zero. Lastly, to construct the DFP with MACCS keys, two different probability thresholds were explored as first approach: (a) the mean value of the probability distribution of the herein calculated random vectors (0.55) and (b) the mean probability of a data set plus one standard deviation.

To illustrate the concept of the DFP ten data sets were chosen as test cases (Table 1). The compound collections cover a broad range of sizes (ranging from 92 to 1500 molecules) and structural features. Data sets included a small group of 92 synthetic compounds sharing the benzimidazole scaffold (this data set has been used in activity landscape studies [12], a commercial set of 113 molecules for epigenetic drug discovery (‘Epigenetic focused’), an in-house data set with 566 compounds tested as inhibitors of DNA methyltransferase 1 (DNMT1). This set has been used in chemoinformatic analysis of the epigenetic relevant chemical space [13, 14]. Other compound collections used here were 837 molecules in clinical trials (‘Clinical’), a general screening collection (typically used in high-throughput screening—HTS) with 1100 molecules, 1498 natural products and 1498 semi-synthetic compounds, 1490 drugs approved for clinical use [15], 1500 generally recognized as safe (GRAS) compounds [16] and a set of 1500 molecules selected from Generated Data Base 13 (GDB13) available at http://gdb.unibe.ch/downloads/ [17].

DFP application with PubChem fingerprint and larger data sets

The application of the DFP was applied on 100–25,000 compound databases (Table 2). To this end, we used the PubChem fingerprint that is a newer and more complex molecular representation. For this section we increased the number of compounds for several libraries and included a data set used in HTS with 15,000 molecules (PrimScreen 15 available at http://www.otavachemicals.com/download-compound-libraries/cat_view/110-diversity-sets). The PubChem fingerprint encodes molecular fragments information with 881 binary digits. The list of the substructure encoded on each bit can be accessed at ftp://ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt. This molecular representation was selected to calculate the bit position frequencies and probability distributions to construct the DFP for the original databases.

Table 2 Compound databases used to show the application of DFP

For this part, three different thresholds (0.5, 0.6 and 0.7), the informational significant bit positions were selected using Differential Shannon Entropy [18] implemented in the IMMAN package software [19]. The probability distribution and relation between classical Shannon entropy average, DFP/Tanimoto similarity and k-mean clustering of the informational significant bit positions was studied.

Results and discussion

This section is organized in two major parts. First, the concept of DFP is discussed using MACCS keys for compound data sets up to 1500 compounds. The second part shows an application of DFP with PubChem fingerprints for larger data sets.

Distribution of binary fingerprint: SE as metric of database diversity

Figure 3 shows the probability distributions of MACCS keys (166-bits) for three representative data sets (drugs, benzimidazoles, and Epigenetic-focused) plus the randomly generated binary fingerprints as a reference. The probability distributions of the other compound data sets are shown in Additional file 1: Fig. S1. The corresponding SE values for each probability distribution is shown in each group and are further reported in Table 1 for all data sets. In addition, Table 1 summarizes the mean similarity value using the MACCS keys fingerprints and Tanimoto index (MACCS keys/Tanimoto similarity) of all ten data sets. Table 1 and Fig. 3; Additional file 1: Fig. S1 show that each data set had different values of SE that was associated with the mean MACCS keys/Tanimoto similarity.

Fig. 3
figure 3

Probability distributions of MACCS keys (166-bits) of representative data sets studied in this work. The number of compounds, mean MACCS keys/Tanimto similarity, and Shannon entropy (SE) are shown

Figure 4 shows the relationship between SE and mean MACCS keys/Tanimoto similarity. The plot shows that high SE is associated with high intra-set diversity i.e., low similarity. Likewise, lower SE is associated with high similarity. Of note, SE is not a magnitude that can be expressed in terms of an absolute scale because no upper limit boundaries are known. A general observation is that high SE is an indicative that it is less likely that two compounds in the data set have similar fingerprint representation. If this observation is repeated for many pairs of compounds in the data set, then the overall similarity of the compound data set is low and the mean similarity of the data set is expected to be low. In contrast, if the overall SE of the data set is (relatively) low, it is likely that two molecules in the data set have similar fingerprint representation. Therefore, it is expected that the overall diversity of the data set is (relatively) low e.g., the overall similarity of the compound data set is high. This general trend was observed for nine out of ten data sets.

Fig. 4
figure 4

Relationship Shannon Entropy and MACCS keys/Tanimoto similarity for the ten compound data sets in Table 1. A drugs, I general screening, C clinical, G GDB13, D DNMT1, E epigenetic focused, M semi-synthetic, N natural products, B benzimidazole, GR GRAS, R random

A notable exception was the GRAS set: SE of the MACCS keys has a relative low value (30) but the data set has high diversity (as measured with MACCS keys/Tanimoto <0.40). In other words, despite the fact that there is a relative low entropy in the fingerprint representation of GRAS, it happens that the likelihood that two compounds share similar fingerprint representation is low. It is worth noting that MACCS keys/Tanimoto captures pair-wise relationships that are not directly captured by the SE of the entire fingerprint. A second notable exception was the random set that had, as expected, the highest SE value (above 80) but MACCS keys/Tanimoto similarity of 0.33. The distinct feature of GRAS (as compared to the other data sets considered in this work) can be related to the particular structural features of molecules in this data set. It has been shown that GRAS molecules have a high content of aliphatic chain and has a low diversity of molecular scaffolds [20]. It should also be considered that MACCS keys is unable to capture the particular features of GRAS compounds.

The plot in Fig. 4 shows two main clusters that group together the different data sets. These databases can be related through the nature of the compounds in each cluster. In the larger cluster (upper left), all the data sets, with exception of GDB13, are related to synthetic bioactive molecules. While the small cluster contains data sets that include natural products, semi-synthetic natural products and benzimidazole derivatives, all present in living organisms.

Based on the above results, it can be suggested that SE of the probability distributions of MACCS keys (166-bits) can be used as an additional criterion to rapidly assess the fingerprint-based diversity of compound data sets. Of course, additional metrics and criteria e.g., scaffold diversity, should be considered for a comprehensive assessment of the structural diversity of data sets [21]. It is worth noting that the concept of SE was initially used to measure the content of information in particular messages [3]. Nowadays, along with similarity and molecular scaffolds, SE has been implemented to measure scaffold diversity [10, 22]. In chemoinformatics, SE is also related to the generation of many kinds of molecular representations based on graph theory and virtual similarity searches, among others [23, 24]. In particular, SE has been used previously to determine the similarity between a given molecule and a focused library [24]. In that approach, Wang et al. calculated the variation of SE of a focused library with and without a given compound to determine their similarity with the redundant futures present in the database.

DFP

As described above, 166-bit long DFP were generated for all ten compound data sets in Table 1. Representative DFP of selected data sets are shown in Additional file 1: Table S1. Two different thresholds were used to determine the limit redundancy value, the mean probability of a random distribution and the inter-mean plus one database standard deviation (vide supra). As described below, to select the most representative threshold value, a comparison with city block distance was performed. Using this criteria one DFP per database was calculated with the different thresholds, resulting in the selection of the mean probability of a random distribution as a final threshold.

DFP and inter-set relationships

Table 3 shows the city block distance [1] between the data sets considering the newly developed DFP. A 2D visualization of the distance matrix is presented in the Additional file 1: Fig. S2.

Table 3 Inter-set relationships of the compound data sets computed with the database fingerprint (DFP) and city block distance

As expected, the randomly generated set was the most distant i.e., most dissimilar, to the other ten data sets with real molecules. In agreement with previous publications [13, 14] there was a small distance between compounds in the clinic (‘Clinical’) and general screening and approved drugs. Similarly, there was a small distance between the commercial molecules focused on epigenetic targets (‘Epigenetic focused’) and compounds for general screening and molecules in the ‘Clinic’. Indeed, it can be expected a large overlap between the chemical spaces of all these data sets using MACCS keys from which the DFP was designed. In contrast, after random, GRAS compounds were the second most distant to all other data sets considered in this study. This is consistent with previous results that support that GRAS molecules are dissimilar to other databases commonly used in drug discovery using MACCS keys [25].

Taken together the results in Table 3, further visualized in Additional file 1: Fig. S2, can be concluded that the newly DFP is a reasonable approximation of the fingerprint-based representation of a molecular database. Similar trends between the inter-set relationships were obtained with the DFP and the Tanimoto coefficient (Additional file 1: Table S2 and Fig. S3), and the inter-set relationships calculated with MACCS keys and the Tanimoto coefficient (Additional file 1: Fig. S4).

DFP and intra-set relationship

Table 4 shows the relationship between the intra-set mean similarities computed with two strategies, namely; a classical approach calculated the pair-wise mean similarity with MACCS keys/Tanimoto coefficient. The second approach was an approximation of the intra-set similarity using the newly proposed DFP: for each data set, the similarity based on the DFP was calculated as the mean similarity between the MACCS keys representation of each compound and the DFP of the data set. Results summarized in Table 4 (and plotted in Additional file 1: Fig. S5) show a direct relationship between these two values supporting the hypothesis that DFP was able to retain the general information contained in a given compound data set. Even if DFP underestimated the similarity values (Table 4), it was a reasonable tool to estimate the intra-set molecular diversity, since these comparison studies are relative to the databases.

Table 4 Intra-set mean similarity of the compound data sets

DFP application with PubChem fingerprint and larger data sets

For three different thresholds (0.5, 0.6 and 0.7) the informational significant bit positions of PubChem, 198, 180, and 159 respectively, were selected using Differential Shannon Entropy implemented in IMMAN package software. Figure 5 shows the classical Shannon entropy average versus the average DFP/Tanimoto Similarity based in the 198 information significant bit positions obtained with a 0.5 threshold with IMMAN software. Figure 5 also displays the databases cluster membership on five clusters obtained with k-mean Euclidean distances implemented in WEKA software [26].

Fig. 5
figure 5

Relationship Shannon entropy and DFP/Tanimoto similarity and k-mean Euclidean clustering for the ten compound data sets in Table 2 at threshold of 0.5 threshold value

Similar to Figs. 4, 5 shows two main clusters that group together different data sets that contain chemically related compounds. For instance, in the larger cluster colored blue, all the data sets, with exception of PS15, are related to synthetic bioactive molecules. While the small two-member clusters, in red color, group FDA and Approved datasets. The one-member clusters correlates with the previously reported distinct nature of GRAS, MEGx, and Benzi compounds.

This general grouping of compound data sets in Fig. 5 is consistent with the probability distribution of the 198 significant bit positions recovered from the original databases represented by PubChem fingerprints. In Fig. 6 the datasets probability distributions can by grouped in a similar way to the cluster membership illustrated in Fig. 5.

Fig. 6
figure 6

Probability distribution of the 198 significant bit positions recovered from the original databases represented by PubChem fingerprint at threshold of 0.5

The same analysis was applied for 0.6 and 0.7 DFP thresholds. The implementation of this cutoff criteria led to a significant decrease in the resolution of the DFP to distinguish differences between the databases studied. The respective probability distributions and classical Shannon entropy average versus the average DFP/Tanimoto Similarity plots, with k-mean clustering, can be found in the Additional file 1: Figs. S6–S9.

Conclusions and perspectives

In this Preliminary Communication we introduced the DFP as an approach to generate a binary fingerprint representation of a compound collection with a fixed size. The new fingerprint has the ability to include the main structural futures of the molecules in the data set. The construction of the DFP is based on the distribution of the probabilities of each position in a given binary fingerprint of fixed length. A test cases, DFP were generated for ten compound data sets of different size using, as an example, a short and commonly used fingerprint representation: MACCS keys (166-bits). The application of DFP is also illustrated for large molecular libraries using PubChem fingerprints, with a total of 881-bits. DFP for compound data sets with a broad range size (ranging from 100 to 25,000 molecules) were calculated using three different threshold values to explore the fingerprint behavior with respect to database size, diversity, cutoff criteria and different content of information metrics. It was concluded that DFPs are reasonable representations of the compound data sets to measure the intra- and inter-set relationships. One of the principal perspectives of DFP is its performance in virtual screening and library design applications. Despite the fact that a quantitative analysis of the advantages of DFP over other fingerprints in terms of computer time is beyond the scope of this work [the comparison will largely depend on the specific fingerprints compared, compound databases and computer(s) processors] is clear that DFP saves time because they are pre-calculated and stored for later applications.