Practical application of the Average Information Content Maximization (AIC-MAX) algorithm: selection of the most important structural features for serotonin receptor ligands

Warszycki, Dawid; Śmieja, Marek; Kafel, Rafał

doi:10.1007/s11030-017-9729-8

Original Article
Open access
Published: 09 February 2017

Molecular Diversity, volume 21, pages 407–412 (2017)

Dawid Warszycki¹,
Marek Śmieja² &
Rafał Kafel¹

Abstract

The Average Information Content Maximization algorithm (AIC-MAX), based on mutual information maximization, was recently introduced to select the most discriminatory features. Here, this methodology was applied to select the most significant bits from the Klekota-Roth fingerprint for serotonin receptor ligands, as well as to select the most important features for distinguishing ligands with activity for one receptor versus another. Interpretation of the selected bits indicated the most important structural features of the analyzed ligands in terms of activity and selectivity, and in machine-learning experiments the reduced representations outperformed the raw fingerprints. Moreover, the AIC-MAX methodology applied here for serotonin receptor ligands can also be applied to other target classes.

Introduction

Fingerprints, which represent the structure of a chemical compound as a bit string, have been widely used in cheminformatics for many years [1,2,3,4,5,6,7,8,9]. They encode structural features into a bit string, where a value of “1” denotes the presence of a given pattern and “0” indicates its absence. The encoding is based on either structural keys or graph representations. Structural fingerprints are only one of the methods applied to extract the features that determine selectivity and/or activity. However, alternatives such as pharmacophore modelling and interaction fingerprints are much more time-consuming because they require several additional steps, such as conformer generation, compound mapping, and docking. Moreover, because the definitions of pharmacophore features and interaction patterns are very broad, an exhaustive statistical analysis of the selected features would be ambiguous [10,11,12]. Although the fingerprints with the highest bit count perform well in virtual screening campaigns [13], irrelevant bits in the representation increase the computational cost of any calculation and introduce informational noise. Reducing fingerprint length without information loss has therefore become an important challenge for cheminformatics. Several methodologies, e.g., consensus fingerprints [14], bit scaling [15], reverse fingerprints [16] and bit silencing [17], reduce fingerprints by weighting particular bits. An approach proposed by Nisius et al. [18] selects fingerprint bits according to their discrimination power, which is measured by the Kullback–Leibler divergence. Herein, we present the application of the Average Information Content Maximization algorithm (AIC-MAX) as another solution for fingerprint reduction and hybridization, in a case study of selecting the most important structural features for serotonin receptor ligands.

Materials and methods

To resolve the aforementioned difficulties with the application of high-resolution fingerprints, the AIC-MAX algorithm [19] was recently introduced to select features with the highest discriminatory potential in virtual screening-like experiments. AIC-MAX uses mutual information normalized by the Shannon entropy to rank a group of features ${X}=\{{X}_{1}, {\ldots }, {X}_{\mathrm{N}}\}$ by their significance with respect to the activity label $Y=\{y\}$.

$$\mathrm{AIC}_y(X)=\frac{\sum\limits_{x\in S_{\mathrm{N}}}\sum\limits_{y\in\{0,1\}} P_i(x;y)\log_2\frac{P_i(x;y)}{P_i(x)P_i(y)}}{-\sum\limits_{y\in\{0,1\}} P_i(y)\log_2 P_i(y)}$$

where ${S}_\mathrm{N}=\{0,1\}^{\mathrm{N}}$ is the set of binary sequences (fingerprints) of length N, and ${P}_{{i}}({y})$, ${P}_{{i}}({x})$ and ${P}_{{i}}(x;y)$ denote the probabilities of the events $\{{Y}_{{i}}={y}\}$, $\{{X}_{1}={x}_{1}, {\ldots }, {X}_{\mathrm{N}}={x}_{\mathrm{N}}\}$ and $\{{X}_{1} = {x}_{1}, {\ldots }, {X}_{\mathrm{N}}={x}_{\mathrm{N}}, {Y}_{{i}}={y}\}$, respectively.
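The ratio above can be estimated from data by replacing the probabilities with empirical frequencies. The following is a minimal sketch under that assumption, not the authors' implementation; the function name `aic` and the toy fingerprints are hypothetical.

```python
from collections import Counter
from math import log2

def aic(fingerprints, labels):
    """Mutual information between bit patterns and labels, divided by the
    Shannon entropy of the labels (empirical frequencies as probabilities)."""
    n = len(labels)
    patterns = [tuple(fp) for fp in fingerprints]
    joint = Counter(zip(patterns, labels))   # counts of (x, y)
    px = Counter(patterns)                   # counts of x
    py = Counter(labels)                     # counts of y
    mi = sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), c in joint.items())
    entropy = -sum((c / n) * log2(c / n) for c in py.values())
    return mi / entropy if entropy > 0 else 0.0

# Toy data (made up): the first bit equals the label, the second is unrelated.
fps = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [1, 1, 0, 0]

both_bits = aic(fps, labels)                          # group of two bits
second_bit = aic([(fp[1],) for fp in fps], labels)    # uninformative bit alone
```

On this toy set the two-bit group fully determines the label, so the ratio reaches its maximum of 1, while the second bit alone carries no information about the label and scores 0.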

The algorithm extends the application of existing techniques [14,15,16,17,18, 20] and allows the construction of a joint reduced representation for several biological targets [19]. In this paper, we apply AIC-MAX to analyze the most significant features (determining activity) of 14 serotonin receptors and construct various reduced representations that are able to distinguish their ligands.

Among the popular fingerprints [21,22,23,24,25], the Klekota-Roth fingerprint (KRFP) was selected because of its high resolution (4860 bits) and because it is not hashed, so each bit corresponds to one exact structural feature. This fingerprint was generated with the PaDEL-Descriptor software [23] for compounds stored in the ChEMBL database [26] with a determined affinity for any serotonin receptor (5-$\hbox {HT}_{1\mathrm{A}}\hbox {R}$, 5-$\hbox {HT}_{1\mathrm{B}}\hbox {R}$, 5-$\hbox {HT}_{1\mathrm{D}}\hbox {R}$, 5-$\hbox {HT}_{1\mathrm{F}}\hbox {R}$, 5-$\hbox {HT}_{2\mathrm{A}}\hbox {R}$, 5-$\hbox {HT}_{2\mathrm{B}}\hbox {R}$, 5-$\hbox {HT}_{2\mathrm{C}}\hbox {R}$, 5-$\hbox {HT}_{4}\hbox {R}$, 5-$\hbox {HT}_{5\mathrm{A}}\hbox {R}$, 5-$\hbox {HT}_{6}\hbox {R}$, 5-$\hbox {HT}_{7}\hbox {R}$). Compounds with activity data for a particular serotonin receptor were divided into an active set ($K_{{i}}$ or equivalent below 100 nM) and an inactive set ($K_{{i}}$ or equivalent higher than 1000 nM, Table 1) according to a previously utilized methodology [10].
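The activity thresholds described above can be expressed as a small labelling function. This is an illustrative sketch only: the function name, the compound identifiers and the affinity values are hypothetical, and leaving compounds between the two thresholds unlabelled is an assumption implied, but not stated, by the text.

```python
def label_by_affinity(ki_nm, active_below=100.0, inactive_above=1000.0):
    """Return 'active', 'inactive', or None for a Ki (or equivalent) in nM."""
    if ki_nm < active_below:
        return "active"
    if ki_nm > inactive_above:
        return "inactive"
    return None  # intermediate affinity: assumed to be left out of both sets

# Hypothetical records: (compound id, Ki in nM)
records = [("cpd-1", 3.2), ("cpd-2", 100.0), ("cpd-3", 450.0), ("cpd-4", 2500.0)]
labelled = {cid: label_by_affinity(ki) for cid, ki in records}
```

Keeping a gap between the two thresholds avoids assigning opposite labels to compounds whose measured affinities differ only slightly.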

Table 1 Number of active and inactive compounds for serotonin receptors retrieved from the ChEMBL database

Results and Discussion

The AIC-MAX algorithm selected one hundred bits for each target (a number optimized in a previous study [19]). In total, only 242 different bits ($\sim $5% of the KRFP bits) covered the structures of all studied actives, indicating a relatively high level of similarity among the ligands of serotonin receptors. Excluding KRFP bits that introduced only noise (e.g., those encoding simple aliphatic chains), there were 29 different substructures common to the ligands of all serotonin receptors, among which 8 bits characterized fragments with a polarizable nitrogen atom and 5 an aromatic system, the two main pharmacophore features of 5-HTR ligands [27]. Moreover, the bit encoding an amide bond (#839) was indicated as crucial for all receptors, yet more specific bits for particular receptors were also found, such as the phenylsulfonylamide fragment (#4326) for ligands of 5-$\hbox {HT}_{6}\hbox {R}$ and o-methoxyphenyl (#4541) for 5-$\hbox {HT}_{1\mathrm{A}}\hbox {R}$ (Fig. 1).

In the second experiment, AIC-MAX was applied to select the most important features for distinguishing ligands with activity specific to one receptor versus another. The procedure was repeated for all pairs of receptors (66 times). The resulting set of “selective features” could be applied to search for selective ligands, which is an essential goal of 5-HTR ligand research. Analysis of the 5-$\hbox {HT}_{\mathrm{1A}}\hbox {R}$ ligands revealed 297 bits (Fig. 2) that can be applied in selectivity studies. Among them, 16 unique bits (#438, #467, #620, #647, #677, #2265, #3157, #3179, #3402, #3682, #3788, #3892, #3943, #4294 and #4295) were selected in every experiment against each of the other serotonin receptors. Some of these fragments can be described as noise; however, five bits encoded an aliphatic amine. Moreover, very characteristic structural features of 5-$\hbox {HT}_{\mathrm{1A}}\hbox {R}$ ligands, such as piperidine (#3157) and piperazine (#3179) moieties, were also found within this bit collection, confirming previous observations [10]. The algorithm also indicated a crucial role for the amide fragment (#2265), which is highly abundant in 5-$\hbox {HT}_{\mathrm{1A}}\hbox {R}$ ligands. Analysis of the most discriminative bits for the remaining receptors (see Supplementary Materials) also revealed structural features that are typical for those receptors, usually including secondary and tertiary amine groups and different aromatic systems.
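The bookkeeping behind this pairwise analysis can be sketched as follows: enumerate all receptor pairs, then, for one receptor, take the union of the bits selected in its experiments and the intersection of bits selected in every one of them. The receptor names and bit numbers below are made up for illustration and are not the published results.

```python
from itertools import combinations

receptors = ["R1", "R2", "R3", "R4"]          # hypothetical receptor names
pairs = list(combinations(receptors, 2))      # n * (n - 1) / 2 unordered pairs

# Hypothetical output of a feature selector for each pairwise experiment.
selected = {
    ("R1", "R2"): {1, 2, 3, 7},
    ("R1", "R3"): {1, 2, 4},
    ("R1", "R4"): {1, 2, 3, 5},
    ("R2", "R3"): {6},
    ("R2", "R4"): {2, 6},
    ("R3", "R4"): {4, 5},
}

def summarize(receptor, selected_bits):
    """Return (bits selected in any experiment, bits selected in every
    experiment) over all pairs that involve the given receptor."""
    sets = [bits for pair, bits in selected_bits.items() if receptor in pair]
    if not sets:
        return set(), set()
    return set.union(*sets), set.intersection(*sets)

any_bits, always_bits = summarize("R1", selected)
```

In this toy case the union plays the role of the full set of selectivity bits for a receptor, and the intersection corresponds to the bits chosen against each of the other receptors.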

To evaluate the potential of the selective bits, machine-learning experiments were conducted with the random forest method [28] (see Supplementary Materials for details of the experimental settings), aimed at separating compounds that act on an individual target from those acting on other targets. Classification results were measured by the Matthews Correlation Coefficient (MCC), which is a well-known validation index, especially for imbalanced data sets [29]. MCC takes values from −1 to $+$1, where $+$1 represents perfect prediction, 0 represents random prediction, and −1 represents an inverse prediction. The results were compared with data obtained for the original (raw) KRFP fingerprint.
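For reference, MCC can be computed directly from the four cells of a binary confusion matrix. The sketch below uses the standard formula; returning 0 when the denominator vanishes is a common convention assumed here, and the counts are made up.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention: undefined cases are scored as random
    return (tp * tn - fp * fn) / denom

perfect = mcc(tp=50, tn=50, fp=0, fn=0)
inverse = mcc(tp=0, tn=0, fp=50, fn=50)
# Imbalanced set (10 actives, 90 inactives), everything predicted inactive:
# accuracy is 90%, yet MCC shows the classifier has learned nothing.
trivial = mcc(tp=0, tn=90, fp=0, fn=10)
```

The last case illustrates why MCC is preferred over accuracy when actives and inactives are unevenly represented.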

The results (Fig. 3) indicate that the reduced fingerprint is not only faster, but also more accurate than the original KRFP fingerprint: in 44 out of 66 cases, the MCC value increased. This observation was supported by a statistical analysis performed with the Wilcoxon signed-rank test [30]. At the 0.05 significance level, the test gave no reason to reject the hypothesis that the reduced representation outperforms the classical KRFP fingerprint in the classification experiment. Improvement of the results was observed most frequently for the 5-$\hbox {HT}_{\mathrm{5A}}\hbox {R}$ ligands (10 of 11 instances) and least frequently for the 5-$\hbox {HT}_{\mathrm{2A}}\hbox {R}$ ligands (5 of 11 instances). The former result could be attributed to the unique structures of compounds with affinity for 5-$\hbox {HT}_{\mathrm{5A}}\hbox {R}$ in comparison with other receptor ligands, but it is in fact due to their relatively small number: such a small set of actives usually covers a very limited chemical space, so the reduced fingerprint consists of unique bits, which makes high scores easier to achieve in discrimination experiments. Additionally, the 5-$\hbox {HT}_{\mathrm{2A}}\hbox {R}$ ligands are often multipotent compounds [31].
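A paired comparison of this kind can be sketched with a plain implementation of the Wilcoxon signed-rank statistic. The version below is an illustration under several assumptions that the text does not confirm: zero differences are dropped, tied absolute differences receive average ranks, the p-value is one-sided and uses the normal approximation without tie or continuity correction, and the scores are invented (MCC values multiplied by 100 and rounded, so ties are exact).

```python
from math import erf, sqrt

def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus, one-sided p-value for 'x tends to exceed y')."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend j over the group of tied absolute differences
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1                          # average 1-based rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_greater = 0.5 * (1 - erf(z / sqrt(2)))           # upper normal tail
    return w_plus, w_minus, p_greater

# Made-up paired scores for six experiments: reduced vs raw fingerprint.
reduced = [82, 75, 91, 60, 88, 70]
raw = [78, 76, 85, 52, 80, 69]
w_plus, w_minus, p = wilcoxon_signed_rank(reduced, raw)
```

The normal approximation is rough for so few pairs; with the 66 paired experiments reported here it would be more appropriate, and an established statistics library would normally be used instead of a hand-written routine.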

Experimental studies confirmed that, because the AIC-MAX algorithm maximizes the discriminatory power of a group of bits (not only the potential of each bit individually), the resulting representation contains enough information to characterize active compounds as well as the original KRFP fingerprint does. Therefore, it can be applied in a wide spectrum of screening applications aimed at a particular target, as well as in searching for selective compounds, which is one of the most important challenges in computer-aided drug design.
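The difference between scoring a group of bits and scoring each bit individually can be illustrated on a toy example. The sketch below uses greedy forward selection as the search strategy, which is an assumption made for illustration (the search procedure is not described here); the function names and data are hypothetical.

```python
from collections import Counter
from math import log2

def aic(fps, labels, bits):
    """Normalized mutual information between the chosen bits and the labels."""
    n = len(labels)
    patterns = [tuple(fp[b] for b in bits) for fp in fps]
    joint, px, py = Counter(zip(patterns, labels)), Counter(patterns), Counter(labels)
    mi = sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), c in joint.items())
    entropy = -sum((c / n) * log2(c / n) for c in py.values())
    return mi / entropy if entropy > 0 else 0.0

def greedy_select(fps, labels, k):
    """Add, one at a time, the bit that most increases the group score."""
    chosen, remaining = [], list(range(len(fps[0])))
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda b: aic(fps, labels, chosen + [b]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: bit 1 duplicates bit 0; the label is (bit 0 OR bit 2).
fps = [(1, 1, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1),
       (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

by_group = greedy_select(fps, labels, 2)
# Ranking bits one by one keeps the redundant duplicate instead.
by_single = sorted(range(3), key=lambda b: -aic(fps, labels, [b]))[:2]
```

Here the individual ranking keeps bit 0 and its duplicate, which adds nothing, whereas the group-based selection pairs bit 0 with the weaker but complementary bit 2 and fully separates the two classes.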

Reduced fingerprints should be especially useful in machine-learning experiments, where applying the conclusions of previous studies [32, 33] should ensure outstanding results.

Conclusion

In this paper, we presented the application of the AIC-MAX algorithm to identify the most significant chemical patterns in the fingerprint representation of serotonin receptor ligands. Moreover, we demonstrated the performance of the AIC-MAX algorithm in selecting the most important substructures for distinguishing ligands of two closely related receptors, which is one of the most demanding challenges in computer-aided drug design. The experimental studies confirmed that AIC-MAX is able to produce a reduced representation that preserves almost all meaningful information contained in the original KRFP fingerprint, enables efficient numerical computations, and outperforms the original fingerprint.

References

1. Cereto-Massagué A, Ojeda MJ, Valls C, Mulero M, Garcia-Vallvé S, Pujadas G (2015) Molecular fingerprint similarity search in virtual screening. Methods 71:58–63. doi:10.1016/j.ymeth.2014.08.005
2. Kurczab R, Nowak M, Chilmonczyk Z, Sylte I, Bojarski AJ (2010) The development and validation of a novel virtual screening cascade protocol to identify potential serotonin 5-HT(7)R antagonists. Bioorg Med Chem Lett 20:2465–2468. doi:10.1016/j.bmcl.2010.03.012
3. Zajdel P, Kurczab R, Grychowska K, Satała G, Pawłowski M, Bojarski AJ (2012) The multiobjective based design, synthesis and evaluation of the arylsulfonamide/amide derivatives of aryloxyethyl- and arylthioethyl- piperidines and pyrrolidines as a novel class of potent 5-HT7 receptor antagonists. Eur J Med Chem 56:348–360. doi:10.1016/j.ejmech.2012.07.043
4. Gabrielsen M, Kurczab R, Siwek A, Ravna AW, Kristiansen K, Kufareva I, Abagyan R, Nowak G, Sylte I, Bojarski AJ (2014) Identification of novel serotonin transporter compounds by virtual screening. J Chem Inf Model 54:933–943. doi:10.1021/ci400742s
5. Smusz S, Kurczab R, Satała G, Bojarski AJ (2015) Fingerprint-based consensus virtual screening towards structurally new 5-HT6R ligands. Bioorg Med Chem Lett 25:1827–1830. doi:10.1016/j.bmcl.2015.03.049
6. Staroń J, Warszycki D, Kalinowska-Tłuścik J, Satała G, Bojarski AJ (2015) Rational design of 5-HT6R ligands using a bioisosteric strategy: synthesis, biological evaluation and molecular modelling. RSC Adv 5:25806–25815. doi:10.1039/C5RA00054H
7. Smusz S, Czarnecki WM, Warszycki D, Bojarski AJ (2014) Exploiting uncertainty measures in compounds activity prediction using support vector machines. Bioorg Med Chem Lett 25:100–105. doi:10.1016/j.bmcl.2014.11.005
8. Witek J, Smusz S, Rataj K, Mordalski S, Bojarski AJ (2014) An application of machine learning methods to structural interaction fingerprints—a case study of kinase inhibitors. Bioorg Med Chem Lett 24:580–585. doi:10.1016/j.bmcl.2013.12.017
9. Czarnecki WM, Tabor J (2015) Multithreshold entropy linear classifier: theory and applications. Expert Syst Appl 42:5591–5606. doi:10.1016/j.eswa.2015.03.007
10. Warszycki D, Mordalski S, Kristiansen K, Kafel R, Sylte I, Chilmonczyk Z, Bojarski AJ (2013) A linear combination of pharmacophore hypotheses as a new tool in search of new active compounds—an application for 5-HT1A receptor ligands. PLoS One 8:e84510. doi:10.1371/journal.pone.0084510
11. Kurczab R, Bojarski AJ (2013) New strategy for receptor-based pharmacophore query construction: a case study for 5-HT7 receptor ligands. J Chem Inf Model 53:3233–3243. doi:10.1021/ci4005207
12. Mordalski S, Kosciolek T, Kristiansen K, Sylte I, Bojarski AJ (2011) Protein binding site analysis by means of structural interaction fingerprint patterns. Bioorg Med Chem Lett 21:6816–6819. doi:10.1016/j.bmcl.2011.09.027
13. Sastry M, Lowrie JF, Dixon SL, Sherman W (2010) Large-scale systematic analysis of 2D fingerprint methods and parameters to improve virtual screening enrichments. J Chem Inf Model 50:771–784. doi:10.1021/ci100062n
14. Shemetulskis NE, Weininger D, Blankley CJ, Yang JJ, Humblet C (1996) Stigmata: an algorithm to determine structural commonalities in diverse datasets. J Chem Inf Comput Sci 36:862–871
15. Xue L, Stahura FL, Bajorath J (2004) Similarity search profiling reveals effects of fingerprint scaling in virtual screening. J Chem Inf Comput Sci 44:2032–2039. doi:10.1021/ci0400819
16. Williams C (2006) Reverse fingerprinting, similarity searching by group fusion and fingerprint bit importance. Mol Divers 10:311–332. doi:10.1007/s11030-006-9039-z
17. Wang Y, Bajorath J (2008) Bit silencing in fingerprints enables the derivation of compound class-directed similarity metrics. J Chem Inf Model 48:1754–1759. doi:10.1021/ci8002045
18. Nisius B, Vogt M, Bajorath J (2009) Development of a fingerprint reduction approach for Bayesian similarity searching based on Kullback–Leibler divergence analysis. J Chem Inf Model 49:1347–1358. doi:10.1021/ci900087y
19. Śmieja M, Warszycki D (2016) Average information content maximization—a new approach for fingerprint hybridization and reduction. PLoS One 11:e0146666. doi:10.1371/journal.pone.0146666
20. Nisius B, Bajorath J (2010) Reduction and recombination of fingerprints of different design increase compound recall and the structural diversity of hits. Chem Biol Drug Des 75:152–160. doi:10.1111/j.1747-0285.2009.00930.x
21. Hall LH, Kier LB (1995) Electrotopological state indices for atom types: a novel combination of electronic, topological, and valence state information. J Chem Inf Model 35:1039–1045. doi:10.1021/ci00028a014
22. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E (2003) The chemistry development kit (CDK): an open-source java library for chemo- and bioinformatics. J Chem Inf Comput Sci 43:493–500. doi:10.1021/ci025584y
23. Yap CW (2011) PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem 32:1466–1474. doi:10.1002/jcc.21707
24. Klekota J, Roth FP (2008) Chemical substructures that enrich for biological activity. Bioinformatics 24:2518–2525. doi:10.1093/bioinformatics/btn479
25. Ewing T, Baber JC, Feher M (2006) Novel 2D fingerprints for ligand-based virtual screening. J Chem Inf Model 46:2423–2431. doi:10.1021/ci060155b
26. Bento AP, Gaulton A, Hersey A, Bellis LJ, Chambers J, Davies M, Krüger FA, Light Y, Mak L, McGlinchey S, Nowotka M, Papadatos G, Santos R, Overington JP (2014) The ChEMBL bioactivity database: an update. Nucleic Acids Res 42:D1083–D1090. doi:10.1093/nar/gkt1031
27. Hibert MF, Gittos MW, Middlemiss DN, Mir AK, Fozard JR (1988) Graphics computer-aided receptor mapping as a predictive tool for drug design: development of potent, selective, and stereospecific ligands for the 5-HT1A receptor. J Med Chem 31:1087–1093. doi:10.1021/jm00401a007
28. Breiman L (2001) Random forests. Mach Learn 45:5–32. doi:10.1023/A:1010933404324
29. Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27:861–874. doi:10.1016/j.patrec.2005.10.010
30. Alpaydin E (2014) Introduction to machine learning. MIT Press, Cambridge
31. Warszycki D, Mordalski S, Staroń J, Bojarski AJ (2015) Bioisosteric matrices for ligands of serotonin receptors. ChemMedChem 10:601–605. doi:10.1002/cmdc.201402563
32. Smusz S, Kurczab R, Bojarski AJ (2013) The influence of the inactives subset generation on the performance of machine learning methods. J Cheminform 5:17–25. doi:10.1186/1758-2946-5-17
33. Kurczab R, Smusz S, Bojarski AJ (2014) The influence of negative training set size on machine learning-based virtual screening. J Cheminform 6:32–40. doi:10.1186/1758-2946-6-32

Acknowledgements

The work was supported by the National Science Centre (Poland) Grants No. 2016/21/D/ST6/00980 and 2016/21/N/NZ25/01725 and by the Polish-Norwegian Research Programme operated by the National Centre for Research and Development under the Norwegian Financial Mechanism 2009–2014 in the frame of the Project PLATFORMex (Pol-Nor/198887/73/2013). We would also like to thank Professor Andrzej Bojarski for his invaluable contribution, discussions and criticism regarding our work.

Author information

Authors and Affiliations

Institute of Pharmacology, Polish Academy of Sciences, Smetna street 12, 31-343, Kraków, Poland
Dawid Warszycki & Rafał Kafel
Faculty of Mathematics and Computer Science, Jagiellonian University, 6 Lojasiewicza Street, 30-348, Kraków, Poland
Marek Śmieja

Corresponding author

Correspondence to Dawid Warszycki.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (docx 1023 KB)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Warszycki, D., Śmieja, M. & Kafel, R. Practical application of the Average Information Content Maximization (AIC-MAX) algorithm: selection of the most important structural features for serotonin receptor ligands. Mol Divers 21, 407–412 (2017). https://doi.org/10.1007/s11030-017-9729-8

Received: 06 October 2016
Accepted: 16 January 2017
Published: 09 February 2017
Issue Date: May 2017
DOI: https://doi.org/10.1007/s11030-017-9729-8
