Abstract
Quantitative structure–activity relationship (QSAR) analysis not only provides guidelines on the structural features responsible for biological activity but can also be used to predict the desired activity of untested chemicals prior to synthesis. Appropriate validation of any QSAR model is therefore of utmost importance for judging its external predictive ability. In the absence of a true external dataset, internal validation and external validation (preferred by many) are generally used. A model developed using the external method may not be reliable because the omission of some compounds, especially in small datasets, may prevent it from capturing all the features essential to the particular SAR. In external validation, the dataset is split either rationally or randomly before descriptor selection. In the present study, the dataset was split rationally using a novel method, and the effect on statistical parameters was analyzed. The analysis reveals that the predictive ability of a QSAR model is sensitive to (1) the method of splitting and (2) the distribution of the training and prediction sets. In addition, purposeful selection can be used to influence the statistical parameters; therefore, external validation based on a single split is insufficient to guarantee the true predictive ability of a QSAR model. Moreover, selecting descriptors prior to splitting (information leakage) appears to play little role in determining the external predictivity of the model. The present study indicates that as many statistical parameters as possible should be examined, along with bootstrapping, instead of relying on a single external validation.
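The sensitivity of external validation to the choice of split can be illustrated with a minimal sketch. It is not the novel rational splitting method of this study, nor a bootstrap procedure; it simply repeats random splitting many times and reports the spread of one commonly used external metric, often written Q²F1. The synthetic descriptors and activities, the ordinary least-squares model, and the function names (q2_f1, fit_predict, repeated_split_q2) are all assumptions made for illustration.

```python
import numpy as np


def q2_f1(y_train, y_test, y_pred):
    """External Q2 (F1 form): 1 - PRESS / SS of test values about the training mean."""
    press = np.sum((y_test - y_pred) ** 2)
    ss = np.sum((y_test - y_train.mean()) ** 2)
    return 1.0 - press / ss


def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept, solved with numpy.linalg.lstsq."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    return A_test @ coef


def repeated_split_q2(X, y, n_splits=200, test_fraction=0.25, seed=0):
    """Return the external Q2 obtained for each of n_splits random splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(2, int(round(test_fraction * n)))
    scores = np.empty(n_splits)
    for i in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        y_pred = fit_predict(X[train], y[train], X[test])
        scores[i] = q2_f1(y[train], y[test], y_pred)
    return scores


# Synthetic stand-in for a small QSAR dataset (40 compounds, 3 descriptors).
# These values are invented for illustration and are not data from the study.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(40, 3))
y = X @ np.array([1.0, -0.5, 0.3]) + data_rng.normal(scale=0.5, size=40)

scores = repeated_split_q2(X, y)
print(f"Q2_F1 over {len(scores)} random splits: "
      f"min={scores.min():.2f}, median={np.median(scores):.2f}, max={scores.max():.2f}")
```

Reporting the minimum, median, and maximum over many splits, rather than the value from one split, shows how far a single external validation figure may sit from the typical one for the same dataset and model.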
Acknowledgments
We are thankful to the QSARINS and RapidMiner development teams for providing evaluation and free versions of their software. One of the authors (VHM) is thankful to Dr. Paola Gramatica, Italy, for providing QSARINS v1.2 and later versions.
Electronic supplementary material
Below is the link to the electronic supplementary material.
44_2014_1193_MOESM1_ESM.docx
The supplementary file provides figures showing the distribution of the training and prediction sets for the different models for the datasets (Figures S1–S3), along with a brief explanation of the statistical symbols and the developed models. (DOCX 49 kb)
Cite this article
Masand, V.H., Mahajan, D.T., Nazeruddin, G.M. et al. Effect of information leakage and method of splitting (rational and random) on external predictive ability and behavior of different statistical parameters of QSAR model. Med Chem Res 24, 1241–1264 (2015). https://doi.org/10.1007/s00044-014-1193-8
DOI: https://doi.org/10.1007/s00044-014-1193-8