Effect of information leakage and method of splitting (rational and random) on external predictive ability and behavior of different statistical parameters of QSAR model

  • Original Research
  • Medicinal Chemistry Research

Abstract

A Quantitative Structure–Activity Relationship (QSAR) not only provides guidelines on the structural features responsible for biological activity but can also be used to predict the desired activity of untested chemicals prior to their synthesis. Appropriate validation of any QSAR model is therefore of utmost importance for judging its external predictive ability. Generally, internal validation and external validation (preferred by many) are used in the absence of a true external dataset. A model developed using the external method may not be reliable, because omitting some compounds, especially from a small dataset, may prevent it from capturing all the features essential to the particular SAR. In external validation, the dataset is split either rationally or randomly before descriptor selection. In the present study, rational splitting of the dataset was performed using a novel method, and its effect on the statistical parameters was analyzed. The analysis reveals that the predictive ability of a QSAR model is sensitive to (1) the method of splitting and (2) the distribution of the training and prediction sets. In addition, purposeful selection can be used to influence the statistical parameters; therefore, external validation based on a single split is insufficient to guarantee the true predictive ability of a QSAR model. Moreover, it appears that selecting descriptors prior to splitting (information leakage) plays little role in determining the external predictivity of the model. The present study indicates that as many statistical parameters as possible should be examined, along with boot-strapping, instead of relying on a single external validation.

Acknowledgments

We are thankful to the QSARINS and RapidMiner development teams for providing evaluation and free versions of their software. One of the authors (VHM) is thankful to Dr. Paola Gramatica, Italy, for providing QSARINS v1.2 and later versions.

Author information

Corresponding author

Correspondence to Vijay H. Masand.

Electronic supplementary material

Below is the link to the electronic supplementary material.

44_2014_1193_MOESM1_ESM.docx

Figures showing the distribution of the training and prediction sets for the different models and datasets, Figures S1–S3, and a brief explanation of the statistical symbols and the developed models are available in the supplementary file. (DOCX 49 kb)

About this article

Cite this article

Masand, V.H., Mahajan, D.T., Nazeruddin, G.M. et al. Effect of information leakage and method of splitting (rational and random) on external predictive ability and behavior of different statistical parameters of QSAR model. Med Chem Res 24, 1241–1264 (2015). https://doi.org/10.1007/s00044-014-1193-8

  • DOI: https://doi.org/10.1007/s00044-014-1193-8
