Reliably assessing prediction reliability for high dimensional QSAR data

  • Full-Length Paper
Molecular Diversity

Abstract

Predictability and prediction reliability are of utmost importance in characterizing a good quantitative structure–activity relationship (QSAR) model. However, validation methods are insufficient to guarantee the prediction reliability of QSAR models. Moreover, high-dimensional samples pose a great challenge to the predictive power of traditional methods. This study therefore presents a predictive classifier, TreeEC, that can assess prediction reliability with high confidence, especially for high-dimensional QSAR data. Two approaches for assessing prediction reliability are provided: applicability domain and prediction confidence. We demonstrate that the applicability domain has difficulty guaranteeing a model's prediction reliability, because samples very close to the domain center are often predicted more poorly than those outside the domain. Prediction confidence is more promising for assessing prediction reliability. On a large data set assessed by prediction confidence, external samples assigned a confidence greater than 95 % can be reliably predicted with an accuracy of 94 %, in contrast to the average accuracy of 84 %. We also illustrate, using 11 public data sets, that TreeEC is less affected by high dimensionality than other popular methods. A free version of TreeEC with a user-friendly interface can be downloaded from http://pharminfo.zju.edu.cn/computation/TreeEC/TreeEC.html.
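The idea of filtering predictions by confidence can be illustrated with a minimal sketch. The abstract does not state how TreeEC computes prediction confidence, so the definition used here (the fraction of ensemble members that agree with the majority vote) is an assumption, and the function names `vote_confidence` and `accuracy_above` as well as the toy votes are hypothetical.

```python
from collections import Counter


def vote_confidence(votes):
    """Return (majority label, fraction of ensemble members voting for it).

    Assumption: confidence is the share of agreeing votes. This is an
    illustrative definition, not necessarily the one used by TreeEC.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def accuracy_above(predictions, truths, threshold):
    """Accuracy over samples whose confidence is strictly above threshold.

    Returns (accuracy, number of samples kept); accuracy is None when
    no sample passes the threshold.
    """
    kept = [
        (label, truth)
        for (label, conf), truth in zip(predictions, truths)
        if conf > threshold
    ]
    if not kept:
        return None, 0
    correct = sum(1 for label, truth in kept if label == truth)
    return correct / len(kept), len(kept)


# Toy data (hypothetical): each row holds the votes of five ensemble members.
ensemble_votes = [
    [1, 1, 1, 1, 1],  # unanimous vote for class 1
    [1, 1, 1, 0, 0],  # narrow majority for class 1
    [0, 0, 0, 0, 0],  # unanimous vote for class 0
    [0, 1, 0, 1, 0],  # narrow majority for class 0
]
true_labels = [1, 0, 0, 0]

predictions = [vote_confidence(v) for v in ensemble_votes]
overall = accuracy_above(predictions, true_labels, 0.0)
confident = accuracy_above(predictions, true_labels, 0.95)

print("all samples:      ", overall)
print("confidence > 0.95:", confident)
```

In this toy case the only misclassified sample is one with a narrow majority, so restricting attention to high-confidence predictions raises the accuracy at the cost of covering fewer samples. That trade-off is the same kind of comparison the abstract reports (94 % versus 84 %).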

Acknowledgments

This study was supported by the National S&T Major Project (No. 2012ZX09505001-001), the National Science Foundation of China (No. 81173465), the China Postdoctoral Science Foundation (No. 2012T50559), the Fundamental Research Funds for the Central Universities, and the Scientific Research Fund of Zhejiang Provincial Education Department (No. Y201122590).

Author information

Corresponding author

Correspondence to Xiaohui Fan.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (doc 173 KB)

About this article

Cite this article

Huang, J., Fan, X. Reliably assessing prediction reliability for high dimensional QSAR data. Mol Divers 17, 63–73 (2013). https://doi.org/10.1007/s11030-012-9415-9

  • DOI: https://doi.org/10.1007/s11030-012-9415-9
