Reliably assessing prediction reliability for high dimensional QSAR data

Huang, Jianping; Fan, Xiaohui

doi:10.1007/s11030-012-9415-9

Full-Length Paper
Published: 19 December 2012

Molecular Diversity, volume 17, pages 63–73 (2013)

Abstract

Predictability and prediction reliability are of utmost importance in characterizing a good quantitative structure–activity relationship (QSAR) model. However, validation methods alone are insufficient to guarantee the prediction reliability of QSAR models. Moreover, high dimensional samples pose a great challenge to the predictive power of traditional methods. This study therefore presents a predictive classifier, TreeEC, that can assess prediction reliability with high confidence, especially for high dimensional QSAR data. Two approaches for assessing prediction reliability are provided: applicability domain and prediction confidence. We demonstrate that the applicability domain has difficulty guaranteeing a model's prediction reliability: samples very close to the domain center are often predicted more poorly than those outside the domain. Prediction confidence is instead more promising for assessing prediction reliability. On a large data set assessed by prediction confidence, external samples assessed with a confidence greater than 95 % can be reliably predicted with an accuracy of 94 %, in contrast to the average accuracy of 84 %. We also illustrate, on 11 public data sets, that TreeEC is less affected by high dimensionality than other popular methods. A free version of TreeEC with a user-friendly interface can be downloaded from http://pharminfo.zju.edu.cn/computation/TreeEC/TreeEC.html.
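
To illustrate the two reliability measures named in the abstract, the following Python sketch compares overall accuracy with accuracy on high-confidence predictions, and computes a simple distance-based applicability-domain measure. It is not TreeEC: the abstract does not describe the algorithm, so the vote-based confidence formula |2p - 1|, the centroid-distance measure, all function names, and the toy data are assumptions made purely for illustration.

```python
from math import sqrt


def vote_confidence(votes):
    """Return (predicted_class, confidence) from a list of 0/1 ensemble votes.

    Assumed formula: confidence = |2p - 1|, where p is the fraction of votes
    for class 1. 0.0 is an evenly split ensemble, 1.0 a unanimous one.
    """
    p = sum(votes) / len(votes)
    predicted = 1 if p >= 0.5 else 0
    return predicted, abs(2 * p - 1)


def accuracy(pairs):
    """Fraction of (predicted, actual) pairs that agree; None if empty."""
    if not pairs:
        return None
    return sum(pred == actual for pred, actual in pairs) / len(pairs)


def accuracy_by_confidence(samples, threshold):
    """Return (overall accuracy, accuracy on samples above the threshold).

    `samples` is a list of (votes, actual_class) tuples.
    """
    all_pairs, confident_pairs = [], []
    for votes, actual in samples:
        predicted, conf = vote_confidence(votes)
        all_pairs.append((predicted, actual))
        if conf > threshold:
            confident_pairs.append((predicted, actual))
    return accuracy(all_pairs), accuracy(confident_pairs)


def distance_to_centroid(x, training_set):
    """Euclidean distance from descriptor vector x to the training centroid.

    A deliberately simple (assumed) applicability-domain measure.
    """
    n = len(training_set)
    centroid = [sum(col) / n for col in zip(*training_set)]
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, centroid)))


# Toy data, made up for illustration: votes of a 10-member ensemble.
samples = [
    ([1] * 10, 1),           # unanimous, correct
    ([0] * 10, 0),           # unanimous, correct
    ([1] * 6 + [0] * 4, 0),  # split vote, wrong
    ([1] * 4 + [0] * 6, 0),  # split vote, correct
]
overall, confident = accuracy_by_confidence(samples, threshold=0.95)
print("overall accuracy:", overall)
print("accuracy above 95 % confidence:", confident)
```

In this toy setup the split-vote samples fall below the threshold, so only the unanimous predictions count toward the high-confidence accuracy. The numbers say nothing about the results reported in the abstract; they only show how such a stratified comparison might be computed.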

Acknowledgments

This study was supported by the National S&T Major Project (No. 2012ZX09505001-001), the National Science Foundation of China (No. 81173465), the China Postdoctoral Science Foundation (No. 2012T50559), the Fundamental Research Funds for the Central Universities, and the Scientific Research Fund of Zhejiang Provincial Education Department (No. Y201122590).

Author information

Authors and Affiliations

Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, 866 YuHangTang Rd, Hangzhou, 310058, China
Jianping Huang & Xiaohui Fan

Corresponding author

Correspondence to Xiaohui Fan.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (doc 173 KB)

About this article

Cite this article

Huang, J., Fan, X. Reliably assessing prediction reliability for high dimensional QSAR data. Mol Divers 17, 63–73 (2013). https://doi.org/10.1007/s11030-012-9415-9

Received: 05 September 2012
Accepted: 03 December 2012
Published: 19 December 2012
Issue Date: February 2013
DOI: https://doi.org/10.1007/s11030-012-9415-9
