Abstract
Predictability and prediction reliability are of utmost importance in characterizing a good quantitative structure–activity relationship (QSAR) model. However, validation methods alone are insufficient to guarantee the prediction reliability of QSAR models. Moreover, high dimensional samples pose a great challenge to the predictive power of traditional methods. This study therefore presents a predictive classifier, TreeEC, that can assess prediction reliability with high confidence, especially for high dimensional QSAR data. Two approaches for assessing prediction reliability are provided: applicability domain and prediction confidence. We demonstrate that the applicability domain struggles to guarantee a model's prediction reliability, since samples very close to the domain center are often predicted more poorly than those outside the domain. Prediction confidence is a more promising way to assess prediction reliability. On a large data set, external samples assigned a prediction confidence greater than 95 % were predicted with an accuracy of 94 %, compared with an average accuracy of 84 %. We also show, on 11 public data sets, that TreeEC is less affected by high dimensionality than other popular methods. A free version of TreeEC with a user-friendly interface can be downloaded from http://pharminfo.zju.edu.cn/computation/TreeEC/TreeEC.html.
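The comparison between high-confidence accuracy and average accuracy can be illustrated with a minimal sketch. It assumes that confidence is the fraction of ensemble members voting for the majority class; this is an illustrative assumption, not necessarily the definition used by TreeEC. The function names and the toy votes are hypothetical.

```python
from collections import Counter


def predict_with_confidence(votes):
    """Return (majority label, fraction of votes for it) for one sample.

    Assumption: confidence is the majority vote fraction of an ensemble.
    TreeEC may define prediction confidence differently.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def accuracy_by_confidence(all_votes, true_labels, threshold=0.95):
    """Compare overall accuracy with accuracy on high-confidence samples."""
    results = [predict_with_confidence(v) for v in all_votes]
    hits = [pred == truth for (pred, _), truth in zip(results, true_labels)]
    # Keep only the samples whose confidence exceeds the threshold.
    high = [hit for hit, (_, conf) in zip(hits, results) if conf > threshold]
    overall = sum(hits) / len(hits)
    high_acc = sum(high) / len(high) if high else None
    return overall, high_acc, len(high)


# Made-up votes from a four-member ensemble for four external samples.
toy_votes = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0], [0, 1, 1, 1]]
toy_truth = [1, 0, 0, 0]
overall, high_acc, n_high = accuracy_by_confidence(toy_votes, toy_truth)
print(overall, high_acc, n_high)
```

Reporting the number of samples that pass the threshold alongside the two accuracies matters, because a high-confidence subset that covers only a few samples says little about the model as a whole.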
Acknowledgments
This study was supported by the National S&T Major Project (No. 2012ZX09505001-001), the National Science Foundation of China (No. 81173465), the China Postdoctoral Science Foundation (No. 2012T50559), the Fundamental Research Funds for the Central Universities, and the Scientific Research Fund of Zhejiang Provincial Education Department (No. Y201122590).
About this article
Cite this article
Huang, J., Fan, X. Reliably assessing prediction reliability for high dimensional QSAR data. Mol Divers 17, 63–73 (2013). https://doi.org/10.1007/s11030-012-9415-9
DOI: https://doi.org/10.1007/s11030-012-9415-9