Abstract
The proposed algorithm has the following characteristics: (a) it uses a new formula for the quality of the QSPRs; (b) the outlier (atypical) character is defined using a classic criterion; (c) the condition for elimination of the outliers includes the quality of the equation; (d) only ‘the most atypical’ molecule is eliminated, after which all calculations are automatically repeated; (e) the elimination of outliers stops if the condition for elimination is not fulfilled or if the number of eliminated molecules exceeds a predetermined limit. The second situation in (e) was encountered once in the four examples discussed. The number of descriptors in ‘the best’ equation and the number of outliers removed cannot be predicted a priori. The paper also proposes a criterion for the identification of ‘outliers for lead hopping’; there were no molecules of this type in the four examples discussed. The initial number of molecules in the calibration sets was 50, 60, 133, and 54, respectively; the number of descriptors in ‘the best’ equations was 5, 5, 9, and 9, respectively; and the number of eliminated outliers was 0, 0, 8, and 6, respectively. When outliers were present, the best equation obtained with the outliers and the best equation obtained without them were very different.
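The loop described in (b) to (e) can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the new quality formula, the exact outlier criterion, or the exact elimination condition, so the code assumes a one-descriptor least-squares line, uses R² as a stand-in quality measure, flags a point when its absolute residual exceeds `k` residual standard deviations, and accepts a removal only if the refitted quality improves. The names `fit_line`, `r_squared`, `eliminate_outliers`, `k`, and `max_removed` are all hypothetical.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(xs, ys, slope, intercept):
    """Stand-in quality measure (assumption; the paper uses its own formula)."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def eliminate_outliers(xs, ys, k=2.0, max_removed=3):
    """Remove one 'most atypical' point at a time, refitting after each removal."""
    xs, ys = list(xs), list(ys)
    idx = list(range(len(xs)))          # original positions of remaining points
    removed = []
    while True:
        slope, intercept = fit_line(xs, ys)
        quality = r_squared(xs, ys, slope, intercept)
        if len(removed) >= max_removed or len(xs) <= 3:
            break                       # predetermined limit reached
        res = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
        s = sqrt(sum(r * r for r in res) / (len(xs) - 2))
        worst = max(range(len(xs)), key=lambda i: abs(res[i]))
        if abs(res[worst]) <= k * s:
            break                       # no point is atypical enough
        # Trial refit without the most atypical point.
        xs2, ys2 = xs[:worst] + xs[worst + 1:], ys[:worst] + ys[worst + 1:]
        q2 = r_squared(xs2, ys2, *fit_line(xs2, ys2))
        if q2 <= quality:
            break                       # removal does not improve the equation
        removed.append(idx[worst])
        xs, ys, idx = xs2, ys2, idx[:worst] + idx[worst + 1:]
    return removed, (slope, intercept), quality

# Synthetic data (not from the paper): y = 2x + 1 with small alternating
# noise and one deliberately displaced point at position 5.
xs = list(range(12))
ys = [2 * x + 1 + (0.1 if x % 2 == 0 else -0.1) for x in xs]
ys[5] += 10.0
removed, (slope, intercept), quality = eliminate_outliers(xs, ys)
```

Removing a single point per pass matters because one strongly atypical point can distort the fit enough to make ordinary points look atypical; refitting after each removal re-evaluates the remaining points against the corrected equation.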
Ethics declarations
Conflict of interest
There is no conflict of interest.
About this article
Cite this article
Tarko, L. Novel criteria for elimination of the outliers in QSPR studies, when the ‘forward stepwise’ procedure is used. J Math Chem 57, 1770–1796 (2019). https://doi.org/10.1007/s10910-019-01036-x
DOI: https://doi.org/10.1007/s10910-019-01036-x