Abstract
The modelability index of a dataset of molecules measures how well the dataset can be modeled with a QSAR algorithm. It predicts the correct classification rate of the dataset by counting, for each molecule, the nearest neighbors that belong to the same class. In this paper, we propose a new measure for predicting the modelability of datasets, based on the nearest-neighbor-based rivality index and the centroid-based rivality index. These indexes take into account the noise that a nearest neighbor belonging to a different class can introduce into the results of a QSAR classification algorithm. Using thirty benchmark datasets, two types of dataset representation, and six different algorithms, we show the excellent behavior of the proposed indexes: the correct classification rate obtained with five-fold cross-validation and the modelability index calculated using the centroid-based rivality index correlate with R² values greater than 0.9.
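The nearest-neighbor counting described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes molecules are given as numeric descriptor vectors compared with Euclidean distance, that every class has at least two members, and that no two molecules coincide. The function names (`modelability`, `rivality`) are hypothetical, and the rivality formula used here, (d_same - d_diff) / (d_same + d_diff), is an assumption modeled on the rival-similarity function; the abstract itself does not state it.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(i, points, labels, same_class):
    """Distance from molecule i to its nearest other molecule that is
    (same_class=True) or is not (same_class=False) in the class of i."""
    best = math.inf
    for j, p in enumerate(points):
        if j == i:
            continue
        if (labels[j] == labels[i]) != same_class:
            continue
        best = min(best, euclidean(points[i], p))
    return best


def modelability(points, labels):
    """Per-class fraction of molecules whose nearest neighbor belongs to
    their own class, averaged over the classes."""
    hits = {}
    for i in range(len(points)):
        d_same = nearest(i, points, labels, True)
        d_diff = nearest(i, points, labels, False)
        hits.setdefault(labels[i], []).append(d_same < d_diff)
    return sum(sum(h) / len(h) for h in hits.values()) / len(hits)


def rivality(points, labels):
    """Assumed rivality-style value per molecule, in [-1, 1]. Negative:
    the nearest same-class neighbor is closer. Positive: a molecule of
    another class is closer (a likely source of classification noise)."""
    values = []
    for i in range(len(points)):
        d_same = nearest(i, points, labels, True)
        d_diff = nearest(i, points, labels, False)
        values.append((d_same - d_diff) / (d_same + d_diff))
    return values


# Toy data: the last molecule is labeled 0 but sits next to class 1.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (4, 4.5)]
labels = [0, 0, 1, 1, 0]

modi = modelability(points, labels)
ri = rivality(points, labels)
print(round(modi, 4))
print([round(v, 3) for v in ri])
```

In this toy dataset only the last molecule has a nearest neighbor of a different class, so it is the only one with a positive value under the assumed formula, and it lowers the class-averaged index below 1.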
Funding
No funding supported the manuscript.
Author information
Contributions
ILR and MAGN shared all design and experimental tasks and the development of the study and the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Availability of data and material
A Word file with the results of the prediction models built using fingerprint and similarity matrices as input data to the algorithms.
About this article
Cite this article
Luque Ruiz, I., Gómez-Nieto, M.Á. Prediction of the datasets modelability for the building of QSAR classification models by means of the centroid based rivality index. J Math Chem 57, 1374–1393 (2019). https://doi.org/10.1007/s10910-018-0972-8