Abstract
Machine learning based statistical models have significantly increased the speed and accuracy with which the chemical and physical properties of compounds can be predicted, compared with experimental and traditional ab initio quantum mechanical approaches. These techniques have transformed the chemical sciences and changed the way experiments are designed, and the last decade has seen the rise of computer-aided molecular design based on machine learning algorithms. The major challenge has been generating machine-readable data, in the form of descriptors and observations, for training the model; this can be time-consuming and computationally expensive when a molecular encoding based on atomic coordinates is used. In this study, we address this problem by using the SMILES representation of molecules to generate various topological, physicochemical, electronic and steric descriptors with open-source cheminformatics packages. Using the data generated with these packages, we developed a simple and explainable quantitative structure–property relationship model, based on an artificial neural network with 7 numerical descriptors and 1 categorical descriptor, for predicting the empirical polarity of a wide diversity of organic solvents. Since polarity reflects the various solute–solvent and solvent–solvent interactions taking place in an organic transformation, knowing it in advance can help a chemist design better experiments.
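As a rough illustration of the input described above (7 numerical descriptors plus 1 categorical descriptor feeding a neural network), the following minimal sketch assembles such a feature vector and passes it through a small untrained feed-forward network. It is not the model from this study: the category names in `SOLVENT_CLASSES`, the descriptor values, the layer sizes and all function names are hypothetical placeholders, and the weights are random, so the outputs carry no chemical meaning. In practice the numerical descriptors would come from a cheminformatics package applied to SMILES strings, and the network would be trained on measured ET(30) values.

```python
import math
import random

# Hypothetical category levels; the abstract does not name the categorical descriptor.
SOLVENT_CLASSES = ["protic", "dipolar_aprotic", "apolar"]


def standardize(column):
    """Scale one descriptor column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for constant columns
    return [(x - mean) / std for x in column]


def one_hot(label):
    """Encode a categorical label as a 0/1 vector over SOLVENT_CLASSES."""
    return [1.0 if label == c else 0.0 for c in SOLVENT_CLASSES]


def build_features(numeric_rows, labels):
    """Join standardized numeric descriptors with the one-hot category."""
    columns = [standardize(list(col)) for col in zip(*numeric_rows)]
    scaled_rows = [list(row) for row in zip(*columns)]
    return [row + one_hot(lab) for row, lab in zip(scaled_rows, labels)]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def relu(v):
    return max(0.0, v)


def dense(x, weights, biases, activation=None):
    """Apply one dense layer, optionally followed by an activation."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


def predict(x, hidden, output):
    """Forward pass: one ReLU hidden layer, then a single linear output."""
    h = dense(x, *hidden, activation=relu)
    return dense(h, *output)[0]


rng = random.Random(0)

# Made-up values standing in for 7 numeric descriptors of 3 solvents.
numeric_rows = [
    [0.1, 1.2, 3.0, 0.5, 2.2, 0.0, 1.0],
    [0.4, 0.8, 2.5, 0.9, 1.7, 1.0, 2.0],
    [0.9, 0.3, 1.0, 1.4, 0.6, 2.0, 0.0],
]
labels = ["protic", "dipolar_aprotic", "apolar"]

features = build_features(numeric_rows, labels)
hidden = init_layer(len(features[0]), 4, rng)
output = init_layer(4, 1, rng)
predictions = [predict(x, hidden, output) for x in features]
```

One-hot encoding is used here so that the categorical descriptor enters the network without implying an ordering between categories; other encodings would also be possible.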
Graphical abstract
An ANN algorithm based on 8 descriptors was successfully employed to predict the ET(30) values of organic solvents.
Data availability
The datasets and model algorithms can be accessed from this link: https://github.com/v-saini/SMILES-EP.git.
Funding
This work was supported by the Department of Science and Technology, Ministry of Science and Technology, India (Grant Number DST/INSPIRE/04/2017/002529).
Ethics declarations
Competing Interests
The author has no relevant financial or non-financial interests to disclose.
About this article
Cite this article
Saini, V. Machine learning prediction of empirical polarity using SMILES encoding of organic solvents. Mol Divers 27, 2331–2343 (2023). https://doi.org/10.1007/s11030-022-10559-6
DOI: https://doi.org/10.1007/s11030-022-10559-6