Abstract
Infrared spectroscopy is a crucial analytical tool in organic chemistry, but interpreting IR data can be challenging. This study provides a comprehensive analysis of five machine learning models, namely logistic regression, k-nearest neighbors (KNN), support vector machine (SVM), random forest, and multilayer perceptron (MLP), and of their effectiveness in interpreting IR spectra. The simple KNN model outperformed the more complex SVM model in both execution time and F1 score, demonstrating the potential of simpler models for interpreting IR data. Combining the original spectra with their corresponding derivatives improved the performance of all models with only a minimal increase in execution time. Denoising of the IR data was also investigated but did not significantly improve performance. The MLP model performed better than the KNN model, but at the cost of a substantially longer execution time. Ultimately, KNN is recommended for rapid results with minimal performance compromise, while MLP is suggested for projects that prioritize accuracy despite the longer execution time.
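The idea of feeding a classifier both a spectrum and its derivative can be illustrated with a minimal sketch. It is not the authors' pipeline: the synthetic spectra, the finite-difference derivative, the hand-written KNN, and every function name below (`toy_spectrum`, `first_derivative`, `augment`, `knn_predict`, `f1_score`) are assumptions made for illustration, and the code uses only the Python standard library. A real study would more likely rely on a library such as scikit-learn.

```python
import math
from collections import Counter


def first_derivative(spectrum):
    """Central finite difference; one-sided at both ends (needs >= 2 points)."""
    n = len(spectrum)
    deriv = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        deriv.append((spectrum[hi] - spectrum[lo]) / (hi - lo))
    return deriv


def augment(spectrum):
    """Concatenate a spectrum with its first derivative into one feature vector."""
    return list(spectrum) + first_derivative(spectrum)


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def f1_score(y_true, y_pred, positive=1):
    """F1 for one positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def toy_spectrum(has_band, shift):
    """Hypothetical spectrum: sloping baseline plus an optional Gaussian band."""
    return [
        0.01 * i + shift + (math.exp(-((i - 25) ** 2) / 8.0) if has_band else 0.0)
        for i in range(50)
    ]


# Label 1 = band present (a stand-in for "moiety present"), label 0 = absent.
train = [(toy_spectrum(bool(label), s), label) for label in (0, 1) for s in (0.0, 0.5, 1.0)]
test = [(toy_spectrum(bool(label), s), label) for label in (0, 1) for s in (0.25, 0.75)]

train_X = [augment(spec) for spec, _ in train]
train_y = [label for _, label in train]
test_y = [label for _, label in test]
pred_y = [knn_predict(train_X, train_y, augment(spec)) for spec, _ in test]

score = f1_score(test_y, pred_y)
print(f"F1 on toy data: {score:.2f}")
```

The augmented vector is twice as long as the raw spectrum, so the extra cost for a distance-based model such as KNN grows only linearly with the number of added features. The toy data are far too small to say anything about the performance figures reported in the study.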
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Contributions
MK was involved in conceptualization, data curation, formal analysis, investigation, methodology, validation, visualization, and writing—original draft. GM contributed to supervision, resources, and writing—original draft.
Ethics declarations
Conflict of interest
There are no conflicts to declare.
About this article
Cite this article
Krzyżanowski, M., Matyszczak, G. Machine learning prediction of organic moieties from the IR spectra, enhanced by additionally using the derivative IR data. Chem. Pap. 78, 3149–3173 (2024). https://doi.org/10.1007/s11696-024-03301-z
DOI: https://doi.org/10.1007/s11696-024-03301-z