Analysis of NIR spectroscopic data using decision trees and their ensembles
Decision trees and their ensembles have become quite popular for data analysis during the past decade. One of the main reasons is the current boom in big data, where traditional statistical methods (such as multiple linear regression) are not very efficient. In chemometrics, however, tree-based methods are still not widespread, primarily because of several limitations related to the ratio between the number of variables and the number of observations. This paper presents several examples of how decision trees and their ensembles can be used in the analysis of NIR spectroscopic data for both regression and classification. We consider the important aspects, including optimization and validation of models, evaluation of results, treatment of missing data, and selection of the most important variables. The performance and outcome of the decision tree-based methods are compared with a more traditional approach based on partial least squares.
Keywords: NIR spectroscopy; Decision trees; Classification and regression trees; Random forests
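As an illustration of the basic operation that classification trees build on, the following is a minimal sketch of choosing a split threshold for a single spectral variable by minimizing the weighted Gini impurity. It is not the implementation used in the paper. The function names `gini` and `best_split`, and the toy absorbance values and class labels, are hypothetical and chosen only for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Find the threshold on one variable that minimizes weighted Gini impurity.

    Returns (threshold, weighted_impurity). The threshold is None if no split
    improves on the impurity of the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        lower, upper = pairs[i - 1][0], pairs[i][0]
        if lower == upper:
            continue  # no threshold can separate identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            # place the threshold midway between the two neighbouring values
            best = ((lower + upper) / 2, score)
    return best


# Hypothetical toy data: absorbance at one wavelength for six samples, two classes.
absorbance = [0.10, 0.12, 0.15, 0.40, 0.42, 0.45]
classes = ["A", "A", "A", "B", "B", "B"]

threshold, impurity = best_split(absorbance, classes)
print(threshold, impurity)
```

A full tree repeats this search over all variables at every node. With NIR spectra, where the number of wavelengths is often much larger than the number of samples, that search is one place where the variables-to-observations ratio mentioned in the abstract becomes relevant.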