A comparison of different chemometrics approaches for the robust classification of electronic nose data
Accurate detection of certain chemical vapours is important, as these may be diagnostic for the presence of weapons, drugs of misuse or disease. To achieve this, chemical sensors could be deployed remotely. However, the readout from such sensors is a multivariate pattern that must be interpreted robustly using powerful supervised learning methods. In this study, we therefore compared the classification accuracy of four pattern recognition algorithms: linear discriminant analysis (LDA), partial least squares-discriminant analysis (PLS-DA), random forests (RF) and support vector machines (SVM), the last of which was tested with four different kernels. For this purpose, we used electronic nose (e-nose) sensor data (Wedge et al., Sensors Actuators B Chem 143:365–372, 2009). To allow direct comparison between the four algorithms, we employed two model validation procedures, based on either 10-fold cross-validation or bootstrapping. The results show that LDA (91.56 % accuracy) and SVM with a polynomial kernel (91.66 % accuracy) were very effective at analysing these e-nose data. These two models gave superior prediction accuracy, sensitivity and specificity compared with the other techniques employed. For the e-nose sensor data studied here, our findings indicate that SVM with a polynomial kernel should be favoured as a classification method over the other statistical models that we assessed. SVMs with non-linear kernels have the advantage that they can model non-linear as well as linear mappings from analytical data space to multi-group classifications, and would thus be a suitable algorithm for the analysis of most e-nose sensor data.
Keywords: Linear discriminant analysis; Partial least squares-discriminant analysis; Random forests; Support vector machines; Bootstrapping; Cross-validation
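The two validation procedures named in the abstract, 10-fold cross-validation and bootstrapping, differ mainly in how training and test samples are drawn. The following is a minimal Python sketch of that difference, not the authors' implementation. It makes several assumptions: the data are synthetic two-class Gaussian points rather than e-nose readouts, a simple nearest-centroid classifier stands in for LDA, PLS-DA, RF and SVM, the number of bootstrap rounds (100) is arbitrary, and all function names are hypothetical.

```python
import random
from statistics import mean


def make_data(n_per_class=50, seed=0):
    """Synthetic two-class, two-feature data (assumption: stands in for sensor readouts)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in enumerate([(0.0, 0.0), (3.0, 3.0)]):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0)])
            y.append(label)
    return X, y


def fit_centroids(X, y):
    """Placeholder classifier: store the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean distance)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))


def accuracy(train_idx, test_idx, X, y):
    """Train on train_idx, return the fraction of test_idx classified correctly."""
    model = fit_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
    hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
    return hits / len(test_idx)


def kfold_splits(n, k, rng):
    """k-fold CV: shuffle once, then each sample is tested exactly once."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def bootstrap_splits(n, n_rounds, rng):
    """Bootstrap: draw n training samples with replacement, test on the rest."""
    for _ in range(n_rounds):
        train = [rng.randrange(n) for _ in range(n)]
        chosen = set(train)
        test = [i for i in range(n) if i not in chosen]
        if test:  # skip the (very unlikely) round with no left-out samples
            yield train, test


X, y = make_data()
rng = random.Random(1)
cv_acc = mean(accuracy(tr, te, X, y) for tr, te in kfold_splits(len(X), 10, rng))
boot_acc = mean(accuracy(tr, te, X, y)
                for tr, te in bootstrap_splits(len(X), 100, rng))
print(f"10-fold CV accuracy: {cv_acc:.3f}")
print(f"Bootstrap accuracy:  {boot_acc:.3f}")
```

In this sketch, cross-validation tests every sample exactly once, whereas each bootstrap round trains on a resample drawn with replacement and tests on whichever samples were left out. To compare classifiers on an equal footing, each one would be evaluated on the same train/test splits.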
The authors would like to thank PhastID (grant agreement no. 258238), a European project supported within the Seventh Framework Programme for Research and Technological Development, for funding and for the studentship for PSG. Additionally, the authors thank the reviewers for their useful comments and suggestions, which have helped us improve our manuscript.