Analytical and Bioanalytical Chemistry, Volume 406, Issue 29, pp 7581–7590

A comparison of different chemometrics approaches for the robust classification of electronic nose data

  • Piotr S. Gromski
  • Elon Correa
  • Andrew A. Vaughan
  • David C. Wedge
  • Michael L. Turner
  • Royston Goodacre
Research Paper

Abstract

Accurate detection of certain chemical vapours is important, as these may be diagnostic for the presence of weapons, drugs of misuse or disease. In order to achieve this, chemical sensors could be deployed remotely. However, the readout from such sensors is a multivariate pattern, and this needs to be interpreted robustly using powerful supervised learning methods. Therefore, in this study, we compared the classification accuracy of four pattern recognition algorithms: linear discriminant analysis (LDA), partial least squares-discriminant analysis (PLS-DA), random forests (RF) and support vector machines (SVM) employing four different kernels. For this purpose, we used electronic nose (e-nose) sensor data (Wedge et al., Sensors Actuators B Chem 143:365–372, 2009). To allow direct comparison between the four algorithms, we employed two model validation procedures, based on either 10-fold cross-validation or bootstrapping. The results show that LDA (91.56 % accuracy) and SVM with a polynomial kernel (91.66 % accuracy) were very effective at analysing these e-nose data, giving superior prediction accuracy, sensitivity and specificity in comparison to the other techniques employed. With respect to the e-nose sensor data studied here, our findings recommend that SVM with a polynomial kernel be favoured as a classification method over the other statistical models assessed. SVMs with non-linear kernels have the advantage that they can model non-linear as well as linear mappings from the analytical data space to multi-group classifications, and would thus be a suitable algorithm for the analysis of most e-nose sensor data.

Keywords

Linear discriminant analysis · Partial least squares-discriminant analysis · Random forests · Support vector machines · Bootstrapping · Cross-validation

Notes

Acknowledgments

The authors would like to thank PhastID (grant agreement no. 258238), a European project supported within the Seventh Framework Programme for Research and Technological Development, for funding and for the studentship for PSG. Additionally, the authors would like to thank the reviewers for their useful comments and suggestions, which have helped us improve our manuscript.

Supplementary material

ESM 1: 216_2014_8216_MOESM1_ESM.pdf (PDF 9.3 MB)

References

  1. Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1:67–82
  2. Rock F, Barsan N, Weimar U (2008) Electronic nose: current status and future trends. Chem Rev 108:705–725
  3. Scott SM, James D, Ali Z (2006) Data analysis for electronic nose systems. Microchim Acta 156:183–207
  4. Manly BFJ (1986) Multivariate statistical methods: a primer. Chapman and Hall
  5. Jurs PC, Bakken GA, McClelland HE (2000) Computational methods for the analysis of chemical sensor array data from volatile analytes. Chem Rev 100:2649–2678
  6. Dobrokhotov V, Oakes L, Sowell D, Larin A, Hall J, Kengne A, Bakharev P, Corti G, Cantrell T, Prakash T, Williams J, McIlroy DN (2012) Toward the nanospring-based artificial olfactory system for trace-detection of flammable and explosive vapors. Sensors Actuators B Chem 168:138–148
  7. Dragonieri S, Schot R, Mertens BJA, Le Cessie S, Gauw SA, Spanevello A, Resta O, Willard NP, Vink TJ, Rabe KF, Bel EH, Sterk PJ (2007) An electronic nose in the discrimination of patients with asthma and controls. J Allergy Clin Immunol 120:856–862
  8. Wold S, Sjostrom M, Eriksson L (2001) PLS-regression: a basic tool of chemometrics. Chemometr Intell Lab 58:109–130
  9. Cynkar W, Dambergs R, Smith P, Cozzolino D (2010) Classification of Tempranillo wines according to geographic origin: combination of mass spectrometry based electronic nose and chemometrics. Anal Chim Acta 660:227–231
  10. Di Natale C, Macagnano A, Martinelli E, Paolesse R, D’Arcangelo G, Roscioni C, Finazzi-Agro A, D’Amico A (2003) Lung cancer identification by the analysis of breath by means of an array of non-selective gas sensors. Biosens Bioelectron 18:1209–1218
  11. Bernabei M, Pennazza G, Santortico M, Corsi C, Roscioni C, Paolesse R, Di Natale C, D’Amico A (2008) A preliminary study on the possibility to diagnose urinary tract cancers by an electronic nose. Sensors Actuators B Chem 131:1–4
  12. Brereton RG (2009) Chemometrics for pattern recognition. Wiley, Chichester
  13. Breiman L (2001) Random forests. Mach Learn 45:5–32
  14. Pardo M, Sberveglieri G (2008) Random forests and nearest shrunken centroids for the classification of sensor array data. Sensors Actuators B Chem 131:93–99
  15. Vapnik VN (1999) An overview of statistical learning theory. IEEE Trans Neural Netw 10:988–999
  16. Pardo M, Sberveglieri G (2005) Classification of electronic nose data with support vector machines. Sensors Actuators B Chem 107:730–737
  17. Gualdron O, Brezmes J, Llobet E, Amari A, Vilanova X, Bouchikhi B, Correig X (2007) Variable selection for support vector machine based multisensor systems. Sensors Actuators B Chem 122:259–268
  18. Machado RF, Laskowski D, Deffenderfer O, Burch T, Zheng S, Mazzone PJ, Mekhail T, Jennings C, Stoller JK, Pyle J, Duncan J, Dweik RA, Erzurum SC (2005) Detection of lung cancer by sensor array analyses of exhaled breath. Am J Respir Crit Care Med 171:1286–1291
  19. Sattlecker M, Bessant C, Smith J, Stone N (2010) Investigation of support vector machines and Raman spectroscopy for lymph node diagnostics. Analyst 135:895–901
  20. Distante C, Ancona N, Siciliano P (2003) Support vector machines for olfactory signals recognition. Sensors Actuators B Chem 88:30–39
  21. Wedge DC, Das A, Dost R, Kettle J, Madec MB, Morrison JJ, Grell M, Kell DB, Richardson TH, Yeates S, Turner ML (2009) Real-time vapour sensing using an OFET-based electronic nose and genetic programming. Sensors Actuators B Chem 143:365–372
  22. Gilbert RJ, Goodacre R, Woodward AM, Kell DB (1997) Genetic programming: a novel method for the quantitative analysis of pyrolysis mass spectral data. Anal Chem 69:4381–4389
  23. Koza JR (1992) Genetic programming: on the programming of computers by means of natural selection. MIT Press, Cambridge
  24. Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2:121–167
  25. Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th international joint conference on artificial intelligence, Montreal. Morgan Kaufmann
  26. Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7:1–26
  27. Pearce TC, Sánchez-Montañés M (2003) Chemical sensor array optimization: geometric and information theoretic approaches. In: Pearce TC, Schiffman SS, Nagle HT, Gardner JW (eds) Handbook of machine olfaction: electronic nose technology. Wiley, Weinheim
  28. R Development Core Team (2008) R: a language and environment for statistical computing. R Foundation for Statistical Computing. http://www.R-project.org
  29. Brereton RG (2006) Consequences of sample size, variable selection, and model validation and optimisation, for predicting classification ability from analytical data. TrAC Trends Anal Chem 25:1103–1111
  30. Brereton RG, Lloyd GR (2014) Partial least squares discriminant analysis: taking the magic away. J Chemometrics 28:213–225
  31. Dixon SJ, Brereton RG (2009) Comparison of performance of five common classifiers represented as boundary methods: Euclidean distance to centroids, linear discriminant analysis, quadratic discriminant analysis, learning vector quantization and support vector machines, as dependent on data structure. Chemometr Intell Lab 95:1–17
  32. Efron B, Tibshirani R (1997) Improvements on cross-validation: the .632+ bootstrap method. J Am Stat Assoc 92:548–560
  33. Jain AK, Dubes RC, Chen CC (1987) Bootstrap techniques for error estimation. IEEE Trans Pattern Anal Mach Intell 9:628–633
  34. Xu Y, Zomer S, Brereton RG (2006) Support vector machines: a recent method for classification in chemometrics. Crit Rev Anal Chem 36:177–188
  35. Gunn SR (1998) Support vector machines for classification and regression. Technical report. http://ce.sharif.ir/courses/85-86/2/ce725/resources/root/LECTURES/SVM.pdf
  36. Ben-Hur A, Weston J (2010) A user’s guide to support vector machines. Technical report. http://pyml.sourceforge.net/doc/howto.pdf
  37. Westerhuis JA, Hoefsloot HCJ, Smit S, Vis DJ, Smilde AK, van Velzen EJJ, van Duijnhoven JPM, van Dorsten FA (2008) Assessment of PLSDA cross validation. Metabolomics 4:81–89
  38. Goodacre R, Timmins EM, Burton R, Kaderbhai N, Woodward AM, Kell DB, Rooney PJ (1998) Rapid identification of urinary tract infection bacteria using hyperspectral whole-organism fingerprinting and artificial neural networks. Microbiology 144:1157–1170
  39. Goodacre R, Broadhurst D, Smilde AK, Kristal BS, Baker JD, Beger R, Bessant C, Connor S, Calmani G, Craig A, Ebbels T, Kell DB, Manetti C, Newton J, Paternostro G, Somorjai R, Sjostrom M, Trygg J, Wulfert F (2007) Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics 3:231–241
  40. Venables WN, Ripley BD (2002) Modern applied statistics with S. Springer, New York
  41. Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28:1–26
  42. Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2:18–22
  43. Karatzoglou A, Meyer D, Hornik K (2006) Support vector machines in R. J Stat Softw 15:1–28
  44. Gromski PS, Xu Y, Correa E, Ellis DI, Turner ML, Goodacre R (2014) A comparative investigation of modern feature selection and classification approaches for the analysis of mass spectrometry data. Anal Chim Acta 829:1–8
  45. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46:389–422
  46. Chapelle O, Vapnik V, Bousquet O, Mukherjee S (2002) Choosing multiple parameters for support vector machines. Mach Learn 46:131–159

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Piotr S. Gromski (1)
  • Elon Correa (1)
  • Andrew A. Vaughan (1)
  • David C. Wedge (2)
  • Michael L. Turner (3)
  • Royston Goodacre (1)

  1. School of Chemistry, Manchester Institute of Biotechnology, The University of Manchester, Manchester, UK
  2. Cancer Genome Project, Wellcome Trust Sanger Institute, Hinxton, UK
  3. School of Chemistry, The University of Manchester, Manchester, UK