Assessing and improving the stability of chemometric models in small sample size situations

Beleites, Claudia; Salzer, Reiner

doi:10.1007/s00216-007-1818-6

Original Paper
Published: 29 January 2008

Analytical and Bioanalytical Chemistry, Volume 390, pages 1261–1271 (2008)

Claudia Beleites¹ & Reiner Salzer¹

Abstract

Small sample sizes are very common in multivariate analysis. Sample sizes of 10–100 statistically independent objects (rejects from processes or loading dock analysis, or patients with a rare disease), each with hundreds of data points, cause unstable models with poor predictive quality. Model stability is assessed by comparing models that were built using slightly varying training data. Iterated k-fold cross-validation is used for this purpose. Aggregation stabilizes models. It is possible to assess the quality of the aggregated model without calculating further models. The validation and aggregation methods investigated in this study apply to regression as well as to classification. These techniques are useful for analyzing data with large numbers of variates, e.g., any spectral data like FT-IR, Raman, UV/VIS, fluorescence, AAS, and MS. FT-IR images of tumor tissue were used in this study. Some tissue types occur frequently, while some are very rare. They are classified using LDA. Initial models were severely unstable. Aggregation stabilizes the predictions. The hit rate increased from 67% to 82%.
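
The workflow outlined in the abstract can be illustrated with a small sketch. It is one reading of the abstract, not the authors' implementation: the synthetic data, the nearest-centroid classifier (a stand-in for the LDA used in the study), and all function names and parameter values are assumptions made for illustration. Each iteration of k-fold cross-validation holds every sample out exactly once, so after several iterations each sample has a set of predictions from models that never saw it. Disagreement within that set indicates instability, and a majority vote over it estimates the performance of the aggregated model without fitting additional models.

```python
import random
from collections import Counter

def make_data(n_per_class=15, n_features=50, shift=0.6, seed=0):
    """Synthetic stand-in for spectra: two classes, few samples, many variates."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label * shift, 1.0) for _ in range(n_features)])
            y.append(label)
    return X, y

def fit_centroids(X, y):
    """Nearest-centroid classifier (hypothetical stand-in, not the paper's LDA)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sqdist(centroids[label]))

def iterated_kfold(X, y, k=5, iterations=21, seed=1):
    """Per sample, collect one held-out prediction per cross-validation iteration."""
    rng = random.Random(seed)
    n = len(X)
    preds = [[] for _ in range(n)]
    for _ in range(iterations):
        idx = list(range(n))
        rng.shuffle(idx)                      # new random split each iteration
        folds = [idx[i::k] for i in range(k)]
        for test_idx in folds:
            held_out = set(test_idx)
            train = [i for i in idx if i not in held_out]
            model = fit_centroids([X[i] for i in train], [y[i] for i in train])
            for i in test_idx:
                preds[i].append(predict(model, X[i]))
    return preds

def summarize(preds, y):
    """Hit rate of single models, hit rate of the vote, and share of stable samples."""
    n = len(y)
    total = sum(len(ps) for ps in preds)
    single = sum(p == t for ps, t in zip(preds, y) for p in ps) / total
    # Aggregation: majority vote over the models that did not train on the sample.
    votes = [Counter(ps).most_common(1)[0][0] for ps in preds]
    aggregated = sum(v == t for v, t in zip(votes, y)) / n
    # Stability: fraction of samples that every held-out model predicts identically.
    stable = sum(len(set(ps)) == 1 for ps in preds) / n
    return single, aggregated, stable

X, y = make_data()
preds = iterated_kfold(X, y)
single, aggregated, stable = summarize(preds, y)
print(f"single-model hit rate: {single:.2f}")
print(f"aggregated hit rate:   {aggregated:.2f}")
print(f"stable samples:        {stable:.2f}")
```

An odd number of iterations is used so that the two-class vote cannot tie. Whether aggregation improves the hit rate depends on the data and the classifier; the sketch only shows how the quantities can be computed from a single set of cross-validation runs.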

Author information

Authors and Affiliations

Institute for Analytical Chemistry, Dresden University of Technology, Bergstrasse 66, 01062, Dresden, Germany
Claudia Beleites & Reiner Salzer

Corresponding author

Correspondence to Claudia Beleites.

About this article

Cite this article

Beleites, C., Salzer, R. Assessing and improving the stability of chemometric models in small sample size situations. Anal Bioanal Chem 390, 1261–1271 (2008). https://doi.org/10.1007/s00216-007-1818-6

Received: 04 October 2007
Revised: 07 December 2007
Accepted: 14 December 2007
Published: 29 January 2008
Issue Date: March 2008
DOI: https://doi.org/10.1007/s00216-007-1818-6
