Abstract
Small sample sizes are very common in multivariate analysis. Data sets of 10–100 statistically independent objects (e.g., process rejects, loading-dock analyses, or patients with a rare disease), each described by hundreds of data points, lead to unstable models with poor predictive quality. Model stability is assessed by comparing models built on slightly varying training data; iterated k-fold cross-validation is used for this purpose. Aggregation stabilizes the models, and the quality of the aggregated model can be assessed without calculating further models. The validation and aggregation methods investigated in this study apply to regression as well as to classification. These techniques are useful for analyzing data with large numbers of variates, e.g., spectral data such as FT-IR, Raman, UV/VIS, fluorescence, AAS, and MS. FT-IR images of tumor tissue were used in this study. Some tissue types occur frequently, while others are very rare. The tissue types were classified using linear discriminant analysis (LDA). The initial models were severely unstable. Aggregation stabilized the predictions and increased the hit rate from 67% to 82%.
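The procedure outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-centroid classifier stands in for LDA to keep the example dependency-free, and the function names, the agreement-based stability measure, and the synthetic data are all assumptions made for illustration. In each iteration every sample is predicted once by a model that was not trained on it, so a majority vote over iterations gives an out-of-bag-style estimate for the aggregated model without fitting any additional models.

```python
import random
from collections import Counter


def nearest_centroid_fit(X, y):
    """Fit a nearest-centroid classifier (stand-in for LDA): {label: centroid}."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def nearest_centroid_predict(model, row):
    """Return the label whose centroid is closest to `row`."""
    def sqdist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    # sorted() makes tie-breaking deterministic
    return min(sorted(model), key=lambda lab: sqdist(model[lab]))


def iterated_kfold_predictions(X, y, k=5, iterations=20, seed=0):
    """Collect, per sample, one held-out prediction per iteration."""
    rng = random.Random(seed)
    n = len(X)
    preds = [[] for _ in range(n)]
    for _ in range(iterations):
        idx = list(range(n))
        rng.shuffle(idx)                      # new random fold assignment
        folds = [idx[f::k] for f in range(k)]
        for test in folds:
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            model = nearest_centroid_fit([X[i] for i in train],
                                         [y[i] for i in train])
            for i in test:
                preds[i].append(nearest_centroid_predict(model, X[i]))
    return preds


def summarize(preds, y):
    """Hit rates of single vs. aggregated models, plus prediction stability."""
    total = sum(len(p) for p in preds)
    single_hits = sum(p.count(t) for p, t in zip(preds, y))
    majority = [Counter(p).most_common(1)[0] for p in preds]  # (label, votes)
    agg_hits = sum(lab == t for (lab, _), t in zip(majority, y))
    # Stability (assumed measure): mean share of models agreeing with the vote
    stability = sum(v / len(p) for (_, v), p in zip(majority, preds)) / len(preds)
    return {
        "single_model_hit_rate": single_hits / total,
        "aggregated_hit_rate": agg_hits / len(y),
        "stability": stability,
    }


# Synthetic small-sample data: 30 objects, 50 variates, two overlapping classes
demo_rng = random.Random(42)
demo_y = ["A"] * 15 + ["B"] * 15
demo_X = [[demo_rng.gauss(0.3 if lab == "B" else 0.0, 1.0) for _ in range(50)]
          for lab in demo_y]
print(summarize(iterated_kfold_predictions(demo_X, demo_y), demo_y))
```

A stability value well below 1 indicates that the prediction for a given object depends strongly on which other objects happened to be in the training set, which is the kind of instability the abstract describes. The numbers printed for the synthetic data are not comparable to the hit rates reported in the study.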
Cite this article
Beleites, C., Salzer, R. Assessing and improving the stability of chemometric models in small sample size situations. Anal Bioanal Chem 390, 1261–1271 (2008). https://doi.org/10.1007/s00216-007-1818-6
DOI: https://doi.org/10.1007/s00216-007-1818-6