This Comment describes common pitfalls encountered in deriving and validating predictive statistical models from high-dimensional data. It offers a fresh perspective on key statistical issues and provides guidelines to help readers avoid these pitfalls and, where they are unfamiliar with the field, to better assess the reliability and significance of their results.
Cite this article
Teschendorff, A.E. Avoiding common pitfalls in machine learning omic data science. Nat. Mater. 18, 422–427 (2019). https://doi.org/10.1038/s41563-018-0241-z
DOI: https://doi.org/10.1038/s41563-018-0241-z