Avoiding common pitfalls in machine learning omic data science

Teschendorff, Andrew E.

doi:10.1038/s41563-018-0241-z


Comment
Published: 26 November 2018

Volume 18, pages 422–427, (2019)

Andrew E. Teschendorff^1,2


This Comment describes common pitfalls encountered when deriving and validating predictive statistical models from high-dimensional data. It offers a fresh perspective on key statistical issues and provides guidelines for avoiding these pitfalls, helping readers unfamiliar with the field to better assess the reliability and significance of their results.


**Fig. 1: The curse of dimensionality and overfitting.**

**Fig. 2: Avoiding bias when training and evaluating molecular predictors.**

**Fig. 3: Unknown confounders and class prediction.**

**Fig. 4: Avoiding bias when comparing feature selection methods.**


Author information

Authors and Affiliations

Statistical Cancer Genomics, UCL Cancer Institute and Department of Woman’s Cancer, University College London, London, UK
Andrew E. Teschendorff
CAS Key Lab of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institute for Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
Andrew E. Teschendorff


Corresponding author

Correspondence to Andrew E. Teschendorff.


About this article

Cite this article

Teschendorff, A.E. Avoiding common pitfalls in machine learning omic data science. Nat. Mater. 18, 422–427 (2019). https://doi.org/10.1038/s41563-018-0241-z


Issue Date: May 2019
DOI: https://doi.org/10.1038/s41563-018-0241-z
Springer Nature Limited

