Avoiding common pitfalls in machine learning omic data science

  • Comment

From Nature Materials

This Comment describes common pitfalls encountered when deriving and validating predictive statistical models from high-dimensional data. It offers a fresh perspective on key statistical issues and provides guidelines for avoiding these pitfalls, helping readers unfamiliar with the field to better assess the reliability and significance of their results.

Fig. 1: The curse of dimensionality and overfitting.
Fig. 2: Avoiding bias when training and evaluating molecular predictors.
Fig. 3: Unknown confounders and class prediction.
Fig. 4: Avoiding bias when comparing feature selection methods.


Corresponding author

Correspondence to Andrew E. Teschendorff.

Cite this article

Teschendorff, A.E. Avoiding common pitfalls in machine learning omic data science. Nat. Mater. 18, 422–427 (2019). https://doi.org/10.1038/s41563-018-0241-z

  • Springer Nature Limited
