Genomic Applications of the Neyman–Pearson Classification Paradigm

Abstract

The Neyman–Pearson (NP) classification paradigm addresses an important binary classification problem in which users want to minimize the type II error while keeping the type I error below a specified level α, usually a small number. This problem arises in many genomic applications that involve binary classification. The term "Neyman–Pearson classification paradigm" comes from its connection to the Neyman–Pearson paradigm in hypothesis testing. The NP paradigm applies when one type of error (e.g., type I error) is far more important than the other (e.g., type II error) and users have a specific target bound for the former. In this chapter, we review the NP classification literature, focusing on genomic applications and on our contributions to NP classification theory and algorithms. We also provide simulation examples and a genomic case study to demonstrate how to use the NP classification algorithm in practice.
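
To make the constraint in the abstract concrete, here is a minimal Python sketch. It is not the chapter's algorithm. It assumes a classifier has already produced a numeric score for each observation (higher means more likely class 1) and simply sets the score threshold so that at most a fraction α of a held-out class 0 sample is misclassified. The function names `np_threshold`, `classify`, and `error_rates` are hypothetical, and the approach bounds only the empirical type I error on the sample used to set the threshold; a guarantee on the population type I error needs a more careful choice of threshold.

```python
import numpy as np

def np_threshold(class0_scores, alpha):
    """Return a threshold t such that at most a fraction `alpha` of the
    given class 0 scores are strictly greater than t."""
    s = np.sort(np.asarray(class0_scores, dtype=float))
    n = s.size
    # Number of class 0 observations allowed to be misclassified as class 1.
    allowed = int(np.floor(alpha * n))
    # Use the (n - allowed)-th smallest score as the threshold.
    return s[n - allowed - 1]

def classify(scores, threshold):
    """Predict class 1 when the score is strictly above the threshold."""
    return (np.asarray(scores, dtype=float) > threshold).astype(int)

def error_rates(y_true, y_pred):
    """Empirical type I error (class 0 predicted as 1) and
    type II error (class 1 predicted as 0)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    type1 = float(np.mean(y_pred[y_true == 0] == 1))
    type2 = float(np.mean(y_pred[y_true == 1] == 0))
    return type1, type2

# Toy usage with made-up scores (assumption: scores come from any fitted model).
rng = np.random.default_rng(0)
scores0 = rng.normal(0.0, 1.0, size=200)   # class 0 scores
scores1 = rng.normal(2.0, 1.0, size=200)   # class 1 scores
t = np_threshold(scores0, alpha=0.05)
y_true = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])
y_pred = classify(np.concatenate([scores0, scores1]), t)
type1, type2 = error_rates(y_true, y_pred)
```

Lowering α pushes the threshold up, which reduces the type I error at the cost of a larger type II error; this is the trade-off the NP paradigm makes explicit.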


Acknowledgements

Dr. Jingyi Jessica Li’s work was supported by the start-up fund of the UCLA Department of Statistics and the Hellman Fellowship. Dr. Xin Tong’s work was supported by the Zumberge Individual Award from the University of Southern California and summer research support from the Marshall School of Business. We thank Dr. Yang Feng in the Department of Statistics at Columbia University and Ms. Anqi Zhao in the Department of Statistics at Harvard University for their help in developing the Neyman–Pearson classification algorithms. We also thank Dr. Wei Li and Mr. Sheng’en Shawn Hu in Dr. X. Shirley Liu’s group in the Department of Biostatistics and Computational Biology at the Dana-Farber Cancer Institute and the Harvard School of Public Health for kindly sharing the data for our genomic case study in Sect. 4.

Author information

Corresponding author

Correspondence to Jingyi Jessica Li.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Li, J.J., Tong, X. (2016). Genomic Applications of the Neyman–Pearson Classification Paradigm. In: Wong, KC. (eds) Big Data Analytics in Genomics. Springer, Cham. https://doi.org/10.1007/978-3-319-41279-5_4

  • DOI: https://doi.org/10.1007/978-3-319-41279-5_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41278-8

  • Online ISBN: 978-3-319-41279-5

  • eBook Packages: Computer Science, Computer Science (R0)
