Genomic Applications of the Neyman–Pearson Classification Paradigm


Abstract

The Neyman–Pearson (NP) classification paradigm addresses binary classification problems in which users want to minimize the type II error while keeping the type I error under a specified level α, usually a small number. Such problems arise in many genomic applications that involve binary classification tasks. The name of the paradigm reflects its connection to the Neyman–Pearson paradigm in hypothesis testing. The NP paradigm is applicable when one type of error (e.g., the type I error) is far more consequential than the other (e.g., the type II error) and users have a specific target bound for the former. In this chapter, we review the NP classification literature, with a focus on genomic applications as well as our contributions to NP classification theory and algorithms. We also provide simulation examples and a genomic case study to demonstrate how to use NP classification algorithms in practice.
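
Formally, the NP paradigm seeks a classifier that minimizes the type II error (misclassifying class 1 as class 0) subject to the type I error (misclassifying class 0 as class 1) being at most α. The snippet below is a minimal, generic sketch of this constraint, under our own assumptions rather than the specific algorithms reviewed in this chapter: an arbitrary scoring classifier is fit to simulated data, and the decision threshold is then chosen from held-out class-0 scores so that the empirical type I error does not exceed α. The simulated data, scikit-learn's LogisticRegression, and all variable names are illustrative choices.

```python
# Illustrative sketch of Neyman-Pearson-style thresholding (not the chapter's
# own algorithm): fit any scoring classifier, then pick the decision threshold
# on held-out class-0 scores so the empirical type I error is at most alpha.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy two-class data: class 0 ("null") and class 1 ("alternative").
n0, n1, p = 500, 500, 5
X = np.vstack([rng.normal(0.0, 1.0, (n0, p)),
               rng.normal(1.0, 1.0, (n1, p))])
y = np.concatenate([np.zeros(n0), np.ones(n1)])

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Step 1: fit any base classifier that produces a score for class 1.
clf = LogisticRegression().fit(X_train, y_train)

# Step 2: set the threshold from held-out class-0 scores so that the fraction
# of class-0 scores above it (the empirical type I error) is at most alpha.
alpha = 0.05
scores0 = clf.predict_proba(X_hold[y_hold == 0])[:, 1]
threshold = np.quantile(scores0, 1 - alpha)

# Step 3: evaluate both error types with the alpha-constrained threshold.
scores1 = clf.predict_proba(X_hold[y_hold == 1])[:, 1]
print("empirical type I error :", np.mean(scores0 > threshold))
print("empirical type II error:", np.mean(scores1 <= threshold))
```

Note that thresholding at the empirical (1 − α) quantile only constrains the type I error on the held-out sample used to pick the threshold; the NP classification algorithms reviewed in this chapter choose the threshold more conservatively so that the population type I error is controlled below α with high probability.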

Keywords

Classification · Genomic applications · Neyman–Pearson · Statistical learning · Methodology


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Statistics, University of California, Los Angeles, Los Angeles, USA
  2. Department of Data Sciences and Operations, University of Southern California, Los Angeles, USA
