Abstract
The Neyman–Pearson (NP) classification paradigm addresses an important binary classification problem where users want to minimize type II error while controlling type I error under some specified level α, usually a small number. This problem is often faced in many genomic applications involving binary classification tasks. The terminology Neyman–Pearson classification paradigm arises from its connection to the Neyman–Pearson paradigm in hypothesis testing. The NP paradigm is applicable when one type of error (e.g., type I error) is far more important than the other type (e.g., type II error), and users have a specific target bound for the former. In this chapter, we review the NP classification literature, with a focus on the genomic applications as well as our contribution to the NP classification theory and algorithms. We also provide simulation examples and a genomic case study to demonstrate how to use the NP classification algorithm in practice.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Audibert, J., Tsybakov, A.: Fast learning rates for plug-in classifiers under the margin condition. Annals of Statistics 35, 608–633 (2007)
Bi, J., Xiong, T., Yu, S., Dundar, M., Rao, R.B.: An improved multi-task learning approach with applications in medical diagnosis. In: Machine Learning and Knowledge Discovery in Databases, pp. 117–132. Springer (2008)
Blanchard, G., Lee, G., Scott, C.: Semi-supervised novelty detection. Journal of Machine Learning Research 11, 2973–3009 (2010)
Booij, B.B., Lindahl, T., Wetterberg, P., Skaane, N.V., Sæbø, S., Feten, G., Rye, P.D., Kristiansen, L.I., Hagen, N., Jensen, M., et al.: A gene expression pattern in blood for the early detection of Alzheimer’s disease. Journal of Alzheimer’s Disease 23 (1), 109–119 (2011)
Boyle, A.P., Song, L., Lee, B.K., London, D., Keefe, D., Birney, E., Iyer, V.R., Crawford, G.E., Furey, T.S.: High-resolution genome-wide in vivo footprinting of diverse transcription factors in human cells. Genome research 21 (3), 456–464 (2011)
Breiman, L.: Random forests. Machine learning 45 (1), 5–32 (2001)
Bulyk, M.L., et al.: Computational prediction of transcription-factor binding site locations. Genome biology 5 (1), 201–201 (2004)
Cannon, A., Howse, J., Hush, D., Scovel, C.: Learning with the Neyman-Pearson and min-max criteria. Technical Report LA-UR-02-2951 (2002)
Casasent, D., Chen, X.: Radial basis function neural networks for nonlinear fisher discrimination and Neyman-Pearson classification. Neural Networks 16 (5–6), 529–535 (2003)
Cortes, C., Vapnik, V.: Support-vector networks. Machine learning 20 (3), 273–297 (1995)
Cox, D.R.: The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series B (Methodological) pp. 215–242 (1958)
Degner, J.F., Pai, A.A., Pique-Regi, R., Veyrieras, J.B., Gaffney, D.J., Pickrell, J.K., De Leon, S., Michelini, K., Lewellen, N., Crawford, G.E., et al.: DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482 (7385), 390–394 (2012)
Dümbgen, L., Igl, B., Munk, A.: P-values for classification. Electronic Journal of Statistics 2, 468–493 (2008)
Elkan, C.: The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence pp. 973–978 (2001)
Feng, Y., Li, J., Tong, X.: nproc: Neyman-Pearson Receiver Operator Curve (2016). URL http://CRAN.R-project.org/package=nproc. R package version 0.1
Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M., Haussler, D.: Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16 (10), 906–914 (2000)
Galas, D.J., Schmitz, A.: DNase footprinting a simple method for the detection of protein-DNA binding specificity. Nucleic acids research 5 (9), 3157–3170 (1978)
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. science 286 (5439), 531–537 (1999)
Han, M., Chen, D., Sun, Z.: Analysis to Neyman-Pearson classification with convex loss function. Anal. Theory Appl. 24 (1), 18–28 (2008). DOI 10.1007/s10496-008-0018-3
He, H.H., Meyer, C.A., Chen, M.W., Zang, C., Liu, Y., Rao, P.K., Fei, T., Xu, H., Long, H., Liu, X.S., et al.: Refined DNase-seq protocol and data analysis reveals intrinsic bias in transcription factor footprint identification. Nature methods 11 (1), 73–78 (2014)
Huang, H., Liu, C.C., Zhou, X.J.: Bayesian approach to transforming public gene expression repositories into disease diagnosis databases. Proceedings of the National Academy of Sciences 107 (15), 6823–6828 (2010)
Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., et al.: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature medicine 7 (6), 673–679 (2001)
Koltchinskii, V.: Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems (2008)
Kotsiantis, S.B., Zaharakis, I., Pintelas, P.: Supervised machine learning: A review of classification techniques. Informatica 31, 249–268 (2007)
Lee, Y., Lee, C.K.: Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics 19 (9), 1132–1139 (2003)
Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: Machine learning: ECML-98, pp. 4–15. Springer (1998)
Liu, C.C., Hu, J., Kalakrishnan, M., Huang, H., Zhou, X.J.: Integrative disease classification based on cross-platform microarray data. BMC Bioinformatics 10 (Suppl 1), S25 (2009)
Liu, F., Wee, C.Y., Chen, H., Shen, D.: Inter-modality relationship constrained multi-modality multi-task feature selection for Alzheimer’s disease and mild cognitive impairment identification. NeuroImage 84, 466–475 (2014)
Ma, S., Song, X., Huang, J.: Supervised group lasso with applications to microarray data analysis. BMC bioinformatics 8 (1), 1 (2007)
Mammen, E., Tsybakov, A.: Smooth discrimination analysis. Annals of Statistics 27, 1808–1829 (1999)
Neph, S., Vierstra, J., Stergachis, A.B., Reynolds, A.P., Haugen, E., Vernot, B., Thurman, R.E., John, S., Sandstrom, R., Johnson, A.K., et al.: An expansive human regulatory lexicon encoded in transcription factor footprints. Nature 489 (7414), 83–90 (2012)
Ng, K.L.S., Mishra, S.K.: De novo svm classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics 23 (11), 1321–1330 (2007)
Park, P.J., Tian, L., Kohane, I.S.: Linking gene expression data with patient survival times using partial least squares. Bioinformatics 18 (suppl 1), S120–S127 (2002)
Phillips, J.E., Corces, V.G.: Ctcf: master weaver of the genome. Cell 137 (7), 1194–1211 (2009)
Platt, J., et al.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10 (3), 61–74 (1999)
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., Poggio, T., Gerald, W., Loda, M., Lander, E.S., Golub, T.R.: Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences 98 (26), 15,149–15,154 (2001)
Rigollet, P., Tong, X.: Neyman-Pearson classification, convexity and stochastic constraints. Journal of Machine Learning Research 12, 2831–2855 (2011)
Scott, C.: Comparison and design of Neyman-Pearson classifiers. Unpublished (2005)
Scott, C.: Performance measures for Neyman-Pearson classification. IEEE Transactions on Information Theory 53 (8), 2852–2863 (2007)
Scott, C., Nowak, R.: A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory 51 (11), 3806–3819 (2005)
Segal, N.H., Pavlidis, P., Antonescu, C.R., Maki, R.G., Noble, W.S., DeSantis, D., Woodruff, J.M., Lewis, J.J., Brennan, M.F., Houghton, A.N., Cordon-Cardo, C.: Classification and subtype prediction of adult soft tissue sarcoma by functional genomics. The American Journal of Pathology 163 (2), 691–700 (2003)
Song, L., Zhang, Z., Grasfeder, L.L., Boyle, A.P., Giresi, P.G., Lee, B.K., Sheffield, N.C., Gräf, S., Huss, M., Keefe, D., et al.: Open chromatin defined by DNaseI and faire identifies regulatory elements that shape cell-type identity. Genome research 21 (10), 1757–1767 (2011)
Specht, D.F.: Probabilistic neural networks. Neural networks 3 (1), 109–118 (1990)
Statnikov, A., Aliferis, C.F., Tsamardinos, I., Hardin, D., Levy, S.: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21 (5), 631–643 (2005)
Tarigan, B., van de Geer, S.: Classifiers of support vector machine type with l1 complexity regularization. Bernoulli 12, 1045–1076 (2006)
Tong, X.: A plug-in approach to Neyman-Pearson classification. Journal of Machine Learning Research 14, 3011–3040 (2013)
Tong, X., Feng, Y., Li, J.J.: Neyman-pearson (np) classification algorithms and np receiver operating characteristic (np-roc) curves Manuscript
Tong, X., Feng, Y., Zhao, A.: A survey on Neyman-Pearson classification and suggestions for future research. Wiley Interdisciplinary Reviews: Computational Statistics 8, 64–81 (2016)
Tsybakov, A.: Optimal aggregation of classifiers in statistical learning. Annals of Statistics 32, 135–166 (2004)
Tsybakov, A., van de Geer, S.: Square root penalty: Adaptation to the margin in classification and in edge estimation. Annals of Statistics 33, 1203–1224 (2005)
Wei, J.S., Greer, B.T., Westermann, F., Steinberg, S.M., Son, C.G., Chen, Q.R., Whiteford, C.C., Bilke, S., Krasnoselsky, A.L., Cenacchi, N., et al.: Prediction of clinical outcome using gene expression profiling and artificial neural networks for patients with neuroblastoma. Cancer research 64 (19), 6883–6891 (2004)
Wu, S., Lin, K., Chen, C., M., C.: Asymmetric support vector machines: low false-positive learning under the user tolerance (2008)
Xing, E.P., Jordan, M.I., Karp, R.M., et al.: Feature selection for high-dimensional genomic microarray data. In: ICML, vol. 1, pp. 601–608. Citeseer (2001)
Yanai, I., Benjamin, H., Shmoish, M., Chalifa-Caspi, V., Shklar, M., Ophir, R., Bar-Even, A., Horn-Saban, S., Safran, M., Domany, E., et al.: Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics 21 (5), 650–659 (2005)
Yang, Y.: Minimax nonparametric classification-part i: rates of convergence. IEEE Transaction Information Theory 45, 2271–2284 (1999)
Zadrozny, B., Langford, J., Abe, N.: Cost-sensitive learning by cost-proportionate example weighting. IEEE International Conference on Data Mining p. 435 (2003)
Zhang, D., Shen, D., Initiative, A.D.N., et al.: Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage 59 (2), 895–907 (2012)
Zhao, A., Feng, Y., Wang, L., Tong, X.: Neyman-Pearson classification under high dimensional settings (2015). URL http://arxiv.org/abs/1508.03106
Zhou, J., Yuan, L., Liu, J., Ye, J.: A multi-task learning formulation for predicting disease progression. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 814–822. ACM (2011)
Acknowledgements
Dr. Jingyi Jessica Li’s work was supported by the start-up fund of the UCLA Department of Statistics and the Hellman Fellowship. Dr. Xin Tong’s work was supported by Zumberge Individual Award from University of Southern California and summer research support from Marshall School of Business. We thank Dr. Yang Feng in Department of Statistics at Columbia University and Ms. Anqi Zhao in Department of Statistics at Harvard University for their help in developing the Neyman–Pearson classification algorithms. We also thank Dr. Wei Li and Mr. Sheng’en Shawn Hu in Dr. X. Shirley Liu’s group in Department of Biostatistics and Computational Biology at Dana-Farber Cancer Institute and Harvard School of Public Health for kindly sharing the data for our genomic case study in Sect. 4.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Li, J.J., Tong, X. (2016). Genomic Applications of the Neyman–Pearson Classification Paradigm. In: Wong, KC. (eds) Big Data Analytics in Genomics. Springer, Cham. https://doi.org/10.1007/978-3-319-41279-5_4
Download citation
DOI: https://doi.org/10.1007/978-3-319-41279-5_4
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41278-8
Online ISBN: 978-3-319-41279-5
eBook Packages: Computer ScienceComputer Science (R0)