Genomic Applications of the Neyman–Pearson Classification Paradigm

Li, Jingyi Jessica; Tong, Xin

doi:10.1007/978-3-319-41279-5_4

Chapter
First Online: 25 October 2016

Abstract

The Neyman–Pearson (NP) classification paradigm addresses an important binary classification problem in which users want to minimize the type II error while keeping the type I error below a specified level α, usually a small number. This problem arises in many genomic applications that involve binary classification tasks. The term Neyman–Pearson classification paradigm comes from its connection to the Neyman–Pearson paradigm in hypothesis testing. The NP paradigm applies when one type of error (e.g., type I error) is far more important than the other (e.g., type II error) and users have a specific target bound for the former. In this chapter, we review the NP classification literature, with a focus on genomic applications and on our contributions to NP classification theory and algorithms. We also provide simulation examples and a genomic case study to demonstrate how to use the NP classification algorithm in practice.
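
To make the constraint concrete, the following is a minimal sketch of the idea of minimizing type II error subject to a type I error bound α. It is not the chapter's algorithm: it assumes each observation already has a numeric score (higher means more likely class 1), and it simply picks the smallest threshold whose empirical type I error on a held-out class 0 sample is at most α. The function names `np_threshold` and `error_rates` are hypothetical, and controlling the empirical error on a sample does not by itself guarantee that the population type I error stays below α.

```python
import math
import random


def np_threshold(class0_scores, alpha):
    """Return a threshold with empirical type I error <= alpha.

    Assumes 0 < alpha < 1 and the rule "predict class 1 when score > threshold".
    Illustrative only; not the algorithm described in the chapter.
    """
    s = sorted(class0_scores)
    n = len(s)
    # At most floor(alpha * n) class 0 scores may lie strictly above the threshold.
    allowed = math.floor(alpha * n)
    # Using the (n - allowed)-th smallest class 0 score leaves at most
    # `allowed` class 0 scores above it, and a lower threshold would
    # generally violate the bound, so type II error is kept as small as possible.
    return s[n - allowed - 1]


def error_rates(scores0, scores1, threshold):
    """Empirical type I error (class 0 called 1) and type II error (class 1 called 0)."""
    type1 = sum(x > threshold for x in scores0) / len(scores0)
    type2 = sum(x <= threshold for x in scores1) / len(scores1)
    return type1, type2


if __name__ == "__main__":
    random.seed(0)
    # Simulated scores: class 0 centered at 0, class 1 centered at 2 (assumed setup).
    scores0 = [random.gauss(0.0, 1.0) for _ in range(200)]
    scores1 = [random.gauss(2.0, 1.0) for _ in range(200)]
    alpha = 0.05
    thr = np_threshold(scores0, alpha)
    t1, t2 = error_rates(scores0, scores1, thr)
    print(f"threshold={thr:.3f} type I={t1:.3f} type II={t2:.3f}")
```

Lowering α pushes the threshold up, which reduces type I error at the cost of a larger type II error; this trade-off is what the NP paradigm makes explicit.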

Acknowledgements

Dr. Jingyi Jessica Li’s work was supported by the start-up fund of the UCLA Department of Statistics and the Hellman Fellowship. Dr. Xin Tong’s work was supported by the Zumberge Individual Award from the University of Southern California and summer research support from the Marshall School of Business. We thank Dr. Yang Feng in the Department of Statistics at Columbia University and Ms. Anqi Zhao in the Department of Statistics at Harvard University for their help in developing the Neyman–Pearson classification algorithms. We also thank Dr. Wei Li and Mr. Sheng’en Shawn Hu in Dr. X. Shirley Liu’s group in the Department of Biostatistics and Computational Biology at Dana-Farber Cancer Institute and Harvard School of Public Health for kindly sharing the data for our genomic case study in Sect. 4.

Author information

Authors and Affiliations

Department of Statistics, University of California, Los Angeles, Los Angeles, CA, USA
Jingyi Jessica Li
Department of Data Sciences and Operations, University of Southern California, Los Angeles, CA, USA
Xin Tong

Corresponding author

Correspondence to Jingyi Jessica Li.

Editor information

Editors and Affiliations

Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
Ka-Chun Wong

Cite this chapter

Li, J.J., Tong, X. (2016). Genomic Applications of the Neyman–Pearson Classification Paradigm. In: Wong, KC. (eds) Big Data Analytics in Genomics. Springer, Cham. https://doi.org/10.1007/978-3-319-41279-5_4

Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41278-8
Online ISBN: 978-3-319-41279-5
eBook Packages: Computer Science (R0)