A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification

Algamal, Zakariya Yahya; Lee, Muhammad Hisyam

doi:10.1007/s11634-018-0334-1

Regular Article
Published: 07 August 2018

Volume 13, pages 753–771, (2019)

Advances in Data Analysis and Classification

1430 Accesses
45 Citations

Abstract

High-dimensional gene expression data commonly pose two problems: many of the genes may be irrelevant, and the genes are often highly correlated. Gene selection has been shown to improve the results of many classification methods. Sparse logistic regression, using either the least absolute shrinkage and selection operator (lasso) or the smoothly clipped absolute deviation penalty, is one of the most widely applied gene selection methods in cancer classification. However, it faces a critical challenge in practice when genes are highly correlated. To address this problem, a two-stage sparse logistic regression is proposed. It aims to obtain an efficient subset of genes with high classification capability by combining a screening approach, used as a filter method, with the adaptive lasso with a new weight, used as an embedded method. In the first stage, sure independence screening retains the genes that have a high individual correlation with the cancer class level. In the second stage, the adaptive lasso with the new weight is applied to address the high correlations among the genes retained by the first stage. Experimental results on four publicly available gene expression datasets show that the proposed method significantly outperforms three state-of-the-art methods in terms of classification accuracy, G-mean, area under the curve, and stability. In addition, the results demonstrate that the top selected genes are biologically related to the cancer type. Thus, the proposed method can be useful for cancer classification using DNA gene expression data in real clinical practice.
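The two-stage idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several pieces are assumptions made only for illustration: the screening score is the absolute marginal correlation, the screening size is the commonly used d = n / log(n), and, because the abstract does not define the paper's new weight, the standard adaptive lasso weight 1 / |ridge estimate| stands in for it. All function names (`sis_screen`, `ridge_logistic`, `weighted_l1_logistic`, `two_stage_select`) are hypothetical.

```python
import numpy as np


def _sigmoid(z):
    # Clipping avoids overflow in exp for extreme linear predictors.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))


def sis_screen(X, y, d):
    """Stage 1 (filter): keep the d genes whose score, proportional to the
    absolute marginal correlation with the class indicator, is largest."""
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    score = np.abs(Xs.T @ (y - y.mean())) / len(y)
    return np.sort(np.argsort(score)[::-1][:d])


def _step_size(X, extra=0.0):
    # 1 / Lipschitz bound of the mean logistic-loss gradient (with intercept).
    n = X.shape[0]
    return 1.0 / ((np.linalg.norm(X, 2) ** 2 + n) / (4.0 * n) + extra)


def ridge_logistic(X, y, alpha=1.0, n_iter=2000):
    """Initial estimate used to build the weights (an assumption: plain ridge)."""
    n, p = X.shape
    b0, b = 0.0, np.zeros(p)
    step = _step_size(X, alpha)
    for _ in range(n_iter):
        r = _sigmoid(b0 + X @ b) - y
        b0 -= step * r.mean()
        b -= step * ((X.T @ r) / n + alpha * b)
    return b


def weighted_l1_logistic(X, y, lam, weights, n_iter=2000):
    """Proximal gradient descent for the logistic loss plus
    lam * sum_j weights[j] * |b[j]|; the intercept is not penalized."""
    n, p = X.shape
    b0, b = 0.0, np.zeros(p)
    step = _step_size(X)
    for _ in range(n_iter):
        r = _sigmoid(b0 + X @ b) - y
        b0 -= step * r.mean()
        z = b - step * (X.T @ r) / n
        # Soft-thresholding with a per-gene threshold.
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam * weights, 0.0)
    return b0, b


def two_stage_select(X, y, d, lam):
    """Stage 1 screening followed by Stage 2 adaptive lasso on the survivors."""
    keep = sis_screen(X, y, d)
    Xk = X[:, keep]
    Xk = (Xk - Xk.mean(axis=0)) / (Xk.std(axis=0) + 1e-12)
    # Stand-in weight, NOT the paper's new weight: 1 / |ridge coefficient|.
    w = 1.0 / (np.abs(ridge_logistic(Xk, y)) + 1e-6)
    b0, b = weighted_l1_logistic(Xk, y, lam, w)
    nonzero = b != 0
    return keep[nonzero], b0, b[nonzero]


# Synthetic example: 60 samples, 500 "genes", only the first two informative.
rng = np.random.default_rng(0)
n, p = 60, 500
X = rng.normal(size=(n, p))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(float)
d = int(n / np.log(n))  # assumed screening size
genes, intercept, coefs = two_stage_select(X, y, d=d, lam=0.05)
print("selected gene indices:", genes)
```

In this sketch the penalty level `lam` is fixed by hand; in practice it would normally be chosen by cross-validation, and the screening size `d` is likewise a tuning choice rather than something the abstract specifies.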

Author information

Authors and Affiliations

Department of Statistics and Informatics, University of Mosul, Mosul, Iraq
Zakariya Yahya Algamal
Department of Mathematical Sciences, Faculty of Science, Universiti Teknologi Malaysia, Skudai, Johor, Malaysia
Muhammad Hisyam Lee

Corresponding author

Correspondence to Muhammad Hisyam Lee.

About this article

Cite this article

Algamal, Z.Y., Lee, M.H. A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification. Adv Data Anal Classif 13, 753–771 (2019). https://doi.org/10.1007/s11634-018-0334-1

Received: 02 February 2018
Revised: 27 April 2018
Accepted: 24 July 2018
Published: 07 August 2018
Issue Date: 01 September 2019
DOI: https://doi.org/10.1007/s11634-018-0334-1
