A distributed sparse logistic regression with \(L_{1/2}\) regularization for microarray biomarker discovery in cancer classification

Abstract

Microarrays are a high-throughput gene expression profiling technology that can be used to classify cancer types and to select highly relevant cancer biomarkers (i.e., genes). To make better use of the ever-increasing volume of microarray data, data-integrative analysis has become an active research direction. However, the complexity of gene expression data still poses several challenges for data integration methods: (1) selecting relevant biomarkers across multiple high-dimensional datasets; (2) batch effects between datasets; (3) high noise in features and samples; (4) the high computational cost of large-scale data analysis. To overcome these challenges, we propose a novel Distribute-based Biological data-Integrative Analysis model (DBIA). DBIA is based on the \(L_{1/2}\) regularized logistic regression (\(L_{1/2}\) LR) model and the alternating direction method of multipliers (ADMM) for data integration. The regularization model is an effective method for selecting latent cancer-relevant genes and improving the accuracy of cancer classification. Moreover, we adopt the \(L_{1/2}\) LR model to reduce the noise and dimensionality of the data. ADMM is employed to reduce the batch effects between datasets, analyze multiple datasets in parallel, and lower the computational cost of large-scale data analysis. Experimental results on simulated and real-world datasets demonstrate that DBIA achieves good prediction performance with shorter running time, lower hardware requirements, and strong robustness. The genes selected by DBIA also have biological significance.
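To make the combination of \(L_{1/2}\) regularized logistic regression and ADMM more concrete, the following is a minimal sketch of a generic consensus-ADMM scheme, not the DBIA implementation itself. It assumes each dataset keeps a local coefficient vector that is pulled toward a shared vector `z`, and that the \(L_{1/2}\) penalty is applied to `z` through the half-thresholding operator of Xu et al. All function names, the plain gradient-descent local solver, the step size, and the values of `lam` and `rho` are illustrative assumptions; because the penalty is nonconvex, convergence is not guaranteed in general.

```python
import numpy as np

def sigmoid(a):
    # Clip the argument to avoid overflow in exp for extreme values.
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30.0, 30.0)))

def half_threshold(t, lam):
    """Elementwise minimizer of (x - t)**2 + lam * |x|**0.5."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    cutoff = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    big = np.abs(t) > cutoff  # entries at or below the cutoff are set to zero
    phi = np.arccos((lam / 8.0) * (np.abs(t[big]) / 3.0) ** -1.5)
    out[big] = (2.0 / 3.0) * t[big] * (
        1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0)
    )
    return out

def local_update(X, y, z, u, rho, steps=25, lr=0.1):
    """Approximate the local step with a few gradient-descent iterations on
    mean logistic loss + (rho / 2) * ||beta - z + u||**2 (assumed solver)."""
    beta = z.copy()
    n = X.shape[0]
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ beta) - y) / n + rho * (beta - z + u)
        beta -= lr * grad
    return beta

def consensus_admm(datasets, lam=0.05, rho=1.0, iters=50):
    """datasets: list of (X, y) pairs sharing the same feature columns."""
    K = len(datasets)
    p = datasets[0][0].shape[1]
    z = np.zeros(p)
    us = [np.zeros(p) for _ in range(K)]
    for _ in range(iters):
        # Local steps are independent, so they could run in parallel.
        betas = [local_update(X, y, z, u, rho) for (X, y), u in zip(datasets, us)]
        # Global step: average, then apply the L_{1/2} half-thresholding.
        v = np.mean([b + u for b, u in zip(betas, us)], axis=0)
        z = half_threshold(v, 2.0 * lam / (K * rho))
        # Scaled dual update.
        us = [u + b - z for u, b in zip(us, betas)]
    return z

# Toy usage on synthetic data (hypothetical, not the paper's experiments).
rng = np.random.default_rng(0)
true_beta = np.zeros(20)
true_beta[:3] = [2.0, -2.0, 1.5]
datasets = []
for _ in range(2):
    X = rng.normal(size=(60, 20))
    y = (rng.random(60) < sigmoid(X @ true_beta)).astype(float)
    datasets.append((X, y))

z = consensus_admm(datasets)
print("nonzero coefficients:", np.flatnonzero(z))
```

In this sketch only the averaged coefficients are exchanged between datasets, which is what allows the local steps to be computed separately. For real microarray data with thousands of genes, the fixed step size would likely need to be reduced or replaced by a line search.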

Data availability

The authors have not disclosed any funding.

Funding

This work was supported in part by the major key project of Peng Cheng Laboratory (PCL2021A12), the Macau Science and Technology Development Funds Grant No. 0056/2020/FJ from the Macau Special Administrative Region of the People’s Republic of China, and the Key Project for University of Educational Commission of Guangdong Province of China Funds (Natural, Grant No. 2019GZDXM005).

Author information

Contributions

Ning Ai and Ziyi Yang conceived the presented idea; Ning Ai and Dong Ouyang were involved in the methodology; Ning Ai contributed to the simulation and real experiments; Ning Ai and Yong Liang discussed the results; Yong Liang was involved in the supervision; Ning Ai, Yong Liang, Haoliang Yuan, Dong Ouyang, and Yuhan Ji contributed to the final manuscript.

Corresponding author

Correspondence to Yuhan Ji.

Ethics declarations

Conflict of interest

The authors of this work declare no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

About this article

Cite this article

Ai, N., Yang, Z., Yuan, H. et al. A distributed sparse logistic regression with \(L_{1/2}\) regularization for microarray biomarker discovery in cancer classification. Soft Comput 27, 2537–2552 (2023). https://doi.org/10.1007/s00500-022-07551-5

  • DOI: https://doi.org/10.1007/s00500-022-07551-5

Keywords

  • Microarray data integration
  • \(L_{1/2}\) regularized logistic regression
  • ADMM algorithm
  • Cancer classification
  • Gene selection