Abstract
Microarrays are a high-throughput gene expression profiling technology that can be used to classify cancer types and to select highly relevant cancer biomarkers (i.e., genes). To make better use of the ever-increasing volume of microarray data, data-integrative analysis has become an active research direction. However, the complexity of gene expression data still poses several challenges for data integration methods: (1) selecting relevant biomarkers across multiple high-dimensional datasets; (2) batch effects between datasets; (3) high noise in features and samples; and (4) the high computational cost of large-scale data analysis. To overcome these challenges, we propose a novel Distribute-based Biological data-Integrative Analysis model, DBIA. DBIA is based on the \(L_{1/2}\) regularized logistic regression (\(L_{1/2}\) LR) model and the alternating direction method of multipliers (ADMM) for data integration. Regularization is an effective way to select latent cancer-relevant genes and improve the accuracy of cancer classification, and we adopt the \(L_{1/2}\) LR model to reduce the noise and dimensionality of the data. ADMM is employed to reduce the batch effects between datasets, analyze multiple datasets in parallel, and lower the computational cost of large-scale data analysis. Experimental results on simulated and real-world datasets demonstrate that DBIA achieves good prediction performance with shorter running time, lower hardware requirements, and strong robustness. The genes selected by DBIA also show biological significance.
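As a rough illustration of how an \(L_{1/2}\) penalty and ADMM can be combined, the sketch below uses a standard global-consensus ADMM splitting: each dataset fits its own logistic regression coefficients, and a shared sparse vector is updated with the half-thresholding operator associated with the \(L_{1/2}\) penalty. This is not the authors' implementation. The function names (`half_threshold`, `local_update`, `consensus_admm`), the gradient-descent inner solver, and all default parameter values are assumptions made for illustration only.

```python
import numpy as np

def half_threshold(t, lam):
    """Elementwise minimizer of (x - t)**2 + lam * |x|**0.5 (half-thresholding)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    out = np.zeros_like(t)
    # Entries at or below this cutoff are set exactly to zero.
    cutoff = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    big = np.abs(t) > cutoff
    phi = np.arccos((lam / 8.0) * (np.abs(t[big]) / 3.0) ** -1.5)
    out[big] = (2.0 / 3.0) * t[big] * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return out

def local_update(X, y, z, u, rho, steps=50, lr=0.1):
    """Approximately minimize mean logistic loss + (rho/2) * ||b - z + u||^2.

    Plain gradient descent is an assumed inner solver, chosen for brevity.
    """
    b = z.copy()
    n = X.shape[0]
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ b, -30.0, 30.0)))
        grad = X.T @ (p - y) / n + rho * (b - z + u)
        b -= lr * grad
    return b

def consensus_admm(datasets, lam=0.05, rho=1.0, iters=100):
    """Consensus ADMM over a list of (X, y) datasets sharing the same features.

    Returns the shared sparse coefficient vector z.
    """
    K = len(datasets)
    p = datasets[0][0].shape[1]
    z = np.zeros(p)
    B = np.zeros((K, p))   # local coefficients, one row per dataset
    U = np.zeros((K, p))   # scaled dual variables
    for _ in range(iters):
        # The local updates are independent, so they could run in parallel.
        for k, (X, y) in enumerate(datasets):
            B[k] = local_update(X, y, z, U[k], rho)
        # Global update: L_{1/2} proximal step on the averaged local estimates.
        v = (B + U).mean(axis=0)
        z = half_threshold(v, 2.0 * lam / (K * rho))
        # Dual update penalizes disagreement between local and shared vectors.
        U += B - z
    return z
```

Each call to `local_update` touches only one dataset, which is what makes this kind of splitting attractive when datasets are large or held on different machines. Because the \(L_{1/2}\) penalty is nonconvex, a scheme like this is only expected to reach a stationary point, and the result can depend on `lam`, `rho`, and the initialization.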
Data availability
The authors have not disclosed any funding.
Funding
This work was supported in part by the major key project of Peng Cheng Laboratory (PCL2021A12), the Macau Science and Technology Development Funds Grant No. 0056/2020/FJ from the Macau Special Administrative Region of the People’s Republic of China, and the Key Project for University of Educational Commission of Guangdong Province of China Funds (Natural, Grant No. 2019GZDXM005).
Author information
Contributions
Ning Ai and Ziyi Yang conceived the presented idea; Ning Ai and Dong Ouyang were involved in the methodology; Ning Ai contributed to the simulation and real experiments; Ning Ai and Yong Liang discussed the results; Yong Liang was involved in the supervision; Ning Ai, Yong Liang, Haoliang Yuan, Dong Ouyang, and Yuhan Ji contributed to the final manuscript.
Ethics declarations
Conflict of interest
The authors of this work declare no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ai, N., Yang, Z., Yuan, H. et al. A distributed sparse logistic regression with \(L_{1/2}\) regularization for microarray biomarker discovery in cancer classification. Soft Comput 27, 2537–2552 (2023). https://doi.org/10.1007/s00500-022-07551-5
DOI: https://doi.org/10.1007/s00500-022-07551-5
Keywords
- Microarray data integration
- \(L_{1/2}\) regularized logistic regression
- ADMM algorithm
- Cancer classification
- Gene selection