Sparse common component analysis for multiple high-dimensional datasets via noncentered principal component analysis

  • Regular Article
  • Statistical Papers

Abstract

The analysis of multiple datasets from different groups is currently much discussed, and identifying a common basic structure shared by the groups has drawn particular attention. To identify such a structure, common component analysis (CCA) was proposed as a generalization of principal component analysis (PCA); i.e., CCA reduces to standard PCA when applied to a single dataset. Although CCA can identify the common structure of multiple datasets, which standard PCA cannot extract, it suffers from two drawbacks. First, the common components are estimated as linear combinations of all variables, which makes the identified common components difficult to interpret. Second, the fully dense loadings lead to erroneous results in CCA, because noisy features are inevitably included in the datasets. To address these issues, we incorporate sparsity into CCA and propose a novel strategy for sparse common component analysis based on \(L_{1}\)-type regularized regression modeling. We focus on CCA formulated as the eigenvalue decomposition (EVD) of a Gram matrix (i.e., the common loadings of multiple datasets can be estimated by the EVD of a Gram matrix), which can be performed by the singular value decomposition (SVD) of a square root of the Gram matrix. We then propose sparse common component analysis based on sparse PCA, together with an algorithm, to estimate sparse common loadings of multiple datasets. The proposed method can not only identify a common subspace but also select crucial common features for multiple groups. Monte Carlo simulations and real-data analysis are conducted to examine the efficiency of the proposed sparse CCA. We observe from the numerical studies that our strategy can incorporate sparsity into the common loading estimation and efficiently recover a sparse common structure in multiple-dataset analysis.
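The link between the EVD of a Gram matrix and the SVD of its square root can be illustrated with a minimal sketch. The abstract does not define the Gram matrix, so the sketch assumes it is the sum of noncentered cross-products \(X_k^{\top}X_k\) over the groups; the function names (`gram_matrix`, `symmetric_sqrt`, `loadings_by_evd`, `loadings_by_svd`) and the simulated data are hypothetical and are not taken from the article. The sparse estimation step itself is not shown.

```python
import numpy as np


def gram_matrix(datasets):
    """Sum of noncentered cross-products X_k^T X_k (an assumed definition)."""
    return sum(X.T @ X for X in datasets)


def symmetric_sqrt(G):
    """Symmetric square root of a positive semidefinite matrix, built from its EVD."""
    vals, vecs = np.linalg.eigh(G)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative round-off
    return (vecs * np.sqrt(vals)) @ vecs.T


def loadings_by_evd(G, r):
    """Leading r eigenvectors of G (dense common loadings under the assumption above)."""
    vals, vecs = np.linalg.eigh(G)  # eigenvalues come back in ascending order
    order = np.argsort(vals)[::-1][:r]
    return vecs[:, order]


def loadings_by_svd(G_half, r):
    """Leading r right singular vectors of a square root of G."""
    _, _, vt = np.linalg.svd(G_half)  # singular values come back in descending order
    return vt[:r].T


# Three simulated groups sharing the same 6 variables (illustrative data only).
rng = np.random.default_rng(0)
datasets = [rng.standard_normal((n, 6)) for n in (30, 40, 50)]

G = gram_matrix(datasets)
G_half = symmetric_sqrt(G)
V_evd = loadings_by_evd(G, 2)
V_svd = loadings_by_svd(G_half, 2)

# Eigenvectors are defined only up to sign, so compare the projectors
# onto the two estimated subspaces rather than the vectors themselves.
subspace_gap = np.linalg.norm(V_evd @ V_evd.T - V_svd @ V_svd.T)
```

Under the same assumption, the row-wise stacked data matrix is another valid square root, since its cross-product equals the sum of the per-group cross-products. Writing the problem as an SVD is what allows regression-type sparse PCA techniques to be applied to the loadings.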

Author information

Corresponding author

Correspondence to Heewon Park.

Electronic supplementary material

Supplementary material 1 (xlsx 24 KB)

About this article

Cite this article

Park, H., Konishi, S. Sparse common component analysis for multiple high-dimensional datasets via noncentered principal component analysis. Stat Papers 61, 2283–2311 (2020). https://doi.org/10.1007/s00362-018-1045-6

  • DOI: https://doi.org/10.1007/s00362-018-1045-6
