Abstract
Background
Pan-cancer analysis examines both the commonalities and the heterogeneity of genomic and cellular alterations across many tumor types. Multi-omics data from The Cancer Genome Atlas (TCGA) program make it possible to perform pan-cancer analysis of gene expression, tumor mutational burden (TMB), microsatellite instability (MSI), the tumor immune microenvironment (TIME), and methylation. Several online tools analyze gene and protein expression, mutation, methylation, and survival in TCGA data. However, these tools are either uni-functional or unable to run user-defined analyses. We therefore created the TCGAplot R package to facilitate pan-cancer analysis and visualization of built-in multi-omics TCGA data.
Results
TCGAplot provides functions for pan-cancer paired and unpaired differential gene expression analysis and for pan-cancer correlation analysis between gene expression and TMB, MSI, TIME, and promoter methylation. Visualization functions include paired/unpaired box plots, survival plots, ROC curves, heatmaps, scatter plots, radar charts, and forest plots. Gene set based pan-cancer and tumor-specific analyses are also available. Finally, all built-in multi-omics data can be extracted for use in user-defined functions, which makes pan-cancer analysis more convenient.
Conclusions
We developed an R-package for integrative pan-cancer analysis and visualization of TCGA multi-omics data. The source code and pre-built package are available at GitHub (https://github.com/tjhwangxiong/TCGAplot).
Background
Cancer is a major public health problem and a leading cause of death worldwide, with new cases and deaths increasing each year [1]. Tumor occurrence and progression are accompanied by dysregulation of oncogenes and tumor suppressor genes, caused in part by mutation and by promoter and gene body methylation [2]. Immune escape is one of the essential hallmarks of cancer: cancer cells evade immune surveillance by disrupting their crosstalk with immune cells within the tumor microenvironment (TME). The TME and the tumor immune microenvironment (TIME) attract much attention in cancer research, and strategies targeting the TME have emerged as promising approaches for cancer treatment [3]. Advances in multi-omics technologies give access to multi-layer information from the genome, transcriptome, proteome, metabolome, and epigenome, fueling the development of cancer precision medicine [4].
The Cancer Genome Atlas (TCGA) is one of the largest collections of multi-omics data, covering 33 cancer types and more than 20 000 samples, and includes exome sequencing, RNA sequencing, microRNA sequencing, copy number variation, proteome, and methylome data [5]. Several online tools have been developed for bioinformatic analysis of TCGA data. Tang et al. [6] developed the web server GEPIA2 to quantify gene expression both at the pan-cancer level and within specific cancer subtypes. The cBioPortal for Cancer Genomics (https://www.cbioportal.org/) contains data sets from numerous cancer studies, including TCGA, and enables researchers to explore genetic alterations per gene and per sample [7]. Kaplan–Meier plotter (http://kmplot.com/analysis/) provides pan-cancer survival analysis [8]. Gene Set Cancer Analysis (GSCA, http://bioinfo.life.hust.edu.cn/GSCA/#/) provides gene set analysis of TCGA data at the genomic, pharmacogenomic, and immunogenomic levels [9]. TIMER2.0 is a web server for analyzing immune infiltration across TCGA cancers [10]. MethSurv (https://biit.cs.ut.ee/methsurv/) is a web tool for survival analysis using TCGA methylome data [11]. In addition to these web tools, some R packages have been developed for TCGA data download and for genomic and expression analysis, such as TCGAbiolinks and IOBR [12, 13]. However, an integrative R package for pan-cancer expression analysis and for correlation analysis between gene expression and TMB, MSI, TIME, and promoter methylation is not yet available. We therefore developed TCGAplot, an R package for integrative pan-cancer analysis and visualization of TCGA data.
Implementation
The source code of the TCGAplot R package is publicly available at https://github.com/tjhwangxiong/TCGAplot. A pre-built version (v4.0.0) can be downloaded from https://github.com/tjhwangxiong/TCGAplot/releases/download/v4.0.0/TCGAplot_4.0.0.zip and installed quickly. A detailed vignette is available at https://github.com/tjhwangxiong/TCGAplot/blob/main/vignettes/TCGAplot.Rmd.
Results
Data preparation
The integrated built-in data in TCGAplot R package include TPM (transcripts per million) expression matrix, tumor mutational burden (TMB), microsatellite instability (MSI), immune cell ratio, immune score, promoter methylation, and meta information (Fig. 1).
The TPM expression matrix was downloaded from TCGA (https://portal.gdc.cancer.gov/) using the GDCquery, GDCdownload, and GDCprepare functions of the TCGAbiolinks R package (v2.28.4) [12]. Duplicated samples were removed randomly. Genes with a TPM value of 0 across all samples were excluded, and the final TPM matrix of protein-coding genes is stored as log2(TPM + 1) together with cancer type and group (tumor, normal) information. The somatic mutation data and DNA methylation beta values were also downloaded with the TCGAbiolinks R package. Probes within the TSS1500-island region were selected to represent the promoter region. The MSI values of TCGA patients were downloaded using the cBioPortalData R package (v2.12.0) [14]. The immune cell ratios were downloaded from The Immune Landscape of Cancer (https://api.gdc.cancer.gov/data/b3df502e-3594-46ef-9f94-d041a20a0b9a). The immune scores, including the ESTIMATE, Immune, and Stromal scores, were calculated from the TPM matrix using the estimate R package (v1.0.13) [15]. The gene lists for the ‘stromal signature’ and ‘immune signature’ are summarized in Additional file 1: Table S1.
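The filtering and transformation steps described above can be illustrated on a toy matrix. This is a minimal sketch in base R, not the package's own preprocessing code: the matrix values, gene names, and sample names are invented, and the shuffle-then-deduplicate approach is only one way to keep a random copy of each duplicated sample.

```r
# Toy TPM matrix: genes in rows, samples in columns (all values are made up).
# Sample "S2" appears twice to mimic a duplicated sample.
tpm <- matrix(
  c(0, 0, 0, 0,
    3, 7, 7, 15,
    1, 0, 0, 31),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"),
                  c("S1", "S2", "S2", "S3"))
)

# Shuffle the columns, then keep the first occurrence of each sample ID,
# so that the retained copy of a duplicated sample is chosen at random.
set.seed(1)
shuffled <- tpm[, sample(ncol(tpm)), drop = FALSE]
dedup <- shuffled[, !duplicated(colnames(shuffled)), drop = FALSE]

# Drop genes whose TPM is 0 in every sample.
expressed <- dedup[rowSums(dedup) > 0, , drop = FALSE]

# Store expression as log2(TPM + 1).
log_tpm <- log2(expressed + 1)
```

The pseudocount of 1 keeps zero TPM values finite after the log transform, which is why all-zero genes can be dropped beforehand without affecting the remaining values.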
Pan-cancer expression analysis
Pan-cancer expression analysis includes an unpaired tumor-normal box plot across 33 types of TCGA cancers (Fig. 2a) and a paired tumor-normal box plot across the 15 types of TCGA cancers that have more than 20 pairs of samples (Fig. 2b), produced by the pan_boxplot and pan_paired_boxplot functions, respectively. In addition, the expression of a single gene across the tumor samples (without normal samples) of all 33 cancer types can be plotted with the pan_tumor_boxplot function (Fig. 2c).
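The difference between the unpaired and paired comparisons can be shown with simulated data. The sketch below uses base R only and does not call pan_boxplot or pan_paired_boxplot; the choice of the Wilcoxon test is an assumption made for illustration, since the text does not state which test the package applies, and all expression values are simulated.

```r
set.seed(42)
n <- 20  # number of patients with matched tumor and normal tissue

# Simulated log2(TPM + 1) values; tumor values are built from the same
# patients' normal values, so the two vectors are paired by position.
normal <- rnorm(n, mean = 4, sd = 1)
tumor  <- normal + rnorm(n, mean = 1, sd = 0.5)

# Unpaired comparison: treats the two groups as independent samples.
unpaired <- wilcox.test(tumor, normal)

# Paired comparison: tests the within-patient differences.
paired <- wilcox.test(tumor, normal, paired = TRUE)

p_values <- c(unpaired = unpaired$p.value, paired = paired$p.value)
p_values
```

When matched samples exist, the paired test removes between-patient variability, which is one reason to report the paired plot separately for cancer types with enough matched pairs.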
Pan-cancer correlation analysis
We also provide functions to analyze the correlation between the expression of a single gene and TMB or MSI. The results are visualized as radar charts (Fig. 3a, b).
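A per-cancer correlation of this kind produces one coefficient per cancer type, which is the set of values a radar chart would display. The following base R sketch uses simulated data; the Spearman method, the column names, and the three cancer codes are assumptions for illustration and do not reflect the package's internals.

```r
set.seed(7)

# Simulated per-sample table: cancer type, gene expression, and TMB.
toy <- data.frame(
  cancer = rep(c("BRCA", "LUAD", "COAD"), each = 30),
  expr   = rnorm(90, mean = 5),
  stringsAsFactors = FALSE
)
toy$tmb <- exp(0.5 * toy$expr + rnorm(90))

# One Spearman correlation test per cancer type.
cor_by_cancer <- do.call(rbind, lapply(split(toy, toy$cancer), function(d) {
  ct <- cor.test(d$expr, d$tmb, method = "spearman")
  data.frame(cancer = d$cancer[1],
             rho    = unname(ct$estimate),
             p      = ct$p.value,
             stringsAsFactors = FALSE)
}))
cor_by_cancer
```

A rank-based coefficient is used here because TMB is typically right-skewed; a different method could be substituted in the `cor.test` call.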
Immunotherapy has revolutionized cancer treatment and renewed interest in the TIME. We therefore provide functions to compute the correlation between the expression of a single gene and immune-related genes, including immune checkpoint genes (ICGs) (Fig. 4a), chemokines (Fig. 4b), chemokine receptors (Fig. 4c), immune stimulators (Fig. 4d), and immune inhibitors (Fig. 4e). Two color parameters, “lowcol” and “highcol”, let users define the colors of the low and high ends of the heatmap scale, respectively.
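To make the role of the two color parameters concrete, the sketch below builds a small gene-by-cancer correlation matrix from simulated data and maps each coefficient to a color between a low and a high color. It is an illustration in base R under stated assumptions: the variable names `lowcol` and `highcol` mirror the parameters named above, but the white midpoint, the simulated values, and the example gene symbols are choices made here, not documented package behavior.

```r
set.seed(9)
cancers <- c("BRCA", "LUAD", "COAD")
panel   <- c("CD274", "CTLA4", "PDCD1")  # example immune checkpoint genes

# Simulated correlations: rows are panel genes, columns are cancer types.
cor_mat <- sapply(cancers, function(ct) {
  target <- rnorm(50)  # expression of the gene of interest
  sapply(panel, function(g) cor(target, 0.5 * target + rnorm(50)))
})

# Color scale running from lowcol through white to highcol.
lowcol  <- "blue"
highcol <- "red"
palette_fn <- colorRampPalette(c(lowcol, "white", highcol))

# Map correlations in [-1, 1] onto 101 color steps, keeping the matrix shape.
cell_colors <- cor_mat
cell_colors[] <- palette_fn(101)[round((cor_mat + 1) * 50) + 1]
cell_colors
```

Fixing the scale to the full [-1, 1] range keeps colors comparable between heatmaps, at the cost of less contrast when all correlations are weak.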
Moreover, the correlation between gene expression and immune infiltration can be analyzed, including immune cell ratios (Fig. 5a) and immune scores (Fig. 5b, c).
Pan-cancer cox regression analysis
The Cox regression model is used for survival analysis in clinical research to estimate the hazard ratio (HR) of a given endpoint associated with a specific risk factor, such as the expression of a single gene. We provide a function to perform pan-cancer Cox regression analysis, with or without age adjustment, and to visualize the results as a forest plot (Fig. 6a, b).
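The quantities shown in a forest plot (HR, confidence interval, and p value) can be obtained from a fitted Cox model. The sketch below uses the `coxph` function from the survival package on simulated data for a single cancer type; the helper `hr_table` and all variable names are hypothetical and introduced only for this illustration.

```r
library(survival)

set.seed(11)
n <- 200
expr <- rnorm(n, mean = 5, sd = 1)   # simulated log2(TPM + 1)
age  <- round(runif(n, 40, 80))      # simulated age in years

# Simulated survival times whose hazard rises with expression and age.
time   <- rexp(n, rate = 0.01 * exp(0.4 * (expr - 5) + 0.02 * (age - 60)))
status <- rbinom(n, 1, 0.7)          # 1 = event observed, 0 = censored

# Cox models without and with age adjustment.
fit_uni <- coxph(Surv(time, status) ~ expr)
fit_adj <- coxph(Surv(time, status) ~ expr + age)

# Hypothetical helper: collect HR, 95% CI, and p value per covariate.
hr_table <- function(fit) {
  s <- summary(fit)
  data.frame(HR    = s$conf.int[, "exp(coef)"],
             lower = s$conf.int[, "lower .95"],
             upper = s$conf.int[, "upper .95"],
             p     = s$coefficients[, "Pr(>|z|)"])
}

uni <- hr_table(fit_uni)
adj <- hr_table(fit_adj)
uni
adj
```

Repeating this fit for each cancer type and stacking the rows for the expression term would give the table that a pan-cancer forest plot draws.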
Pan-cancer correlation analysis based on gene set
A gene set, rather than a single gene, may be associated with TMB or MSI, so we also provide functions to analyze the correlation between the expression of a gene set and TMB or MSI. The results are visualized as radar charts (Fig. 7a, b).
We also provide functions to compute the correlation between the expression of a gene set and immune-related genes, including ICGs (Fig. 8a), chemokines (Fig. 8b), chemokine receptors (Fig. 8c), immune stimulators (Fig. 8d), and immune inhibitors (Fig. 8e).
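A gene set has to be reduced to one value per sample before it can be correlated with TMB, MSI, or another gene. The text does not specify how TCGAplot summarizes a gene set, so the sketch below simply assumes the mean log2(TPM + 1) of the member genes; the matrix, gene names, and TMB values are simulated.

```r
set.seed(21)

# Simulated expression matrix: 5 genes (rows) by 8 samples (columns).
expr <- matrix(rnorm(5 * 8, mean = 5), nrow = 5,
               dimnames = list(paste0("G", 1:5), paste0("S", 1:8)))

# Gene set of interest; "G9" is deliberately absent from the matrix.
gene_set <- c("G1", "G3", "G9")
present  <- intersect(gene_set, rownames(expr))

# Assumed summary: per-sample mean expression of the genes that are present.
score <- colMeans(expr[present, , drop = FALSE])

# Correlate the gene set score with a simulated per-sample TMB.
tmb <- rexp(8)
rho <- cor(score, tmb, method = "spearman")
rho
```

Intersecting the gene set with the available row names avoids a subscript error when a requested gene was filtered out of the expression matrix.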
Cancer type specific expression analysis
In addition to pan-cancer analysis, we provide numerous functions for cancer type specific samples. The expression of a single gene can be grouped by clinical data, including unpaired (Fig. 9a) and paired (Fig. 9b) tumor-normal samples, age (Fig. 9c, d), gender (Fig. 9e), and stage (Fig. 9f).
We also provide cancer type specific analysis of gene sets. The expression of a gene set can be grouped by clinical data, including unpaired (Fig. 10a) and paired (Fig. 10b) tumor-normal samples, age (Fig. 10c, d), gender (Fig. 10e), and stage (Fig. 10f).
Tumor samples of a specific cancer type can be further grouped by the expression of a single gene. The differentially expressed genes (DEGs) between the high-expression and low-expression groups can then be identified (Fig. 11a) and analyzed using Gene Set Enrichment Analysis (GSEA), including GSEA-GO (Gene Ontology) (Fig. 11b) and GSEA-KEGG (Kyoto Encyclopedia of Genes and Genomes) (Fig. 11c).
Cancer type specific diagnostic analysis
The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are widely used to examine the sensitivity and specificity of a diagnostic model. We provide a function to draw the ROC curve and calculate the AUC of a diagnostic model based on the expression of a single gene in a specific type of cancer. An example is shown for KLF7 in CHOL (Fig. 12a), HNSC (Fig. 12b), and UCEC (Fig. 12c).
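How an ROC curve and its AUC follow from a single expression value per sample can be shown in a few lines of base R. This is a sketch on simulated data with hypothetical helper names (`roc_points`, `auc_trapezoid`); it assumes higher expression indicates tumor and that there are no tied scores, and it is not the package's plotting function.

```r
set.seed(3)
# Simulated log2(TPM + 1): 40 tumor samples followed by 40 normal samples.
expr  <- c(rnorm(40, mean = 6), rnorm(40, mean = 4))
label <- rep(c(1, 0), each = 40)   # 1 = tumor, 0 = normal

# Sweep the threshold from high to low and record the cumulative
# false positive rate and true positive rate after each sample.
roc_points <- function(score, label) {
  ord <- order(score, decreasing = TRUE)
  lab <- label[ord]
  data.frame(fpr = c(0, cumsum(lab == 0) / sum(lab == 0)),
             tpr = c(0, cumsum(lab == 1) / sum(lab == 1)))
}

# Area under the ROC curve by the trapezoidal rule.
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

roc <- roc_points(expr, label)
auc <- auc_trapezoid(roc)
auc
```

Without ties, this AUC equals the Mann-Whitney U statistic divided by the number of tumor-normal pairs, that is, the probability that a randomly chosen tumor sample has higher expression than a randomly chosen normal sample.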
Cancer type specific correlation analysis
We provide correlation analysis in a specific type of cancer, including gene-gene (Fig. 13a, b) and gene-methylation (Fig. 13c) correlation analysis. GO enrichment analysis of the correlated genes (Fig. 13d) is also available.
Cancer type specific survival analysis
Survival analysis based on the expression (Fig. 14a) or methylation (Fig. 14b) level of a single gene in a specific type of cancer can be performed.
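A common way to run such an analysis is to split patients into high and low groups and compare their survival curves. The sketch below uses `survfit` and `survdiff` from the survival package on simulated data; the median cutoff is an assumption made for illustration, since the text does not state how the package defines the groups.

```r
library(survival)

set.seed(5)
n <- 120
expr <- rnorm(n, mean = 5, sd = 1)   # simulated log2(TPM + 1)

# Simulated survival times: higher expression means a higher hazard.
time   <- rexp(n, rate = 0.02 * exp(0.6 * (expr - 5)))
status <- rep(1, n)                  # all events observed, for simplicity

# Assumed grouping rule: split at the median expression.
group <- factor(ifelse(expr > median(expr), "high", "low"),
                levels = c("low", "high"))

# Kaplan-Meier curves per group and a log-rank test between them.
km <- survfit(Surv(time, status) ~ group)
lr <- survdiff(Surv(time, status) ~ group)
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
p_value
```

Calling `plot(km)` would draw the two curves; the same code applies to methylation by replacing the expression vector with beta values.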
Network construction
Users can also depict the linkages of a single gene or a gene set and GO terms or KEGG pathways as a network using the gene_network_go (Fig. 15a) and gene_network_kegg (Fig. 15b) functions.
Built-in data extraction
All built-in data in our package can be extracted for user-defined functions, including the TPM expression matrix, TMB, MSI, immune cell ratios, immune scores, promoter methylation, and meta information, using the functions listed in Table 1. Users can therefore apply their own functions to run customized analyses of TCGA multi-omics data.
Conclusion
TCGAplot provides a user-friendly interface for analyzing TCGA pan-cancer multi-omics data and uses visualization techniques to help users explore the commonalities and heterogeneity across numerous types of tumors. Specifically, functions have been developed for pan-cancer paired/unpaired expression analysis, correlation analysis, and survival analysis, as well as for analysis with user-defined functions. Overall, we developed an R package for integrative pan-cancer analysis and visualization of TCGA multi-omics data.
Availability of data and materials
The source code is available at GitHub (https://github.com/tjhwangxiong/TCGAplot).
Abbreviations
- TME: Tumor microenvironment
- TIME: Tumor immune microenvironment
- TCGA: The Cancer Genome Atlas
- TPM: Transcripts per million
- TMB: Tumor mutational burden
- MSI: Microsatellite instability
- HR: Hazard ratio
- DEGs: Differentially expressed genes
- GSEA: Gene set enrichment analysis
- GO: Gene Ontology
- KEGG: Kyoto Encyclopedia of Genes and Genomes
References
1. Siegel RL, Miller KD, Fuchs HE, Jemal A. Cancer statistics, 2022. CA Cancer J Clin. 2022;72(1):7–33.
2. Zwierenga F, van Veggel B, van den Berg A, Groen HJM, Zhang L, Groves MR, Kok K, Smit EF, Hiltermann TJN, de Langen AJ, et al. A comprehensive overview of the heterogeneity of EGFR exon 20 variants in NSCLC and (pre)clinical activity to currently available treatments. Cancer Treat Rev. 2023;120:102628.
3. Bejarano L, Jordao MJC, Joyce JA. Therapeutic targeting of the tumor microenvironment. Cancer Discov. 2021;11(4):933–59.
4. He X, Liu X, Zuo F, Shi H, Jing J. Artificial intelligence-based multi-omics analysis fuels cancer precision medicine. Semin Cancer Biol. 2023;88:187–200.
5. Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, Ellrott K, Shmulevich I, Sander C, Stuart JM. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113–20.
6. Tang Z, Kang B, Li C, Chen T, Zhang Z. GEPIA2: an enhanced web server for large-scale expression profiling and interactive analysis. Nucleic Acids Res. 2019;47(W1):W556–60.
7. Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, Jacobsen A, Byrne CJ, Heuer ML, Larsson E, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2012;2(5):401–4.
8. Lanczky A, Gyorffy B. Web-based survival analysis tool tailored for medical research (KMplot): development and implementation. J Med Internet Res. 2021;23(7):e27633.
9. Liu CJ, Hu FF, Xie GY, Miao YR, Li XW, Zeng Y, Guo AY. GSCA: an integrated platform for gene set cancer analysis at genomic, pharmacogenomic and immunogenomic levels. Brief Bioinform. 2023;24(1):bbac558.
10. Li T, Fu J, Zeng Z, Cohen D, Li J, Chen Q, Li B, Liu XS. TIMER2.0 for analysis of tumor-infiltrating immune cells. Nucleic Acids Res. 2020;48(W1):W509–14.
11. Modhukur V, Iljasenko T, Metsalu T, Lokk K, Laisk-Podar T, Vilo J. MethSurv: a web tool to perform multivariable survival analysis using DNA methylation data. Epigenomics. 2018;10(3):277–88.
12. Colaprico A, Silva TC, Olsen C, Garofano L, Cava C, Garolini D, Sabedot TS, Malta TM, Pagnotta SM, Castiglioni I, et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016;44(8):e71.
13. Zeng D, Ye Z, Shen R, Yu G, Wu J, Xiong Y, Zhou R, Qiu W, Huang N, Sun L, et al. IOBR: multi-omics immuno-oncology biological research to decode tumor microenvironment and signatures. Front Immunol. 2021;12:687975.
14. Ramos M, Geistlinger L, Oh S, Schiffer L, Azhar R, Kodali H, de Bruijn I, Gao J, Carey VJ, Morgan M, et al. Multiomic integration of public oncology databases in Bioconductor. JCO Clin Cancer Inform. 2020;4:958–71.
15. Yoshihara K, Shahmoradgoli M, Martinez E, Vegesna R, Kim H, Torres-Garcia W, Trevino V, Shen H, Laird PW, Levine DA, et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun. 2013;4:2612.
Funding
This research was funded by Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology (2021141 to XW).
Author information
Contributions
XW wrote the software and the manuscript. CL prepared the figures. The authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability and requirements
- Project name: TCGAplot
- Project home page: https://github.com/tjhwangxiong/TCGAplot
- Operating system(s): Platform independent
- Programming language: R
- License: MIT
- Other requirements: R 2.1.0 or higher
- Any restrictions to use by non-academics: license needed
Competing interests
The authors declare that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
The gene lists for “stromal signature” and “immune signature”.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Liao, C., Wang, X. TCGAplot: an R package for integrative pan-cancer analysis and visualization of TCGA multi-omics data. BMC Bioinformatics 24, 483 (2023). https://doi.org/10.1186/s12859-023-05615-3
DOI: https://doi.org/10.1186/s12859-023-05615-3