Integrated proteotranscriptomics of breast cancer reveals globally increased protein-mRNA concordance associated with subtypes and survival
Transcriptome analysis of breast cancer discovered distinct disease subtypes of clinical significance. However, it remains a challenge to define disease biology solely based on gene expression because tumor biology is often the result of protein function. Here, we measured global proteome and transcriptome expression in human breast tumors and adjacent non-cancerous tissue and performed an integrated proteotranscriptomic analysis.
We applied a quantitative liquid chromatography/mass spectrometry-based proteome analysis using an untargeted approach and analyzed protein extracts from 65 breast tumors and 53 adjacent non-cancerous tissues. Additional gene expression data from Affymetrix Gene Chip Human Gene ST Arrays were available for 59 tumors and 38 non-cancerous tissues in our study. We then applied an integrated analysis of the proteomic and transcriptomic data to examine relationships between them, disease characteristics, and patient survival. Findings were validated in a second dataset using proteome and transcriptome data from “The Cancer Genome Atlas” and the Clinical Proteomic Tumor Analysis Consortium.
We found that the proteome describes differences between cancerous and non-cancerous tissues that are not revealed by the transcriptome. The proteome, but not the transcriptome, revealed an activation of infection-related signal pathways in basal-like and triple-negative tumors. We also observed that proteins rather than mRNAs are increased in tumors and show that this observation could be related to shortening of the 3′ untranslated region of mRNAs in tumors. The integrated analysis of the two technologies further revealed a global increase in protein-mRNA concordance in tumors. Highly correlated protein-gene pairs were enriched in protein processing and disease metabolic pathways. The increased concordance between transcript and protein levels was additionally associated with aggressive disease, including basal-like/triple-negative tumors, and decreased patient survival. We also uncovered a strong positive association between protein-mRNA concordance and proliferation of tumors. Finally, we observed that protein expression profiles co-segregate with a Myc activation signature and separate breast tumors into two subgroups with different survival outcomes.
Our study provides new insights into the relationship between protein and mRNA expression in breast cancer and shows that an integrated analysis of the proteome and transcriptome has the potential of uncovering novel disease characteristics.
Keywords: Breast cancer, Proteomics, Gene expression profiling, Systems analysis, Transcription, Survival, African-American
CGDS: Cancer Genomics Data Server
CPTAC: Clinical Proteomic Tumor Analysis Consortium
FDR: False discovery rate
GEO: Gene Expression Omnibus
GLM: Generalized linear model
GSEA: Gene set enrichment analysis
HER2: Human epidermal growth factor receptor 2
KEGG: Kyoto Encyclopedia of Genes and Genomes
LC-MS: Liquid chromatography-mass spectrometry
NMF: Non-negative matrix factorization
PCA: Principal component analysis
RPPA: Reverse phase protein array
TCGA: The Cancer Genome Atlas
TNBC: Triple-negative breast cancer
UMD: University of Maryland
Gene expression profiling of breast tumors has led to the landmark discovery of disease subtypes and novel biomarkers for therapy response and disease survival [1, 2, 3, 4]. However, it remains a challenge to define breast cancer biology solely based on gene expression and without knowledge of related changes in the proteome because proteins are key functional drivers of biology and common targets of anticancer drugs. Recent technological advances in mass spectrometry (MS) have laid the groundwork for large-scale characterization of protein expression in human tissues using either untargeted or targeted approaches for protein quantitation [5, 6, 7, 8]. System-wide proteomics of estrogen receptor (ER)-positive disease revealed insights into disease development that were not apparent from mRNA-based studies. While untargeted proteomics has advanced our knowledge of breast cancer biology [5, 6, 8, 9, 10, 11, 12, 13] and other cancers [14, 15, 16], a more systematic investigation of the relationship between the tumor proteome and transcriptome, here termed proteotranscriptomic analysis, has the potential to uncover novel molecular alterations in breast cancer biology. To this end, we hypothesized that proteotranscriptomic integration will reveal novel disease characteristics beyond a single technology and applied an integrated analysis of proteomic and transcriptomic data that we jointly collected from human breast tumors and adjacent non-cancerous tissues from patients with survival follow-up. Major differences between this and previous proteome studies are the inclusion of adjacent non-cancerous tissues and of African-American patients, and our ability to assess relationships with patient survival. Our study revealed that the proteome and transcriptome describe a partially different tumor biology and that proteins are more commonly upregulated in tumors than the corresponding transcripts. 
Moreover, our data describe a pathway-centric increase in the concordance between protein and transcript levels that is associated with more aggressive disease and decreased patient survival. These findings were corroborated using proteome and transcriptome data for 404 breast tumors from “The Cancer Genome Atlas” (TCGA) and 77 breast tumors from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) [13, 17].
Breast cancer patients were recruited between 1993 and 2003, as described previously [18, 19]. Samples of fresh-frozen tumor and adjacent non-cancerous tissue were prepared by a pathologist immediately after surgery and stored at − 80 °C. Clinical and pathological information was obtained from medical records and pathology reports. Details on patient recruitment, specimen collection, and tumor classification are provided in Additional file 1. The collection of biospecimens and the clinical and pathological information was approved by the University of Maryland (UMD) Institutional Review Board (protocol #0298229). The research was also reviewed and approved by the NIH Office of Human Subjects Research Protections (OHSRP #2248).
Mass spectrometry-based analysis of the proteome
Frozen human tissue samples were pulverized under liquid nitrogen, and extracts for mRNA and protein isolation were prepared. Extracted proteins were digested with trypsin and analyzed using an untargeted MS analysis approach as described in Additional file 1. For the liquid chromatography (LC)-MS measurements, 17 fractions per sample were prepared, which generated about 1900 individual fractions from the 118 tissues that were subjected to the MS analysis. The obtained MS data were searched against the UniProt Homo sapiens database downloaded from the European Bioinformatics Institute website (ftp://ftp.ebi.ac.uk/pub/databases/integr8) using the Proteome Discoverer 2.0 software (Thermo Fisher Scientific) interfaced with the SEQUEST HT algorithm and filtered with Percolator to yield peptide identifications at the 1% false discovery rate (FDR) cutoff. We employed the Protein Scorer and Protein FDR Validator nodes to apply an additional 5% protein-level FDR. Up to two missed tryptic cleavage sites and oxidation of methionyl residues were allowed during this database search. The data were searched with a precursor ion tolerance of 1.4 Da and a fragment ion tolerance of 0.5 Da, and two levels of grouping were applied, one for peptide grouping and one for protein grouping. We selected the “strict maximum parsimony principle” option, and only the best ranked peptide-spectrum match (PSM) per spectrum was used for protein identification and grouping. To further reduce false-positive discovery, we considered a protein correctly identified only when at least two peptides in a tissue sample uniquely mapped to it. 
As a final filtering step, we calculated protein coverage across all samples (Additional file 2: Figure S1A) and found that the correlation between protein coverage and abundance is very high (rho = 0.97) when we remove those proteins from the analysis that are detected in fewer than 10% (n = 12) of the samples (Additional file 2: Figure S1B). By setting this 10% coverage cutoff (after the initial protein-level 5% FDR using the Proteome Discoverer 2.0 software), we removed the proteins that are difficult to quantify by our technology, leading to a total of 7141 quantified proteins in 118 tissues that we included in the analyses. This approach was validated by showing that the identified proteins in our study largely overlap with proteins identified in three other studies [8, 12, 13] (Additional file 2: Figure S2). The peptide spectral counts for each tissue are shown in Additional file 3: Table S1. The mass spectrometry proteomics data have been deposited with the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) under the dataset identifier PXD005692. To assess differential protein expression between tissues (e.g., tumor vs. non-cancerous tissue), we used the Bioconductor package DESeq2, which was shown to perform well in label-free MS proteomics. Using DESeq2, we estimated the size factor and median values for the ratios of the observed counts, controlled for count differences between samples, and monitored outlier samples using Cook’s distance (Additional file 2: Figure S1C). We then applied negative binomial generalized linear model (GLM) fitting and Wald statistics for significance testing. Furthermore, DESeq2 implements additional filtering that removes statistically insignificant associations, leading to the preferential removal of proteins with low counts and insignificant differences typically due to high dispersion. 
DESeq2 introduces rlog (https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization), which is calculated by fitting each protein to a GLM with a baseline expression (i.e., intercept only) and computing GLM data for each sample, shrunken with respect to the baseline, using the empirical Bayes procedure. rlog incorporates a prior on the sample differences and removes the dependence of the variance on the mean, particularly the high variance of the count data when the mean is low. After rlog normalization, we found that all samples have a very similar distribution for the transformed proteomic data, as shown in Additional file 2: Figure S1D. To compare the spectral count-based ranking of proteins in our study with the ranking of proteins in the Mertins et al. dataset, we plotted z-scaled, log-converted PSMs for each protein common to both studies.
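The 10% coverage cutoff described above can be illustrated with a short Python sketch. It is not the code used in the study: the function name `coverage_filter`, the dictionary layout, and the toy counts are hypothetical, and "detected" is assumed to mean a spectral count greater than zero. With 118 samples, 10% corresponds to 11.8, so a protein must be detected in at least 12 samples to be retained.

```python
def coverage_filter(counts, min_percent=10):
    """Keep proteins detected in at least min_percent of the samples.

    counts: dict mapping a protein ID to a list of spectral counts,
            one entry per sample (all lists have the same length).
    Assumption: a protein counts as "detected" when its count is > 0.
    """
    n_samples = len(next(iter(counts.values())))
    # Integer ceiling of min_percent * n_samples / 100 (12 for 118 samples).
    min_samples = (min_percent * n_samples + 99) // 100
    kept = {}
    for protein, row in counts.items():
        detected = sum(1 for c in row if c > 0)
        if detected >= min_samples:
            kept[protein] = row
    return kept, min_samples


# Hypothetical toy data: 3 proteins across 20 samples (not from the study).
toy_counts = {
    "P_A": [5] * 20,            # detected in every sample
    "P_B": [3, 1] + [0] * 18,   # detected in 2 of 20 samples (10%)
    "P_C": [2] + [0] * 19,      # detected in 1 of 20 samples (5%)
}
kept_proteins, threshold = coverage_filter(toy_counts)
print(sorted(kept_proteins), threshold)
```

Integer arithmetic is used for the threshold so that the cutoff does not depend on floating-point rounding.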
Gene expression microarray analysis
For gene expression profiling, mRNA was converted into cDNA using the Ambion WT Expression Kit for Affymetrix GeneChip Whole Transcript Expression Arrays (Life Technologies). After fragmentation and labeling using the GeneChip WT Terminal Labeling Kit from Affymetrix, ssDNA was hybridized onto Gene Chip Human Gene 1.0 ST Arrays (representing 28,869 genes) according to Affymetrix standard protocols (Santa Clara, CA). The probe cell intensity data were processed with the robust multi-array average (RMA) algorithm and analyzed with the Bioconductor limma R package. For more details, including pathway enrichment analysis, see Additional file 1. We only used protein-coding genes for pathway annotation. The top 20 pathways enriched for upregulated and downregulated protein-coding transcripts are shown in Additional file 4: Table S2.
Protein-mRNA correlation analysis
A protein-mRNA correlation analysis was performed using the regularized-logarithm transformation (rlog) value of the spectral counts and the normalized log2 probe intensity for mRNAs and is described in detail in Additional file 1. Briefly, we calculated the global Spearman correlation coefficient, rho, for 5677 and 3316 protein-mRNA pairs within tumors and non-cancerous tissues, respectively. Adjusted P values based on the analysis of 59 tumors and 38 non-cancerous tissues were computed by the Benjamini-Hochberg procedure. Correlation differences between the tumors and non-cancerous tissues were examined by ranking rho for each tissue in the two groups and then performing a Wilcoxon rank sum test. A KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment analysis was performed using the calculated Spearman correlation coefficients for all protein-mRNA pairs and applying the Kolmogorov-Smirnov test to assess how the concordance between protein/mRNA pairs associates with biological processes. Additional analyses, e.g., relationships with tumor subtypes and mRNA features, are described in Additional file 1.
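The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines of Python. This is an illustrative re-implementation of the standard step-up procedure, not the authors' code; the function name and the example P values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    For the P value with ascending rank r among m tests, the raw
    adjustment is p * m / r; a running minimum taken from the largest
    P value downward keeps the adjusted values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical P values for four protein-mRNA pairs (not from the study).
example_p = [0.01, 0.04, 0.03, 0.005]
example_adj = benjamini_hochberg(example_p)
print([round(a, 4) for a in example_adj])
```

Pairs whose adjusted P value falls below the chosen FDR threshold would then be reported as significantly correlated.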
Query of The Cancer Genome Atlas breast cancer
Publicly available TCGA/CPTAC breast cancer data were downloaded from the Cancer Genomics Data Server (CGDS, at http://www.cbioportal.org/public-portal). Processing of the data to obtain 70 annotated protein-mRNA pairs for 404 tumors is described in Additional file 1. TCGA/CPTAC proteomics breast cancer data were downloaded together with the corresponding gene expression via cbioportal. The PAM50 assignment for the tumors was obtained from the publicly available data provided by the TCGA analysis group.
Association between protein expression and shortening of the 3′UTR
We retrieved data from Xia et al., who described 382 genes with significant 3′UTR mRNA shortening in human breast tumors due to alternative polyadenylation based on the analysis of 106 TCGA breast tumor-adjacent tissue pairs.
Tumor proliferation score
We selected the array-based gene expression profiles of 11 cell cycle genes (BIRC5, CCNB1, CDC20, CEP55, MKI67, NDC80, NUF2, PTTG1, RRM2, TYMS, UBE2C) and summed them into a metagene score as a marker for tissue proliferation, as described previously.
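A minimal Python sketch of such a metagene score follows. The gene list is taken from the text; everything else is an assumption for illustration, including the function name, the input layout (one dictionary of normalized log2 intensities per sample), and the choice to sum the values without further scaling, which the text does not specify.

```python
# The 11 cell cycle genes named in the text.
PROLIFERATION_GENES = [
    "BIRC5", "CCNB1", "CDC20", "CEP55", "MKI67", "NDC80",
    "NUF2", "PTTG1", "RRM2", "TYMS", "UBE2C",
]


def proliferation_score(expression):
    """Sum the expression values of the 11 proliferation genes.

    expression: dict mapping gene symbol -> normalized log2 intensity
                for a single sample (hypothetical input layout).
    Raises KeyError if any of the 11 genes is absent.
    """
    missing = [g for g in PROLIFERATION_GENES if g not in expression]
    if missing:
        raise KeyError("missing genes: " + ", ".join(missing))
    return sum(expression[g] for g in PROLIFERATION_GENES)


# Hypothetical sample: every signature gene at 8.0, plus an unrelated gene.
sample = {gene: 8.0 for gene in PROLIFERATION_GENES}
sample["GAPDH"] = 12.0  # ignored by the score
score = proliferation_score(sample)
print(score)
```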
Non-negative matrix factorization
Non-negative matrix factorization (NMF) was used to describe tumor subgroups with different protein abundance profiles. We selected proteins with the highest variability among the proteins detected in the 59 tumors, using a median absolute deviation cutoff of 0.5, which resulted in 1000 proteins for clustering. We applied the consensus NMF clustering method implemented in the R package NMF (https://cran.r-project.org/web/packages/NMF/index.html) to identify tumor subgroups described by the proteome data. More details describing the tumor proliferation score and the NMF analysis, including the survival analysis, can be found in Additional file 1.
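The variability filter that precedes clustering can be sketched as follows. This Python example is an illustration under stated assumptions, not the authors' pipeline: it uses the unscaled median absolute deviation (R's `mad()` applies a scaling constant of 1.4826 by default, and the text does not say which variant was used), it treats the cutoff as strictly greater than 0.5, and the protein names and values are made up.

```python
from statistics import median


def median_absolute_deviation(values):
    """Unscaled MAD: median of absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def select_variable_proteins(abundance, cutoff=0.5):
    """Return protein IDs whose MAD across tumors exceeds the cutoff.

    abundance: dict mapping protein ID -> list of rlog-like values,
               one per tumor (hypothetical input layout).
    """
    return [
        protein
        for protein, values in abundance.items()
        if median_absolute_deviation(values) > cutoff
    ]


# Hypothetical toy data for five tumors (not from the study).
toy_abundance = {
    "P_variable": [1.0, 2.0, 3.0, 4.0, 5.0],  # MAD = 1.0
    "P_stable": [2.0, 2.1, 2.0, 1.9, 2.0],    # MAD = 0.0
}
selected = select_variable_proteins(toy_abundance)
print(selected)
```

Only the selected proteins would then be passed to the consensus NMF clustering step.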
All statistical tests were two-sided, and an association was considered statistically significant with P < 0.05. Statistical analyses were performed using the R software developed by the R Development Core Team at R Foundation for Statistical Computing and packages in Bioconductor. We used paired tests for the statistical analysis of differences in protein and gene expression between tumor-adjacent normal pairs. Survival analysis, e.g., Cox regression and Kaplan-Meier methods, was performed using the survival package of R. For correlation analysis, the R function “cor.test” was used. We applied the Spearman rank correlation test for protein-mRNA correlations because protein and mRNA abundances do not strictly follow a normal distribution or a linear relationship, consistent with previous observations. Reported Spearman coefficients were corrected for ties. Pearson’s correlation test was applied in the analysis of the relationship between tumor proliferation index and the global protein-mRNA concordance. Lastly, we applied a linear regression model to control for confounders in our correlation analyses of the protein-mRNA concordance with race/ethnicity or disease markers.
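One common way to obtain a tie-corrected Spearman coefficient is to assign tied observations their average rank and then compute Pearson's correlation on the ranks. The Python sketch below illustrates that approach for a single protein-mRNA pair; it is not the authors' R code, and the function names and example values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho as Pearson's r computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical protein (rlog) and mRNA (log2) values across five tumors.
protein = [4.1, 5.0, 5.0, 6.3, 7.2]
mrna = [8.0, 8.4, 8.9, 9.1, 9.5]
rho = spearman_rho(protein, mrna)
print(round(rho, 3))
```

Repeating this calculation for every protein-mRNA pair yields the distribution of rho values that the concordance analyses compare between tumors and non-cancerous tissues.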
Proteomic profiling of breast tumors and adjacent non-cancerous tissues
Increased correlation between protein and mRNA abundance is a disease-associated characteristic
High concordance between protein-mRNA pairs in tumors is associated with decreased breast cancer survival
Characteristics of proteins and mRNAs with increased protein-mRNA correlations in tumors
Proteomic subtypes and their association with Myc signaling
Here, we provide a comprehensive proteotranscriptomic analysis of breast cancer, including the analysis of tumor-adjacent non-cancerous tissue pairs and patients with survival follow-up, and generate a proteome data resource that includes tumors from African-American and European-American patients. Our data show that mRNA abundance incompletely predicts protein abundance in breast tumors and even less so in the adjacent non-cancerous tissue. Furthermore, the tumor proteome described disease pathways and subgroups that were only partially captured by the tumor transcriptome, consistent with the findings in the CPTAC breast cancer study. Notably, however, our work discovered an increased protein-mRNA concordance in breast tumors as a novel disease characteristic and prognostic factor that is associated with molecular subtypes, aggressiveness, and inferior patient survival.
To the best of our knowledge, a relationship between protein and mRNA abundance as a prognostic marker in cancer has not been previously reported. Concordances between protein-mRNA pairs in breast cancer cell lines have been examined, and a mean correlation score of ~0.5 for 94 pairs can be estimated from the study by Kennedy et al. A more recent study using reverse phase protein arrays reported a mean protein-mRNA correlation score of ~0.45 for key cancer proteins across several hundred cell lines and 0.35 for 47 breast cancer cell lines, which is comparable with the results from other cell-based studies [27, 32]. Thus, in cultured cells, the transcriptome is a moderate predictor of the proteome. TCGA/CPTAC investigators reported a mean protein-mRNA concordance score of 0.39 for breast tumors and 0.47 for colorectal tumors [13, 16]. The lower average concordance in breast tumors in TCGA/CPTAC and our study could be the result of tumor heterogeneity and variations in technology or could be due to the differences in growth rates between breast and colorectal cancer, as our data show that protein-mRNA concordance in breast tumors is linked to proliferation. The proteogenomic characterization of TCGA/CPTAC colorectal and breast tumors found, as we did, that genes encoding metabolic functions tend to show high protein-mRNA correlations [13, 16], indicating enhanced protein-mRNA coupling in cancer metabolism. This finding indicates that cancer cells require a stricter regulation of their metabolism to survive by linking transcription to immediate translation.
Others examined the proteome of breast cancer and characterized disease subtypes [11, 12, 13] or engaged in biomarker discovery [5, 10, 33]. In agreement with the findings by Geiger et al., we also noticed that two candidate prognostic markers, IDH2 and CRABP2, are aberrantly upregulated proteins in breast cancer including basal-like tumors (Additional file 7: Table S5). In contrast, few studies evaluated whether the cancer proteome provides signatures for classification into disease subtypes. In colorectal tumors, proteomic signatures described disease subtypes that partly overlapped with the transcriptome-defined subtypes for this disease, while Tyanova et al. reported that hierarchical clustering of breast tumors based on protein expression shows high diversity between tumor samples and no clear separation into the previously reported molecular subtypes [1, 2, 3, 4]. In their study, the proteome separated tumors into subgroups enriched for certain subtypes. We observed that the proteome separates human breast tumors into two main clusters with different survival outcomes, where one cluster was enriched for basal-like and the other for luminal tumors. Yet, further analyses showed that a Myc activation signature in breast tumors [19, 28] was the strongest classifier for these two tumor groups in our study, indicating a major influence of Myc signaling on the proteome in breast cancer. This observation is consistent with both the known function of the MYC oncogene as a regulator of ribosome biogenesis and enhancer of protein synthesis [29, 30] and the proteogenomic characterization of breast tumors by the TCGA/CPTAC Consortium. In the CPTAC study, K-means consensus-based clustering with global proteome data yielded a separation of tumors into three groups, termed basal-enriched, luminal-enriched, and stromal-enriched. 
While our study using NMF clustering did not distinguish stromal-enriched tumors as a third proteomic subtype, both studies associated the basal-enriched proteomic subtype with Myc activation.
Characterization of breast cancer with either proteome or transcriptome data may yield different insights into tumor biology. Proteins that are upregulated in tumors may associate with processes that are very different from those described by the analysis of upregulated mRNAs. These differences may be partly explained by mRNA properties, such as 3′UTR shortening, leading to increased protein expression without upregulation of mRNA expression in tumors, as our data show. We examined the potential differences between a proteome and transcriptome analysis using tumor-adjacent non-cancerous tissue pairs and jointly examined differentially expressed proteins and mRNAs and their pathway association. Recent studies have demonstrated the advantage of pathway-based analysis in assessing tumor biology [34, 35]. Our approach showed that upregulated proteins specifically cluster in processes related to protein synthesis and degradation and disease metabolism. Proteins, but not mRNA, captured ribosome synthesis and function as a disease-associated process and indicated an activation of infection-related signal pathways in basal-like and triple-negative tumors. The latter is of interest because currently, an infection-related process has not been linked to this subtype. Lastly, HER2-enriched tumors were characterized by a distinct downregulation of proteins in the coagulation cascade, which was not seen on the mRNA level. Thus, the analysis of the proteome can yield insights into tumor biology that are missed by a transcriptome analysis.
We applied an integrated analysis of proteomic and transcriptomic data that we jointly collected from human breast tumors and adjacent non-cancerous tissues. Our study revealed that the proteome describes differences between cancerous and non-cancerous tissue and disease subtypes that are not captured by the transcriptome. Proteins, but not mRNA, linked infection-related pathways to basal-like and triple-negative breast cancer. We also uncovered cross-omics correlations that we validated in additional datasets. Notably, our work describes an increased protein-mRNA concordance in breast tumors as a disease characteristic that is associated with molecular subtypes, aggressiveness, and inferior patient survival.
We thank Marjan Gucek, Director of the Proteomics Core Facility, National Heart Lung, and Blood Institute, NIH, Bethesda, MD, USA, for the helpful discussions of the proteome data and manuscript. We would also like to acknowledge Raymond Jones, Audrey Salabes, Leoni Leondaridis, Glennwood Trivers, Elise Bowman, and personnel at the University of Maryland and the Baltimore Veterans Administration and the Surgery and Pathology Departments at the University of Maryland Medical Center, Baltimore Veterans Affairs Medical Center, Union Memorial Hospital, Mercy Medical Center, and Sinai Hospital for their contributions in patient recruitment.
This research was supported by the Intramural Research Program of the NIH, NCI, Center for Cancer Research (ZIA BC 010887), and a NCI Director’s Innovation Award to Stefan Ambs.
Availability of data and materials
Gene expression data from this study can be found in GEO (http://www.ncbi.nlm.nih.gov/geo) under the accession numbers GSE39004/GSE37751. Affymetrix Platform: GPL6244 [HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version]. Peptide spectral counts and rlog values for each of the 7141 proteins and the 118 tissues in this study are cataloged in Additional file 3: Table S1, and the mass spectrometry proteomics data have been deposited with the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) in the PRIDE Archive proteomics data repository under the dataset identifier PXD005692. We used the R-based API to access the publicly available TCGA breast cancer data from the Cancer Genomics Data Server (CGDS, at http://www.cbioportal.org/public-portal) hosted by the Computational Biology Center at Memorial-Sloan-Kettering Cancer Center. The downloaded data included mRNA expression data (in z-score), RPPA data, and clinical information from this TCGA dataset. We also downloaded the publicly available CPTAC breast cancer proteomics dataset from the Cancer Genomics Data Server, consisting of high-quality proteome and corresponding gene expression data for 77 tumor samples with PAM50 classification.
MZ, TDV, and SA contributed to the conception and experimental design. WT, MZ, THD, and DP contributed to the methodology and data acquisition. WT, MZ, XWW, ER, and SA contributed to the analysis and interpretation. WT, MZ, XWW, ER, and SA contributed to the manuscript writing. All authors read and approved the final manuscript.
Ethics approval and consent to participate
The collection of biospecimens and the clinical and pathological information was approved by the University of Maryland Institutional Review Board for the participating institutions (UMD protocol #0298229). IRB approval of this protocol was then obtained at all institutions (Veterans Affairs Medical Center, Union Memorial Hospital, Mercy Medical Center, and Sinai Hospital, Baltimore, MD). The research was also reviewed and approved by the NIH Office of Human Subjects Research Protections (OHSRP #2248). All patients provided written informed consent to participate in the study, and the research conformed with the principles of the Declaration of Helsinki.
Consent for publication
The authors declare that they have no competing interests.
6. Liu NQ, Dekker LJ, Stingl C, Guzel C, De MT, Martens JW, Foekens JA, Luider TM, Umar A. Quantitative proteomic analysis of microdissected breast cancer tissues: comparison of label-free and SILAC-based quantification with shotgun, directed, and targeted MS approaches. J Proteome Res. 2013;12:4627–41.
9. Warmoes M, Jaspers JE, Xu G, Sampadi BK, Pham TV, Knol JC, Piersma SR, Boven E, Jonkers J, Rottenberg S, Jimenez CR. Proteomics of genetically engineered mouse mammary tumors identifies fatty acid metabolism members as potential predictive markers for cisplatin resistance. Mol Cell Proteomics. 2013;12:1319–34.
21. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B (Methodological). 1995;57:289–300.
32. de Sousa AR, Penalva LO, Marcotte EM, Vogel C. Global signatures of protein and mRNA expression levels. Mol BioSyst. 2009;5:1512–26.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.