Background

There is currently great interest in identifying the biological mechanisms for the acquisition of motility and invasiveness in cancer. It has been hypothesized [1, 2] that they often involve some form of epithelial-mesenchymal transition (EMT), significant involvement of the tumor microenvironment [3, 4] and the presence of activated fibroblasts in the "reactive" desmoplastic stroma of tumors, referred to as "cancer associated fibroblasts" (CAFs) [5, 6].

A study [7] of serous papillary ovarian carcinomas, comparing the gene expression profiles of primary vs. omental metastatic tumors, identified 156 differentially expressed genes. To investigate the significance of these genes in an independent rich data set we performed hierarchical clustering, using only these genes, on The Cancer Genome Atlas (TCGA) gene expression data set consisting of 377 ovarian cancer samples containing staging information. The resulting heat map revealed a prominent block of about 100 highly overexpressed genes in 94 samples (Additional file 1). Remarkably, we found that none of the 41 samples from tumors of stages IIIb and below were among the 94 samples. This cannot be due to chance (p = 4 × 10-6), leading to the hypothesis that coordinated overexpression of these genes implies that the tumor has progressed to at least stage IIIc.

To further validate this hypothesis and test if similar versions apply to other cancers, we developed a computational technique identifying in an unbiased manner coordinatedly overexpressed genes associated with a phenotype (such as transition to a particular stage). Our results consistently rediscover the same "core" signature of overexpressed genes, confirming the hypothesis. We found that the signature is present in multiple cancers, each of which has its own features involving additional genes, but the core signature is common.

As evidenced by the presence of FAP (fibroblast activation protein) and ACTA2 (actin, alpha 2, also known as α-SMA, alpha-smooth muscle actin) in the set of overexpressed genes, the signature suggests a variant of stromal desmoplastic reaction. As further evidence, it is known ([6], p. 546) that activated fibroblasts (myofibroblasts) construct the desmoplastic stroma through the secretion of large amounts of collagen, fibronectin (FN1) and proteoglycans, and they secrete various proteases such as urokinase plasminogen activator (PLAU) and matrix metalloproteinases (MMPs). This list precisely describes most of the genes appearing in the signature, and its precise composition (e.g., having COL11A1 as the prominent collagen, MMP11 as the prominent metalloproteinase, and CDH11 as the prominent cadherin) points to a particular variant of CAFs, to which we refer as "metastasis associated fibroblasts" (MAFs). Indeed, the resulting proteolytic remodeling at the invasive edge of the tumor is thought to facilitate the initial invasive stage of the metastatic process by "excavating passageways" ([6], p. 621) through the extracellular matrix for the cancer cells to go through. Accordingly, in the following we refer to the corresponding gene expression signature and biological mechanism as "the MAF signature" and "the MAF mechanism," respectively.

Methods

List of data sets

The list of data sets in the paper is given in Table 1. They were identified by searching for rich data sets focused on a specific cancer in two public databases, The Cancer Genome Atlas and the Gene Expression Omnibus data depository. Furthermore, for the data sets initially used to infer the gene signature we required that they have well annotated staging information associated with the samples and that they contain a significant number of samples in both lower and higher stages so that we could compare the expression profiles across stages.

Table 1 Lists of Data sets in the paper

Extreme Value Association (EVA)

Since we aim to discover a set of genes that are coordinately overexpressed only in a subset of the "high stage" samples, we developed a special measure of association between the gene and the binary ("high stage" vs. "low stage") phenotype that naturally fits this description, ignoring the expression levels of the genes outside the region of overexpression, which we call "extreme value association" (EVA). The same measure can identify coordinately silenced genes, but we did not find any such genes across various cancer types.

The EVA metric is the minimum p-value of biased partitions over all subsets of samples with highest expression values of the gene. In other words, suppose that there are totally M samples, out of which N are "low stage" and M - N are "high stage," and we select the m samples with the highest gene expression values, out of which n are "low stage" and m - n are "high stage." Under the assumption that gene expression values are uncorrelated with the phenotype, the probability that there will be at most n "low stage" samples among the selected m samples is given by the cumulative hypergeometric probability h(xn; M, N, m). The EVA metric is then defined as equal to -log10 of the minimum of these probabilities over all possible choices of m, in which m ranges from 1 to M (note that n depends on m). For example, assume that there are 250 high-stage samples and 50 low-stage samples for a total of 300 samples. Furthermore, assume that the 100 samples with the highest values of a particular gene contain 99 high-stage samples and one low stage sample. In that case, h(x ≤ 1;300,50,100) can be evaluated using the MATLAB function hypercdf (1,300,50,100) = 5 × 10-9, resulting in the EVA metric for that gene of at least -log10(5 × 10-9) = 8.3. If the 101th sample is also high-stage, then the EVA metric of the gene will be even higher. Note that, once the highest value is reached, the sorting arrangement of the remaining samples is irrelevant, reflecting the hypothesis that only the extreme values are associated with the phenotype. Figure 1 shows the values of the cumulative hypergeometric probability for the COL11A1 gene using the TCGA ovarian cancer data set and the staging threshold for the definition of the binary phenotype set between IIIb and IIIc. The maximum (8.31) occurs when m = 133.

Figure 1
figure 1

Evaluation of the EVA metric for gene COL11A1 in the TCGA ovarian cancer data set.

We then developed a mechanistic and unbiased (only dependent on the phenotype) algorithm, which, when given a gene expression data set for a number of samples labeled "high stage" or "low stage," leads to a selection of genes that are coordinatedly overexpressed only in high-stage samples, ignoring the effect of the rest, thus precisely reflecting the observed phenomenon. We first select the top 100 genes that rank highest according to the EVA metric criterion using a mixture analysis (selecting the minimum p-value) of both overexpressed and silenced genes. Using this set of genes only, we perform k-means clustering with gap statistic [8]. At that step, if indeed the genes are coordinately overexpressed, they will align well in the heat map. This leads to the selection of the samples belonging to the cluster most associated with the high/low stage phenotype - call this the set of "EVA-based samples." Next, we define a "clean" MAF phenotype, contrasting the samples that are: (a) both "EVA based" and "high-stage" against (b) the samples that are both "non EVA-based" and "low stage." If the number of samples is sufficiently large, this "clean" phenotype provides the sharpest way by which we can identify the genes that are most associated with the observed phenomenon of metastasis-associated coordinated overexpression. We then rank the genes and compute their multiple-test-corrected p-values using a Wilcoxon rank-sum test using the "clean" phenotype and select the genes for which p < 10-3 after Bonferroni correction. Finally, we find the intersection of these selected gene sets over all cancer expression data sets and rank them in terms of fold change.

For a data set with n samples and m genes, the EVA algorithm computes cumulative hypergeometric distribution probabilities nm times. This can be computationally intensive, so we devised a low-complexity implementation algorithm described in Additional file 2.

Mutual Information and Synergy

Assuming that two variables, such as the expression levels of two genes G 1 and G 2, are governed by a joint probability density p 12 with corresponding marginals p 1 and p 2 and using simplified notation, the mutual information I(G 1 ; G 2) is a general measure of correlation and is defined as the expected value E { log p 12 p 1 p 2 } . The synergy of two variables G 1, G 2, with respect to a third variable G 3 is [9] equal to I(G 1, G 2; G 3) - [I(G 1; G 3) + I(G 2; G 3)], i.e., the part of the association of the pair G 1, G 2 with G 3 that is purely due to a synergistic cooperation between G 1 and G 2 (the "whole" minus the sum of the "parts"). We used a 3rd order B-spline-based mutual information estimator dividing the observation space into six equally spaced bins in each dimension.

P-value estimation for mutual information and synergy

We applied a permutation-based approach accounting for multiple test correction: We did 100 permutation experiments of the class labels, saving the corresponding 100 highest values after doing exhaustive search in each permutation experiment. Using the set of these 100 highest value scores, we obtained the maximum likelihood estimates of the location parameter and the scale parameter of the Gumbel (type-I extreme value) distribution, resulting in a cumulative density function F. The p-value of an actual score x 0 is then 1 - F(x 0) under the null hypothesis of no association with phenotype. Similarly, for the synergistic pair, we found the top-scoring synergy in 100 data sets that were identical to the original except that the phenotype values were randomly permuted on each, and the top permuted synergy scores were modeled, as above, with the Gumbel distribution.

Results

Identification of MAF signature genes from staging information in four data sets

We performed the EVA algorithm on four rich gene expression data sets, two from ovarian cancer, the one from TCGA and another one [10], and two from colorectal cancer [11, 12] accompanied by staging information. We performed the algorithm multiple times using the different possible cutoff thresholds defining the phenotype, finding that, in all cases, it is defined as exceeding stage IIIb in each of the ovarian data sets and stage I in each of the colorectal data sets. Interestingly, several among the "metastasis-associated genes" identified in [7] as present in omental metastasis of ovarian cancer were also identified in [10] as belonging to a subtype of ovarian cancer characterized by extensive desmoplasia, which contains the MAF signature.

Remarkably, we found that there were 137 genes (listed in Additional file 3), each of which had Bonferroni-corrected p < 10-3 in all four data sets. Table 2 shows a list of these genes with average log fold change greater than 2. The top ranked gene was COL11A1 (probe 37892_at). Again, these genes were found purely as a result of their association with the staging phenotype in all four cancers. Gene Ontology enrichment testing of these genes identified cell adhesion, extracellular matrix and collagen fibril organization. It turns out that use of other standard correlation measures instead of the EVA measure in the same algorithms leads to the same results, because the overall correlation of these genes with the phenotype is strong merely as a result of the genes' overexpression in some high-stage samples alone. The EVA method has the additional advantage of providing an estimate of the size of that subset of high-stage samples.

Table 2 Top-ranked genes associated with carcinoma stage in four ovarian and colorectal cancers

We then did an extensive literature search aimed at identifying other studies in which at least some of these genes were identified as differentially expressed in various stages of other cancers. We even scrutinized studies in which none of the genes were mentioned in the main text, by looking at their supplementary data and re-ranking particular columns of genes in terms of their fold changes, from genes containing numerous genes. Although most of our results were negative, we were able to produce cancer gene lists with striking similarity (Table 3) to our own list (Table 2) in three studies of breast [13], gastric [14] and pancreatic [15] cancer.

Table 3 Gene lists produced from information provided in the corresponding papers for other cancers

Specifically, a breast cancer study [13] comparing ductal carcinomas in situ (DCIS) with invasive ductal carcinoma (IDC) had a list of genes upregulated in IDC that had similarities to those we had identified, and the top-ranked gene was again COL11A1 (probe 37892_at) with log fold change of 6.50, while the next highest (4.08) corresponded to another probe of COL11A1, followed by a probe of COL10A1. Second, a study [14] comparing early gastric cancer (EGC) with advanced gastric cancer (AGC) - roughly separating stages I and II - also identified a similar differentially expressed gene list of which again COL11A1 (probe 37892_at) was at the top (log fold change: 4.26) followed by COL10A1 and FAP. Third, a study of pancreatic ductal adenocarcinoma [15] identified a list of genes overexpressed in whole tumor tissue versus normal pancreatic tissue, in which COL11A1 (probe 37892_at) is again prominent and the top entry (log fold change 5.15) was INHBA, supportive of our hypothesis of activin A induced TGF-β signaling. The presence of the MAF signature in the latter study indicates that pancreatic cancer had already become invasive in most cases before the biopsy. The prominent desmoplastic reaction in pancreatic cancers (which contains the MAF signature) has recently been increasingly recognized as a "foe" [16] that could lead to new therapeutic strategies targeting stromal cells to inhibit cancer. Finally, we realized that COL11A1 has been identified as a potential metastasis-associated gene in other types of cancer as well, such as in lung [17], and oral cavity [18], suggesting that the MAF signature may be present in a subset of high stage samples of most if not all epithelial cancers.

Using COL11A1 as proxy for the MAF signature in the absence of staging information

In those cases as well as in our own findings, there was prominent presence of COL11A1 (probe 37892_at). This remarkable consistent strong association of COL11A1 with the staging phenotype (specific to each cancer type) suggests that it could be used as a "proxy" of the MAF signature. This would allow us to improve on the gene list of Table 2 by making use of numerous publicly available gene expression data sets of cancers of many types, even without any staging information, as long as the MAF signature is present in a sizeable subset of them, aiming at finding the "intersection" of the associated factors in these sets, revealing the "core" of the MAF biological mechanism.

As a first step for this task, we identified the genes that are consistently highest associated with COL11A1. Additional file 4 shows a listing of genes in nine cancer data sets, while Table 4 shows an aggregate list of the top 100 genes ranked in terms of their association (mutual information) with COL11A1. The list is similar to the phenotype-based gene ranking (Table 2). In addition to a few collagens of type XI, X, V, and I, the top ranked genes are thrombospondin-2 (THBS2), inhibin beta A (INHBA), leucine rich repeat containing 15 (LRRC15), versican (VCAN), fibroblast activation protein (FAP), and matrix metallopeptidase 11 (MMP11) aka stromelysin 3. The presence of FAP indicates a general desmoplastic reaction and is not, by itself, sufficient for inferring the MAF signature.

Table 4 Aggregate list of top genes associated with COL11A1

Furthermore, contrary to all other genes, COL11A1 was uniquely not associated with any of these genes in noncancerous samples, further supporting the hypothesis that it can be used as a proxy for the MAF signature. Our results indicate that THBS2 and INHBA, top ranked in Table 4 except for collagens, are the most important players in the MAF mechanism. Figure 2 demonstrates this striking coexpression in data sets of cancer samples, but not in noncancerous samples, in the form of scatter plots. We have consistently validated this behavior in the cancerous and noncancerous data sets we tested.

Figure 2
figure 2

Scatter plots confirming coexpression of INHBA, THBS2, COL11A1 in cancerous but not healthy data sets. The expression of COL11A1 is color-coded). As shown, COL11A1 is consistently coexpressed with INHBA and THBS2 in cancerous, but not in noncancerous samples, two of which are shown on the right side.

As a second step, we identified gene pairs that are highest associated with COL11A1 jointly, but not individually, and therefore they would not appear in the previous list. For this task we ranked gene pairs according to their synergy [9] with COL11A1, using the computational method in [19], which could further facilitate biological discovery. For example, the scatter plots in Figure 3 show that genes ECM2 and TCF21 are jointly, but not individually, strongly associated with COL11A1 (p < 10-6, see Methods) in the two ovarian cancer data sets. Such findings are useful for developing biological hypotheses, e.g. in this particular case they suggest that in ovarian cancer the extracellular matrix protein 2 is associated with the MAF signature only when the TCF21 gene (a known mesenchymal-epithelial transition mediator) is downregulated.

Figure 3
figure 3

Example of a synergistic pair of genes in two ovarian cancer data sets.

The MAF signature exists even in non-epithelial cancers. Indeed, we confirmed that neuroblastoma also carries the MAF signature consistently associated with high stage: As shown in Additional file 5, none of 21 stage I samples have the signature (p = 4 × 10-4), based on the genes highest associated with COL11A1.

MicroRNAs and Methylated sites

We only had miRNA and methylation data available for the TCGA ovarian data set. Using as measure the mutual information with COL11A1, we found many statistically significant miRNAs, among them hsa-miR-22 and hsa-miR-152, as well as differentially methylated genes, such as SNAI1 and PRAME, suggesting a particularly complex biological mechanism (correlation with the MAF phenotype led to essentially the same lists with lower significance). Table 5 contains a list of the miRNAs, while Table 6 contains a list of the methylated genes (multiple test corrected p < 10-16 in both cases, see Methods). SNAI1 (snail) methylation is particularly important as the gene is known as one of the most important EMT-related transcription factors. Instead, the strongest MAF-associated transcription factor is AEBP1. Many of the other EMT-related transcription factors, such as SNAI2, TWIST1, and ZEB1 are often overexpressed in the MAF signature, but SNAI1 is not (and, at least in ovarian carcinoma in which we have methylation data, this is due to its differentially methylated status). We believe that the lack of SNAI1 expression is a key distinguishing feature of the MAF signature, in which we observed neither SNAI1 overexpression nor CDH1 (E-cadherin) downregulation, at least on the mRNA level.

Table 5 Top-ranked miRNAs in MAF signature in the TCGA ovarian cancer data set
Table 6 Top-ranked methylation sites in MAF signature in the TCGA ovarian cancer data set

Drug response

Significantly, we also found that, at least in ER negative breast cancer, the MAF signature is associated with resistance to neoadjuvant FEC. This was demonstrated in [20] where a stromal "metagene" signature of 50 genes was defined based on DCN (decorin). Although some of the MAF key genes (such as COL11A1 and THBS2) were not among these 50, the metagene signature used in that study has a significant intersection with the MAF signature and was found resistant to neoadjuvant chemotherapy. To compare the drug response performance of the DCN metagene set with that of the MAF signature, we used the top 50 genes of the MAF signature in terms of their association with COL11A1 from various cancers (the top genes shown in Table 4) in the same data set. Shown in Additional file 6 and Additional file 7 are two clustering heat maps of expression profiles, one with the MAF signature genes and one with the DCN metagene set, respectively. High expression of the MAF signature genes correlates with lack of response to therapy (identifying 14 samples out of which 12 have lack of response) more than high expression of the DCN metagene set (identifying 12 samples out of which 10 have lack of response), suggesting that the presence of the core genes of the MAF signature provide at least as good indicator of resistance to neoadjuvant chemotherapy. The reason for the drug resistance may simply be that the invasiveness sensed by the MAF signature does not allow the size or extent of the cancer to be reduced prior to surgery.

Discussion

A direct clinical application of these findings is the development of a high-specificity invasion-sensing biomarker product detecting coordinated overexpression of a few top-ranked genes, such as COL11A1, INHBA, and THBS2, as shown in the scatter plots of Figure 2. A positive result in seemingly low-stage primary tumors will indicate that the disease has obtained the stromal signature and thus has already reached an invasive stage. As described above, the same product can also be used to predict resistance to neoadjuvant chemotherapy.

Of course, the most significant clinical application would be to develop metastasis-inhibiting therapeutics using targets deduced from the biological knowledge provided by the MAF signature. Our top ranked genes strongly suggest that they are produced by myofibroblasts or myofibroblast-like cells activated by activin A-induced TGF-β signaling and leading to some form of altered proteolysis [21], which results in extracellular matrix remodeling. Supporting this hypothesis are the facts that both INHBA and THBS2 are involved in TGF-β signaling: Activin A (INHBA homodimer) is a TGF-β superfamily member (ligand) and THBS2 inhibits activation of TGF-β by THBS1, which is also present in the MAF signature. Remarkably, activin A is already known to facilitate fibroblast-mediated collagen gel contraction [22]. The role of gene LRRC15 (aka LIB) appears important but unclear, though it has already been recognized as promoting migration through the extracellular matrix [23]. Versican (VCAN) is an extracellular matrix proteoglycan already known to play a role in metastasis, while MMP11 is one of several matrix metalloproteinases involved in the breakdown of extracellular matrix.

Although each of the MAF signature molecules could serve as a potential therapeutic target, the hypothesis that activin A signaling is at the heart of the MAF mechanism immediately suggests that follistatin (activin-binding protein) could serve as a metastasis inhibitor, which is exactly what recent research [24] indicates. Specifically, lung cancer cell lines transfected with follistatin and injected into immunodeficient mice markedly inhibited metastasis compared with non-transfected cell lines, but the authors of the study recognize that the role of follistatin in cancer metastasis is totally unknown [25]. Our work provides an explanation and suggests that the same could be true for other cancers as well. Further support is provided by the fact that follistatin virtually abolished the fibroblast-mediated collagen gel contraction mentioned earlier [22].

There are several reasons that the core MAF signature has not yet been discovered as a multi-cancer metastasis-associated signature. First, it is essential to identify the correct phenotypic staging threshold recognizing that the signature only exists in a subset of tumors that exceed that particular stage. Indeed, if the threshold in breast cancer was put between stage I and stage II, or between stage II and stage III, rather than between in situ and stage I, the signature would not be apparent. Second, each cancer type may have its own additional features accompanying the MAF signature. For example, in ovarian cancer it is accompanied by sharp downregulation of genes COLEC11, PEG3 and TSPAN8, which is not the case in other cancers. Indeed, the main contribution of our work is the identification of the common multi-cancer "core" signature, from which a universal metastasis-associated biological mechanism can be identified. Third, the MAF signature may be reversible, perhaps as a result of the disappearance of many of the stromal cells in the mature desmoplastic stroma when it is replaced by "acellular" matrix [6]. The presence of the signature in high-stage samples may even paradoxically be associated with longer survival if its reversal is required for further distant metastases (see below).

An important topic of further research is the determination of the precise biological event of interaction of cancer cells with the microenvironment that gives rise to the stromal MAF signature and associated invasiveness. Because of the recognized similarities with the mechanisms of wound healing [26], it is likely that this event uses existing wound healing response pathways. For example, it appears to occur very early in breast cancer, late in ovarian cancer, and never in glioblastoma (which is reasonable, because glioblastoma metastasizes extremely rarely). The late appearance of the MAF signature in ovarian cancer and its presence in omental metastases can be explained by the fact that ovarian cancer initially progresses by disseminating locally across mesothelial surfaces and that, contrary to hematogenously metastasizing tumors, initial metastasis is probably carried by the physiological movement of peritoneal fluid to the peritoneum and omentum [27].

Several of the top-ranked genes in the MAF signature (such as thrombospondins, decorin, INHBA itself) are known to be potent anti-angiogenesis mediators. The reversal of the MAF signature would thus facilitate further metastatic dissemination to distant sites. In other words, (a) the desmoplastic MAF signature and (b) angiogenesis, are two independent biological events. The former appears to be based on activin A signaling, as several of the MAF proteins in addition to INHBA are also known inhibitors of the standard TGF-β ligand. The reversal of the MAF signature would allow the standard ligand to take over in TGF-β signaling, and may thus facilitate further metastasis. These observations provide explanations for the seemingly contradictory observed roles of TGF-β signaling inhibiting early cancer but facilitating metastasis.

The possible reversibility of the MAF signature leads to the intriguing hypothesis that perhaps all metastases have, at some point temporarily been there, which explains why we only observe it in a subset of them. This would be particularly exciting, because in that case any metastasis-inhibiting therapeutic intervention targeting the MAF mechanism would be widely applicable to low-stage tumors.

Conclusions

In conclusion, we have shown that, using purely computational analysis of publicly available biological information, systems biology has revealed the core of a multi-cancer metastasis-associated gene expression signature. In the near future, a vast amount of additional information will become available, including next generation sequencing, miRNA and methylation information for many cancers, which will allow additional computational research building on this work and clarifying the details of the underlying invasion-associated complex biological process.