Introduction

The WIPF1 gene encodes the WASP/WASL interacting protein family member 1 that plays an important role in the organization of the actin cytoskeleton [1, 2]. The WIPF1-encoded protein WIP binds to a region of Wiskott–Aldrich syndrome protein (WASP) that is frequently mutated in patients with Wiskott–Aldrich syndrome (WAS) [3, 4], and WIP mutations themselves lead to an immunological disorder resembling Wiskott–Aldrich syndrome [5]. WAS is an X-linked recessive disease that predisposes to leukemia and lymphoma. The WIP protein is essential for WASP synthesis and probably acts as its chaperone [6]. Disruption of the WASP-WIP interaction by hereditary mutations leads to a rounded cell surface on immune cells, a conversion that is thought to coincide with a diminished capability to form immune synapses and reduction of NK cell cytotoxicity [7]. WIP is important for podosome formation in macrophages and cellular fusions in flies [8, 9], stressing its universal role in the design of cell membrane remodeling. Apart from its expression in diverse immune cells, several human tissues exhibit WIP expression. Little is known about the expression of WIPF1 in solid tumors. However, WIPF1 expression levels influence morphology and migration of fibroblasts [10]. This prompted us to investigate the expression characteristics of WIPF1 in colorectal tumors with the aim to study its potential for prognosis.

A multitude of microarray studies have been carried out during the past decade to gain a better understanding of basic colorectal cancer (CRC) biology [11–18]. Other CRC microarray studies led to the discovery of informative gene sets for the prediction of the response to therapy or tumor recurrence [19–23], diagnosis of tumor stage [24–27], lymph node metastasis [28–30], or liver metastasis [31, 32]. Until now, cross-validation of diagnostic or prognostic signatures with independent data sets has hardly been performed for colorectal cancer. This is probably due to the fact that published signatures are only overlapping to a small degree and that they are difficult to reproduce when originating from different laboratories and platforms (for a discussion see Groene et al. [24]). In addition, until recently, data sets with sufficient patient information were lacking in public databases which hindered cross-validation of signatures from different studies.

Here, we describe the identification of a set of genes that is co-expressed with WIPF1. It was discovered through re-analysis of two public microarray data sets on clinical colorectal cancer specimens that were deposited in the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo). Whereas the first data set was used for discovery of WIPF1-coexpressed genes, the second data set served for validation of the expression correlations. Using a simple classification algorithm trained on microarray data of WIPF1 co-expressed genes from the two studies, we identified patients with characteristic expression of the WIPF1 coexpression module in three further microarray data sets with information about survival or relapse of patients: a colorectal cancer data set of our own, a breast cancer data set, and a glioblastoma data set. Strikingly, patients with low expression of the WIPF1 signature have the best prognosis in all three data sets, in total comprising more than 400 patients. Based on an analysis of the WIPF1 coexpression module in the context of literature-based gene networks, we identified plausible regulatory mechanisms responsible for lower WIPF1 module expression in patients with better prognosis.

Materials and methods

Sample acquisition

For the generation of our own microarray data set, 62 CRC patients undergoing elective standard oncological resection at the Department of General, Vascular, and Thoracic Surgery, Campus Benjamin Franklin, Charité, were prospectively recruited. Several clinical characteristics of the patients were recorded (see Supplementary Table 1). The study was approved by the local ethical committee, and informed consent was obtained from all patients.

Laser-capture microdissection

Pre-processing of frozen tissue blocks by laser-capture microdissection was essentially performed as described in our previous publications [13, 14, 24, 33]. Briefly, all cancer specimens were snap frozen within 20 min following excision. For laser-capture microdissection, frozen tissue specimens were serially cut into 6- to 8-µm-thick sections which were mounted on a sterile 2.5 mm membrane. Slides were fixed in 70% ethanol. The sections were briefly stained with hematoxylin and eosin, dehydrated in ethanol, and dried for 10–15 min using an exsiccator. The membrane was turned around and fixed with adhesive tape on the other sterile slide. First slides served as a template on which the areas of tumor or normal epithelium were marked. On the consecutive section, these areas were microdissected using a laser microdissection system (SL, Jena, Germany and P.A.L.M. Microlaser Technologies AG Bernried, Germany) and capture transfer films (Arcturus GmbH, Moerfelden-Walldorf, Germany). For molecular analysis, up to 100,000 cells or approximately 30–60 mm² of tissue section areas were pooled and collected in ice-cooled tubes containing 100 µl of 98% guanidine thiocyanate (GTC) buffer and 2% beta-mercaptoethanol.

Messenger RNA preparation and DNA chip hybridization

PolyA mRNA from the microdissected specimens was prepared using the PolyA-tract 1000 kit (Promega, Heidelberg, Germany) according to the manufacturer’s recommendations. For each sample, the cDNA synthesis and repetitive in vitro transcription were performed three times. The total amount of prepared mRNA from each sample was used. First strand cDNA synthesis was initiated using the Affymetrix T7-oligo-dT promoter–primer combination at 0.1 mM. The second strand cDNA synthesis was generated by internal priming. In vitro transcription was performed using the Megascript kit (Ambion, Huntington, UK) as recommended by Ambion. From the generated cRNA, a new first strand synthesis was initiated using 0.025 mM of a random hexamer as primer. After completion, the second strand synthesis was performed using the Affymetrix T7-oligo-dT promoter–primer combination. A second in vitro transcription was performed, and then the procedure was repeated one additional time. During the last in vitro transcription, biotin-labeled ribonucleotides were incorporated into the cRNA, as recommended by the Affymetrix protocol. Hybridization and detection of the labeled cRNA on the Affymetrix U133A Chip were performed according to Affymetrix standard protocol.

Microarray data pre-processing

Public expression data was downloaded from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/projects/geo/). In addition to our own data that was deposited in GEO with accession number GSE12945, we used four different data sets from this repository. The colorectal cancer data sets GSE5206 of the Aronow group (see Kaiser et al. [17]), GSE7208 of Ayers and co-workers [20] and of our own served us for discovery and validation of the WIPF1 signature in CRC. The GSE2034 data set of Wang et al. served us for assessment of predictive power of the WIPF1 signature for breast cancer patients [34]. The data set of Phillips et al. (GSE4271) served us for assessment of predictive power of the WIPF1 signature in high-grade glioma patients [35]. We refer to the original publications and the GEO database for patient and sample characteristics.

For our own microarray experiment, we used algorithms implemented in the freely available statistical software package R (http://www.r-project.org/) and its public package repositories CRAN (http://cran.r-project.org/) and the bioinformatics R package repository Bioconductor (http://www.bioconductor.org/). Raw expression data were condensed to probe set-wise intensity values using the RMA algorithm. For normalization across experiments, all colorectal cancer data sets were pre-processed using the same data transformations. If not already done, the raw data were log transformed. Independently of the original authors’ pre-processing, we quantile-normalized each data set on the probe set level. Then we filtered out the 10% of probesets with lowest median expression and the 10% of probesets with lowest variance to enrich informative probesets in an unbiased way. We restricted the further analysis to probesets passing this filter in each of the three data sets. Probeset annotations for Affymetrix expression microarrays (most importantly gene symbols) were retrieved from the Affymetrix web site (version 22). We condensed probeset signals on the gene symbol level by averaging across all remaining probesets of a gene. On the gene level, each data set was finally mean centered. Breast cancer and glioma data were processed in the same way, but here no additional probesets were filtered out due to low variance or mean expression signals in these data sets. Instead, for mapping of probeset expression intensities onto the gene level, we considered all probesets that were used for probeset-to-gene mapping during processing of the three CRC data sets. Using the applied normalization scheme, we intend to ensure that Affymetrix microarray data from the five different studies are comparable (we note that our study does not include a cross-platform comparison as all data sets were generated using Affymetrix gene chips).
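
The two central transformations described above, quantile normalization of samples and gene-wise mean centering, can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the R/Bioconductor code used in this study; the function names are hypothetical, and ties are broken by position, which is a simplification compared with established implementations.

```python
from statistics import mean

def quantile_normalize(columns):
    """Quantile-normalize equally long sample columns (illustrative sketch).

    Each value is replaced by the mean of the values that share its rank
    across all samples. Ties are broken by position (a simplification).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the r-th smallest value over all samples.
    reference = [mean(col[r] for col in sorted_cols) for r in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices by ascending value
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

def mean_center_rows(rows):
    """Subtract each gene's (row's) mean across samples."""
    centered = []
    for row in rows:
        m = mean(row)
        centered.append([v - m for v in row])
    return centered
```

After normalization every sample column has the same empirical distribution, and after centering every gene has mean zero within its own data set, which is the property the cross-study classification relies on.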

Expression data mining

Expression data mining was carried out using the statistics software R supplemented with diverse packages from the CRAN or Bioconductor projects. The correlation of two expression profiles was evaluated with Pearson correlation coefficients determined with the function cor.test in package stats. Using the same function, we determined p values for the significance of the deviation of the correlation coefficient from 0. The average expression profile of a multigene expression signature (e.g., the proliferation signature by Rosenwald et al. [36]), here denoted as signature centroid, was determined by averaging across signature genes for each patient. For tumor class discovery, we applied hierarchical clustering using the heatmap.2 function of the gplots package in R. The distance matrices for row and column clustering were determined using pairwise correlation distances (d = 0.5 (1 − cor(x,y))) of the gene-wise mean-centered expression intensities of genes and samples, respectively. For clustering, we used the complete linkage hierarchical clustering algorithm.
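
As an illustration of the quantities defined above, the following plain-Python sketch computes the Pearson correlation, the correlation distance d = 0.5 (1 − cor(x,y)), and a signature centroid. It is not the R code used in this study; the function names are hypothetical, and the p value calculation performed by cor.test is omitted.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_distance(x, y):
    """d = 0.5 * (1 - cor(x, y)); 0 for perfectly correlated, 1 for anti-correlated."""
    return 0.5 * (1.0 - pearson(x, y))

def signature_centroid(gene_rows):
    """Average the signature genes (rows) for each patient (column)."""
    n_genes = len(gene_rows)
    return [sum(col) / n_genes for col in zip(*gene_rows)]
```

The factor 0.5 scales the distance to the interval [0, 1], so that anti-correlated profiles are maximally distant rather than being treated as similar.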

In the preceding paragraph, we described how we identify tumor classes in a training data set using unsupervised analysis (clustering). Using classification (supervised analysis), we then attempted to detect the tumor classes that we identified on a primary (training) data set in tumors of secondary (test) data sets from independent studies. As a classification algorithm, we used k-nearest neighbor classification as implemented in the R package class (function knn with k = 9). The classifiers were trained on tumor expression profiles of the training data and then directly applied on test data sets. A prerequisite for our classifier to work properly on the test data (here solely external data sets from independent studies) is that training and test data are sufficiently normalized, which in our study should be ensured by log-transforming the expression values followed by gene-wise mean centering. Across-study normalization based on intra-study mean centering of log-transformed expression intensities from Affymetrix chips was already shown by Lusa et al. to be a pre-processing strategy that can be the basis of good classifier performance when the aim is to construct gene expression-based predictors for tumor classes across studies [37]. However, the authors also stated that best classifier performance can only be expected if the fraction of tumor classes in the different data sets is comparable. In our own validation of our methodology, we found that k-nearest neighbor-based predictors of estrogen receptor status in breast cancers based on an estrogen-responsive set of genes achieve prediction accuracies on external data sets of ~90% on average on gene-wise mean-centered expression data (based on four Affymetrix U133A microarray data sets, data available upon request, manuscript in preparation). Classifier performance was still at 76% when the tumor classes were not balanced (e.g., 1:8 in test data).
We note that we did not attempt to assess classifier accuracy using cross-validation on the primary data because we derived the tumor classes by data mining in the complete primary data set, so a numerical difference between the classes exists by construction. Therefore, the assessment of classifier performance using cross-validation could lead to a serious overestimation of classifier accuracy. The reason is that even if test cases in cross-validation are not used for classifier training, they were already included in the initial clustering analysis that led to the assignment of class labels, which violates a main principle of cross-validation: the independence of training and test data.
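
The classification step can be sketched as follows. This is an illustrative plain-Python version of k-nearest neighbor classification with Euclidean distance and majority voting, not the knn function of the R package class; the function name is hypothetical. It assumes that training and query profiles were gene-wise mean centered within their own data sets and list the same genes in the same order.

```python
from collections import Counter

def knn_predict(train_profiles, train_labels, query, k=9):
    """Assign the majority label among the k nearest training profiles.

    Uses squared Euclidean distance; with two classes and odd k the
    vote cannot tie.
    """
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(profile, query)), label)
        for profile, label in zip(train_profiles, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy example with two genes and k = 3 (k = 9 was used in the study).
train = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
         [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
labels = ["A", "A", "A", "B", "B", "B"]
predicted = knn_predict(train, labels, [0.05, 0.05], k=3)
```

Because the classifier compares raw distances between training and test profiles, its output is only meaningful if both data sets are on a comparable scale, which is why the mean centering step precedes it.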

Patient survival and relapse were visualized with Kaplan–Meier curves determined with the survfit function in the survival package. The logrank test as implemented in the coxph function of the survival package was used to assess the significance of differences in survival/relapse times between patient groups. The significance of association of continuous variables, here gene expression intensities, with survival/relapse was tested using Cox regression and Wald tests on the model coefficients and their variances as implemented in the coxph function. For survival analyses, we used robust estimates of Cox model coefficient variances (parameter robust=T). If not otherwise indicated, default parameter settings were used in the functions mentioned above.
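
For readers unfamiliar with the Kaplan–Meier estimate, the following plain-Python sketch shows how the survival curve is computed from follow-up times and event indicators. It is an illustration only and does not reproduce survfit (confidence intervals and the logrank test are omitted); the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    events[i] is True if the event (death or relapse) was observed at
    times[i] and False if the patient was censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve
```

Censored patients lower the number at risk without lowering the survival estimate, which is how the method uses incomplete follow-up.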

Literature-based gene networks were investigated using the MetaCore software from GeneGo (Saxony Road, #104, Encinitas, CA 92024, USA). Subnetworks of genes with functional links based on literature evidence were screened for enrichment of genes co-expressed with WIPF1. Those networks with significant enrichment of WIPF1 genes were further investigated for enrichment of Gene Ontology categories. The significance of enrichment of either user supplied gene lists (like the gene list of the WIPF1 signature) or gene lists associated with Gene Ontology (GO) terms in gene lists of literature subnetworks is given by hypergeometric p values (with the complete set of human genes with literature information as a reference set).
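
A hypergeometric enrichment p value of the kind mentioned above can be sketched as follows. The exact computation inside MetaCore is not documented here, so this plain-Python version is an assumption about the standard form of the test (upper tail, i.e., the probability of observing at least the given overlap by chance); the function and parameter names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(overlap, list_size, network_size, universe_size):
    """P(X >= overlap) for a hypergeometric draw.

    universe_size: number of genes in the reference set
    list_size:     number of genes in the supplied list (e.g., a signature)
    network_size:  number of genes in the literature subnetwork
    overlap:       number of list genes found in the subnetwork
    """
    total = comb(universe_size, network_size)
    upper = min(list_size, network_size)
    tail = sum(
        comb(list_size, k) * comb(universe_size - list_size, network_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

For example, with a reference set of 10 genes, a list of 5 genes, and a subnetwork of 5 genes, a complete overlap of 5 has probability 1/252.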

Results

Discovery and validation of a WIPF1 coexpression module

We used microarray data from a study of 100 colorectal cancer specimens of mixed stage and subtype by Kaiser et al. for the identification of WIPF1 correlated genes (Gene Expression Omnibus (GEO) database identifier GSE5206) [17]. We correlated the expression profile of the WIPF1 gene across all experiments with the expression profile of each other gene that passed initial quality criteria. A histogram of the resulting correlation p values is shown in Fig. 1a. The strong deviation from a uniform distribution suggests that a large number of genes are correlated with WIPF1 expression in this data set. A threshold for significantly correlated expression was set at p < 1e−6 resulting in 430 candidate genes. For validation of the WIPF1 coexpression of these genes, we employed a data set of Ayers et al. (GSE7208) on 59 colorectal tumors [20]. We correlated the expression patterns of the 430 discovered genes to the WIPF1 profile. This led to the p value distribution shown in Fig. 1b that exhibits an even stronger deviation from the uniform distribution than for the discovery gene set, thus documenting the enrichment of true WIPF1 coexpressed genes. Of the 430 genes, 112 were correlated with p < 1e−5. A scatter plot of correlation coefficients in Ayers and Aronow data (Fig. 1c) confirmed that for all 112 genes, the correlation coefficients have the same sign, meaning that expression changes in the data sets have the same direction. In the following, these 112 genes were considered as the WIPF1 coexpression module, and their associated expression profiles constitute the WIPF1 signature.

Fig. 1

Discovery and validation of the WIPF1 coexpression module. a The histogram shows the distribution of p values resulting from tests for correlation of the WIPF1 expression profile with expression profiles of each other gene in the Aronow data set. b The histogram shows the distribution of 430 p values resulting from correlation of expression profiles in the Ayers data set of the WIPF1 gene with each discovered gene in the Aronow data set. Note the strong deviation from the uniform distribution, indicating a high rate of successful validation. c A scatter plot of the correlation coefficients in Aronow and Ayers data of each of the 112 successfully validated genes reveals that most validated genes are positively correlated with the WIPF1 profile and only six genes are anti-correlated with WIPF1. All correlation coefficients are consistent in both data sets, i.e., they have the same sign

Prediction of low-risk colorectal cancers using the WIPF1 signature

We fused the discovery and validation data of the WIPF1 module resulting in a data set with 112 genes and 159 patients. Then, we performed hierarchical clustering on both genes and patients (see Fig. 2). It revealed two major clusters of genes and five major clusters of patients. Patient cluster “A” (Fig. 2, marked in red) comprises 25 tumors that exhibit a characteristic down-regulation of the larger gene cluster. In contrast, other tumor clusters (collectively labeled as “B”) show an up-regulation of the larger gene cluster or indifferent expression. Using patient cluster labels as the prediction target, we trained a k-nearest neighbor classifier (k = 9) for prediction of WIPF1 cluster labels in further independent data sets.

Fig. 2

Two-way hierarchical clustering of colorectal cancer expression data from WIPF1-coexpressed genes. The data set results from fusing the Aronow and Ayers data sets and comprises 112 genes and 159 patients. Top: The dendrogram results from hierarchical clustering of tumors. Light gray indicates low expression, dark gray indicates high expression (log-transformed mean-centered). Five clusters of tumors are evident: cluster A is colored in red, all other clusters are colored in black. Left: The dendrogram results from hierarchical clustering of genes. Two main clusters of genes are evident: the larger cluster X is colored in red, the smaller cluster Y in black. Note the characteristic down-regulation of genes from cluster X in tumors from cluster A. Other tumors have anti-correlated or indifferent expression profiles

We were interested in the question whether the cluster labels can predict certain clinical parameters in an independent colorectal cancer data set. We determined expression profiles for 62 colorectal cancer patients for which information about survival has been tracked. Direct application of the k-nearest neighbor classifier (trained on gene-wise mean-centered combined Aronow/Ayers data) on the 62 tumor expression profiles led to the assignment of cluster label “A” for nine patients and cluster label “B” for the remaining 53 patients (for a more detailed explanation/justification of the application of the k-nearest neighbor classifier across data sets we refer to the “Materials and methods” section). We compared the survival expectancy of the two groups using Kaplan–Meier plots (see Fig. 3). We found that all nine cluster “A” colorectal cancer patients have survived until the end of the study. The differences in the survival curves between clusters “A” and “B” were already significant (p = 0.011), although the numbers of patients are rather small. These results suggest that the WIPF1 expression module in combination with the applied scheme of normalization/classification also has potential for the prediction of low-risk cancers in further independent studies.

Fig. 3

Differences in survival between colorectal cancer patients of the “cluster A” and “cluster B” types in an independent microarray data set. Tumors that resembled the cluster “A” type were predicted by machine learning in an independent data set of 62 tumors with microarray expression profiles. Nine cluster “A”-like patients all survived until the study end. The difference in survival compared with 53 remaining patients is significant with p = 0.011 in the logrank test

The WIPF1 signature has prognostic predictivity for brain and breast tumors

We fitted Cox regression models for genes of the WIPF1 signature and tested their individual association with survival using Wald tests on microarray expression studies of 286 breast cancers (Wang et al., GSE2034) and 77 high-grade gliomas (Phillips et al., GSE4271). We found that a high fraction (50 of 112) of genes of the WIPF1 signature were significantly associated with survival (p < 0.05) in both studies. A list of 38 genes for which significant correlation of WIPF1 expression across patients was validated in Aronow, Ayers, and Staub data sets is given in Table 1 together with results of statistical tests on WIPF1 correlation and survival/relapse association.

Table 1 Summary of WIPF1 correlation and survival/relapse association for 38 core genes of the WIPF1 module and WIPF1 itself

We next applied the k-nearest neighbor classifier trained on Aronow/Ayers data directly on the data sets of breast cancers and high-grade gliomas to stratify these patient collectives. Among the microarray profiles of 286 breast tumor patients of a data set of Wang et al. (GSE2034), we identified 59 of the cluster “A” type. These patients had a remarkably better prognosis regarding time to relapse, the difference between cluster “A” patients and others being significant at p = 0.0045 (see Fig. 4a and b). In the glioma patient set of Phillips et al. (GSE4271), 24 out of 77 patients were assigned the cluster “A” type by classification. The cluster “A” glioma patients had a better prognosis (p = 0.012, see Fig. 4c and d). In conclusion, the application of the WIPF1 k-nearest neighbor classifier on two different tumor sets further validates the predictive value of the WIPF1 coexpression module for the identification of patients with favorable prognosis.

Fig. 4

The WIPF1 module identifies breast cancer and glioma patients with better prognosis. Histogram (a) visualizes the distribution of p values resulting from logrank tests for the association of profiles of single genes of the WIPF1 module with relapse in the Wang data on breast cancers. b The Kaplan–Meier curves for breast cancer patients assigned to the groups cluster “A” (upper curve) or cluster “B”. Histogram (c) visualizes the distribution of p values resulting from logrank tests for the association of single expression profiles of the WIPF1 coexpressed genes with relapse in the Phillips data on gliomas. d The Kaplan–Meier curves for glioma patients assigned to the groups “A” (upper curve) or “B”. Note the strong deviation from the uniform distribution (that would result from chance association) in plots (a) or (c) and the consistently higher fractions of survivors among “A” patients in (b) and (d). These plots visualize the high fraction of genes of the WIPF1 module that are individually associated with survival in breast and brain cancers and the survival significance of the complete module

The WIPF1 signature has links to c-myc, p53, proliferation, and apoptosis

Because of the multitude of genes that are correlated with WIPF1 (see Fig. 1), we hypothesized that a dominating cancer-driving mechanism is linked to WIPF1 expression. It was frequently reported that patient prognosis is associated with the expression of a so-called proliferation signature. This term circumscribes a large set of signatures that can be pulled from a large set of genes deregulated during the cell cycle. To our knowledge, a link between the proliferation signature and colorectal cancer has not been established so far. We found that the WIPF1 signature does not have a substantial overlap with proliferation signatures. To investigate an indirect link to the transcriptional program of proliferation, we tested the proliferation signature described by Rosenwald et al. for correlation with our WIPF1 signature [36]. Correlation tests of signature centroids revealed that both signatures are significantly correlated in three data sets: the colorectal cancer data generated during this study, the Wang data on breast cancer, and the Phillips data on gliomas (see Fig. 5). This suggests a regulatory link of the WIPF1 signature with typical proliferation genes.

Fig. 5

Similarity of the WIPF1 and proliferation signatures. We determined average profiles (signature centroids) of the WIPF1 and the proliferation signatures in three data sets. Only 107 of 112 WIPF1 module genes with concordant lower expression in cluster X of the WIPF1 signature were considered for this analysis. Signature centroids were determined using averaging over all genes for each patient. The scatter plots visualize the strength of the correlation between proliferation signature and WIPF1 signature. Coordinates of each data point correspond to a single patient’s averaged mean-centered expression values for both signatures, a for the Staub colorectal cancer data set, b for the Wang breast cancer data set, and c for the Phillips glioma expression data set

We next addressed the question which functional themes dominate the list of 112 WIPF1-coexpressed genes. We investigated the genes of the WIPF1 module by mining of literature networks using the MetaCore software. We identified several gene networks that were highly enriched with genes of the WIPF1 module (see Table 2) and are therefore suited to highlight the function of the WIPF1 module. First, we took all different types of functional links into account (protein–protein interactions, enzymatic reactions, transcriptional regulation, etc.) for building of literature subnetworks. We identified three subnetworks that are enriched with gene functions in cancer-relevant processes like mitosis and apoptosis as important themes among WIPF1 signature-induced networks (see Table 2). By focusing the analysis on literature networks centered at transcription factors, three subnetworks stood out with 21 genes linked via the oncogenic transcription factors c-myc, ESR1, and p53 (see Table 2 and Fig. 6). For these three factors, there are 17 genes with literature evidence that c-myc, ESR1, and/or p53 regulate their transcription. Thus, literature network analysis could link the WIPF1 signature to several functional themes related to cancer biology with high significance: mitosis and proliferation, apoptosis, the estrogen receptor, c-myc, and p53.

Table 2 Summaries of top literature subnetworks enriched with genes from the WIPF1 coexpression module
Fig. 6

Fused transcriptional regulation network of genes of the WIPF1 module that are regulated by c-myc, ESR1 or p53. Here we show the fused network of WIPF1 co-expressed genes that are linked to c-myc, ESR1 or p53 according to literature evidence. The legend on the left depicts the graphical symbols that describe the type of protein in the network. The arrows indicate the direction of the regulation. The colors of circles around gene names represent the coupling to transcription factors: green for c-myc, red for p53, blue for ESR1

Discussion

The hypothesis that the WIPF1 gene is important for cancer development was based on two facts. First, its encoded WIP protein interacts with the Wiskott–Aldrich syndrome protein WASP through a surface that is affected by a WASP mutation, and Wiskott–Aldrich syndrome predisposes to leukemia and lymphoma. Second, expression levels of the WIP protein influence the migratory and differentiation properties of fibroblasts. It has not been studied so far how the expression of the WIPF1 gene is regulated and if its coexpression neighborhood provides an additional link to cancer. Here, we show that there exists a module of genes that is coexpressed with WIPF1 in colorectal cancers. The majority of genes in this module show a characteristic down-regulation in several cancer patients with longer survival time or time to relapse, including in cancer types other than colorectal cancer. We found that the module genes do not overlap with the frequently rediscovered “proliferation” signature that is regulated during cancer cell mitosis. Instead, a large number of genes of the WIPF1 coexpression module have poorly characterized functions. Only single genes link directly to cancer-relevant processes like proliferation and apoptosis. However, we could show that the expression profile of the WIPF1 signature correlates significantly with the expression profile of the Rosenwald proliferation signature. Literature networks revealed that the link of the WIPF1 module to proliferation can partly be explained by the fact that a large fraction of WIPF1 module genes are known transcriptional targets of oncogenic transcription factors like c-myc, ESR1, or p53. In this context, it is interesting to note that estrogen receptor signaling is not only of importance for breast cancers, but also apparently able to modulate the aggressiveness of prostate cancers [38].
It is tempting to hypothesize that keeping the expression of the WIPF1 module in a low expression state is causative for a less aggressive cancer phenotype, e.g., by inhibition of WIPF1/WASP-related cytoskeletal remodeling that coincides with a reduced ability of cells to migrate and metastasize.

In conclusion, we presented a module of WIPF1-coexpressed genes. The expression signature of this module could be used to identify patients with better prognosis with respect to relapse or survival in expression data sets of three different tumor types, colorectal cancer, breast cancer, and high-grade glioma. The WIPF1 coexpressed genes seem to be linked to proliferation and apoptosis possibly by regulation through c-myc, ESR1, and p53. We propose the WIPF1 signature as an alternative predictor of breast, brain, and colorectal cancer prognosis.