Abstract
A major challenge in clinical cancer research is the accurate identification of molecular subtypes. Although unsupervised clustering methods have been applied to class discovery, their limited accuracy remains a bottleneck in molecular subtype discovery. In this analysis, we hypothesize that spectral clustering can identify molecular subtypes that correlate with survival outcomes. We propose a subtype identification method based on spectral clustering, Cancer Subtype Identification with Spectral Clustering using Nyström approximation (CSISCN), for the discovery of molecular subtypes. CSISCN could be used to improve gene expression-based identification of breast cancer molecular subtypes. We demonstrated that CSISCN identified molecular subtypes with distinct clinical outcomes and remained valid across different numbers of molecular subtypes. Furthermore, the molecular subtypes identified by CSISCN showed improved clinical and molecular relevance, and CSISCN significantly outperformed the consensus clustering and spectral clustering methods. To test the general applicability of CSISCN, we further applied it to human colorectal cancer (CRC) and acute myeloid leukemia (AML) datasets, where it showed superior performance compared with consensus clustering. In summary, CSISCN demonstrated great potential for gene expression-based subtype identification.
Introduction
Identifying cancer subtypes is one of the leading areas of study in clinical cancer research. Accurate subtype identification typically helps to determine the appropriate therapy and thus improves the survival rate of cancer patients. To date, rapidly developing high-throughput platforms such as gene expression profiling1, 2, human whole-genome sequencing3, 4 and whole-exome sequencing5 have been applied to cancer data for the prioritization of expression-based signatures6, 7, the discovery of recurrent mutations3, 4, the identification of molecular subtypes1, 8, the development of prognosis models9, 10 and the selection of patients likely to benefit from particular targeted therapies11. In particular, advances in cancer genomics have revealed marked clinical and molecular heterogeneity in treatment response and survival outcomes12, 13. However, the heterogeneity of tumor samples poses considerable challenges for evaluating prognosis and selecting an appropriate treatment for each individual patient14. Thus, there is an urgent need for an accurate subtype identification method to support the development of prognostic and therapeutic strategies.
Traditional unsupervised clustering methods have shown great potential in identifying modular networks15, discovering molecularly distinct subtypes16,17,18 and identifying oncogenic pathway signatures11 in cancer research. Specifically, consensus clustering has been widely used for class discovery19, 20 and the identification of consensus molecular subtypes8, 21. Whereas traditional clustering algorithms are mainly founded on Euclidean geometry and cannot handle nonlinear structure in data, spectral clustering adapts to a broader range of geometries because it can identify non-convex patterns and linearly non-separable clusters22. Importantly, spectral clustering has been widely used in machine learning and pattern recognition22,23,24,25. It partitions the points into distinct clusters based on the eigenstructure of the similarity matrix, so that points in the same cluster have high similarity and points in different clusters have low similarity26. Despite its good performance, spectral clustering is often limited in large-scale applications by its high computational complexity27. To address this challenge, spectral clustering using Nyström approximation was introduced to reduce the computational cost of the matrix decomposition and improve the clustering accuracy28, 29.
In this paper, we aimed to develop and evaluate a spectral clustering method using Nyström approximation for identifying molecular subtypes of cancer. We investigated whether this method could identify molecular subtypes with improved clinical and molecular relevance. We proposed a subtype identification method based on spectral clustering, Cancer Subtype Identification with Spectral Clustering using Nyström approximation (CSISCN), for the discovery of molecular subtypes. We started with the discovery of molecular subtypes in breast cancer patients based on gene expression profiles (GEPs). We then demonstrated that 1) CSISCN identified molecular subtypes with distinct clinical outcomes; 2) CSISCN remained valid across different numbers of molecular subtypes; and 3) CSISCN identified molecular subtypes with improved clinical and molecular relevance compared with the consensus clustering and spectral clustering methods. To test the general applicability of CSISCN, we further applied it to human CRC and AML datasets and demonstrated superior performance compared with the consensus clustering and spectral clustering methods.
Methods
Gene expression datasets of cancer patients
Breast cancer comprises distinct biological subtypes, defined by markers including HER2, ER and PR, that carry different prognostic and therapeutic implications. We collected breast cancer gene expression datasets generated on the Affymetrix U133A platform from public resources. The gene expression datasets GSE25055 (ref. 30), GSE25065 (ref. 30) and GSE6532 (ref. 31) were downloaded from the Gene Expression Omnibus (GEO) database. In a neoadjuvant study, the 310 HER2-negative breast cancer cases in GSE25055 and the 198 HER2-negative breast cancer cases in GSE25065 were treated pre-operatively with taxane-anthracycline chemotherapy and with endocrine therapy. GSE6532 is a dataset in which clinically distinct molecular subtypes were identified in estrogen receptor-positive breast carcinomas. In this study, tumor samples from GSE25055 were used as the training cohort, and those from GSE25065 and GSE6532 were used as independent validation cohorts.
Mutations in the genes APC, KRAS, PIK3CA and TP53 allow the identification of prognostic subgroups in colorectal cancer (CRC). The TCGA (The Cancer Genome Atlas) study recently reported three transcriptomic subtypes of CRC, designated “microsatellite instability/CpG island methylator phenotype” (MSI/CIMP), “invasive”, and “chromosomal instability” (CIN)32. The training cohort GSE17536 (refs 33, 34), comprising 111 samples from CRC patients, was obtained from the GEO database. The independent validation cohort GSE17537 (refs 33, 34) was also downloaded from the GEO database. Stage I and IV samples were excluded from this study. Both CRC datasets were generated on the Affymetrix U133 Plus 2.0 platform. The metastasis gene expression profiles GSE17536 (Moffitt patients) and GSE17537 (VMC patients) were developed from highly invasive mouse colon cancer cells and non-invasive colon cancer cells, respectively.
Acute myeloid leukemia (AML) patients are classified into M0-M7 subgroups according to the FAB (French–American–British) criteria35. For AML, two gene expression datasets, GSE12417 (ref. 36; HG-U133A) and GSE10358 (ref. 37; HG-U133Plus2), were downloaded from the GEO database. In GSE12417, 163 samples of bone marrow or peripheral blood mononuclear cells were obtained from adult patients with untreated AML. In GSE10358, high-throughput sequencing of genomic DNA or RNA was performed on bone marrow (tumor) and matched skin biopsy (germline) samples from over 300 patients with de novo AML. GSE12417 was used as the training cohort and GSE10358 as the test cohort.
For the tumor gene expression datasets, all Affymetrix CEL files were normalized using the Robust Multi-array Average (RMA) algorithm38 from the R Bioconductor package. Probe set identifiers (IDs) were mapped to gene symbols using the mapping from the GEO database. When multiple probe sets mapped to the same gene, the probe set with the largest interquartile range (IQR) was selected because of its high variation across samples. Probe sets that mapped to multiple genes were eliminated. A Z-score transformation was used to standardize the expression values of each gene. The datasets were processed separately to ensure their independence. The clinical characteristics of the breast cancer, CRC and AML tumor samples are listed in Table 1.
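The probe selection and standardization steps described above can be sketched in a few lines. This is only an illustrative NumPy sketch, not the R/Bioconductor pipeline used in the study: the function names, the toy matrix and the gene labels are hypothetical, and probes mapping to several genes are assumed to have been removed beforehand.

```python
import numpy as np

def collapse_probes_by_iqr(expr, probe_genes):
    """Keep, for every gene, the probe set with the largest IQR across samples.

    expr: (n_probes, n_samples) array of already normalized expression values.
    probe_genes: one gene symbol per probe row (hypothetical input format).
    """
    q75, q25 = np.percentile(expr, [75, 25], axis=1)
    iqr = q75 - q25
    best = {}  # gene symbol -> row index of its most variable probe set
    for row, gene in enumerate(probe_genes):
        if gene not in best or iqr[row] > iqr[best[gene]]:
            best[gene] = row
    genes = sorted(best)
    return genes, expr[[best[g] for g in genes], :]

def zscore_genes(expr):
    """Standardize each gene (row) to mean 0 and sample standard deviation 1."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, ddof=1, keepdims=True)
    std[std == 0] = 1.0  # constant genes stay at zero instead of dividing by 0
    return (expr - mean) / std

# Toy data: two probe sets for TP53, one for KRAS, four samples.
expr = np.array([[1.0, 2.0, 3.0, 4.0],
                 [1.0, 1.0, 1.0, 2.0],
                 [5.0, 7.0, 9.0, 11.0]])
probe_genes = ["TP53", "TP53", "KRAS"]
genes, collapsed = collapse_probes_by_iqr(expr, probe_genes)
z = zscore_genes(collapsed)
```

Applying `zscore_genes` to each dataset on its own, rather than to a merged matrix, mirrors the statement that the datasets were processed separately.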
Spectral clustering using Nyström approximation
Input: data points \(X=\{x_{1},\ldots ,x_{n}\}\) representing the gene expression levels of the patients; \(\ell \): number of random samples; \(\sigma \): Gaussian function scaling parameter; \(k\): number of clusters to identify; \(n\): number of patients; \(k < \ell < n\), \(1\le i,j\le n\).
1. Form the similarity matrix \(S\in {\mathbb{R}}^{n\times n}\) defined by \(s_{ij}=\exp (-\Vert x_{i}-x_{j}{\Vert }^{2}/2{\sigma }^{2})\) if \(i\ne j\), and \(s_{ii}=0\).
2. Let \(A\) represent the \(\ell \times \ell \) matrix of similarities between the sample points, \(B\) the \(\ell \times (n-\ell )\) matrix of affinities between the \(\ell \) sample points and the \((n-\ell )\) remaining points, and \(C\) the \((n-\ell )\times (n-\ell )\) submatrix of similarities among the remaining points. The dense similarity matrix \(S_{d}\) is the similarity matrix \(S\) rearranged into these blocks, \(S_{d}=\left[\begin{array}{cc}A & B\\ B^{T} & C\end{array}\right]\).
3. Let \(W=\left[\begin{array}{c}A\\ B^{T}\end{array}\right]\) be the \(n\times \ell \) matrix consisting of \(A\) and \(B^{T}\), and define the Nyström approximation \(\tilde{S}=WA^{-1}W^{T}=\left[\begin{array}{cc}A & B\\ B^{T} & B^{T}A^{-1}B\end{array}\right]\approx S_{d}\).
4. Calculate the diagonal matrix \(\tilde{D}={\rm{diag}}\left(\left[\begin{array}{c}A{1}_{\ell }+B{1}_{n-\ell }\\ B^{T}{1}_{\ell }+B^{T}A^{-1}B{1}_{n-\ell }\end{array}\right]\right)\), where \({1}_{m}\) denotes the all-ones vector of length \(m\).
5. Define the Laplacian matrix \(\tilde{L}=I-{\tilde{D}}^{-1/2}\tilde{S}{\tilde{D}}^{-1/2}\).
6. Define \(R=\bar{A}+{\bar{A}}^{-1/2}\bar{B}{\bar{B}}^{T}{\bar{A}}^{-1/2}\), where \(\bar{A}={\tilde{D}}_{1:\ell ,1:\ell }^{-1/2}A{\tilde{D}}_{1:\ell ,1:\ell }^{-1/2}\) and \(\bar{B}={\tilde{D}}_{1:\ell ,1:\ell }^{-1/2}B{\tilde{D}}_{\ell +1:n,\ell +1:n}^{-1/2}\).
7. Calculate the eigendecomposition \(R=U_{R}{\Lambda }_{R}U_{R}^{T}\), where \({\Lambda }_{R}\) contains the eigenvalues in decreasing order and \(U_{R}\) the corresponding eigenvectors.
8. Calculate \(\tilde{V}=\left[\begin{array}{c}\bar{A}\\ {\bar{B}}^{T}\end{array}\right]{\bar{A}}^{-1/2}{(U_{R})}_{:,1:k}{({\Lambda }_{R}^{-1/2})}_{1:k,1:k}\) using the first \(k\) eigenvectors.
9. Define the row-normalized matrix \(\tilde{U}\) with \({\tilde{u}}_{il}=\frac{{\tilde{V}}_{il}}{\sqrt{{\sum }_{r=1}^{k}{\tilde{V}}_{ir}^{2}}}\), where \(l=1,\ldots ,k\).
10. Perform the k-means algorithm to cluster the \(n\) rows of \(\tilde{U}\) into \(k\) groups. The k-means algorithm minimizes the objective function \({\sum }_{i=1}^{k}{\sum }_{u_{j}\in C_{i}}\Vert u_{j}-c_{i}{\Vert }^{2}\), where the \(u_{j}\) are the vectors corresponding to the \(n\) rows of \(\tilde{U}\) and \(c_{i}\) is the centroid of all the points \(u_{j}\) belonging to cluster \(C_{i}\). We define \(c_{i}=\frac{1}{|C_{i}|}{\sum }_{u_{j}\in C_{i}}u_{j}\), where \(C_{i}=\{u_{p}:\Vert u_{p}-c_{i}{\Vert }^{2}\le \Vert u_{p}-c_{j}{\Vert }^{2}\ {\rm{for}}\ {\rm{all}}\ 1\le j\le k\}\).
11. Terminate the k-means iterations when the relative difference between two successive values of the objective function is less than 0.001.
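The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the Matlab implementation used in the study: the function names are hypothetical, the landmark points are drawn uniformly at random, the unit diagonal of the Gaussian kernel is kept (rather than \(s_{ii}=0\)) so that \(\bar{A}\) stays positive definite and its inverse square root is real, and k-means is seeded with a farthest-first rule, which the text does not specify.

```python
import numpy as np

def sym_power(M, p, eps=1e-10):
    """Power of a symmetric positive definite matrix via its eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return (Q * np.maximum(w, eps) ** p) @ Q.T

def gaussian_kernel(P, Q, sigma):
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nystrom_embedding(X, ell, sigma, k, rng):
    """Steps 2-9: row-normalized spectral embedding from ell random landmarks."""
    n = X.shape[0]
    perm = rng.permutation(n)                 # first ell entries are the landmarks
    Xs, Xr = X[perm[:ell]], X[perm[ell:]]
    A = gaussian_kernel(Xs, Xs, sigma)        # ell x ell (unit diagonal: assumption)
    B = gaussian_kernel(Xs, Xr, sigma)        # ell x (n - ell)
    # Step 4: row sums of the approximated similarity matrix.
    d_top = A.sum(1) + B.sum(1)
    d_bot = B.sum(0) + B.T @ (np.linalg.pinv(A) @ B.sum(1))
    # Step 6: normalized blocks and the ell x ell matrix R.
    A_bar = A / np.sqrt(np.outer(d_top, d_top))
    B_bar = B / np.sqrt(np.outer(d_top, d_bot))
    A_isqrt = sym_power(A_bar, -0.5)
    R = A_bar + A_isqrt @ B_bar @ B_bar.T @ A_isqrt
    # Step 7: eigenpairs of R, largest eigenvalues first.
    lam, U = np.linalg.eigh(R)
    top = np.argsort(lam)[::-1][:k]
    lam, U = lam[top], U[:, top]
    # Steps 8-9: extend to all n points, then normalize each row to unit length.
    V = np.vstack([A_bar, B_bar.T]) @ A_isqrt @ U / np.sqrt(lam)
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    out = np.empty_like(V)
    out[perm] = V                             # undo the landmark permutation
    return out

def kmeans(U, k, rng, tol=1e-3, max_iter=100):
    """Steps 10-11: Lloyd iterations, stopped on relative objective change < tol."""
    centers = [U[rng.integers(len(U))]]
    for _ in range(1, k):                     # farthest-first seeding (assumption)
        d2 = np.min([((U - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(U[np.argmax(d2)])
    centers = np.array(centers)
    prev = np.inf
    for _ in range(max_iter):
        d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        obj = d2[np.arange(len(U)), labels].sum()
        if np.isfinite(prev) and abs(prev - obj) <= tol * prev:
            break
        prev = obj
        for i in range(k):
            if np.any(labels == i):
                centers[i] = U[labels == i].mean(0)
    return labels

# Toy data: two well-separated groups of 20 "patients" with 5 "genes" each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 5)), rng.normal(8.0, 1.0, (20, 5))])
embedding = nystrom_embedding(X, ell=20, sigma=2.0, k=2, rng=rng)
labels = kmeans(embedding, k=2, rng=rng)
```

Only the \(\ell \times \ell \) matrix \(R\) is decomposed, which is where the saving over a full \(n\times n\) eigendecomposition comes from.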
Subtype identification of CSISCN and consensus clustering approach
We proposed a subtype identification method based on spectral clustering, Cancer Subtype Identification with Spectral Clustering using Nyström approximation (CSISCN), for the discovery of molecular subtypes. A Matlab implementation of spectral clustering using Nyström approximation was used in this study. For the tumor samples of the training cohort, we varied the parameter σ over the candidate set {20, 30, 40, 50} for each cancer type. The full set of gene symbols was used as the outcome-related genes for the input features. In the implementation of spectral clustering using Nyström approximation, we set the number of random samples to half the sample size of the training cohort for each cancer type. The k-means algorithm was performed to identify the k clusters. The association between the identified k clusters and the real prognosis of the patients was assessed by Kaplan-Meier survival curves and the log-rank test. Each candidate value of σ was evaluated by its log-rank p-value over 10 runs, and the value of σ with the smallest p-value was selected. The selected σ was then applied to the independent validation dataset, and the performance was evaluated with Kaplan-Meier estimated survival curves.
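The σ selection loop described above can be sketched as follows. This is a hedged illustration only: `select_sigma` and `toy_pvalue` are hypothetical names, the p-value function stands in for "cluster the training cohort with this σ and run the log-rank test", and taking the minimum over the 10 runs is an assumption, since the text does not state how the runs are aggregated.

```python
import random

def select_sigma(candidates, pvalue_fn, n_runs=10, aggregate=min):
    """Pick the sigma whose aggregated log-rank p-value over n_runs is smallest.

    pvalue_fn(sigma, seed) is assumed to cluster the training cohort with the
    given sigma and random seed and return the resulting log-rank p-value.
    """
    summary = {}
    for sigma in candidates:
        pvals = [pvalue_fn(sigma, seed) for seed in range(n_runs)]
        summary[sigma] = aggregate(pvals)
    best = min(summary, key=summary.get)
    return best, summary

def toy_pvalue(sigma, seed):
    """Stand-in p-value with made-up base levels; NOT results from the study."""
    rng = random.Random(seed * 1000 + sigma)
    base = {20: 0.01, 30: 0.04, 40: 0.20, 50: 0.50}[sigma]
    return base * (1.0 + 0.1 * rng.random())

best_sigma, summary = select_sigma([20, 30, 40, 50], toy_pvalue)
```

Passing `aggregate=statistics.median` (from the standard library) instead of `min` would give a selection rule that is less sensitive to a single favorable run.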
For reference, we compared the performance of the CSISCN approach with that of consensus clustering19, a state-of-the-art unsupervised clustering method. Consensus clustering has proved effective in solving different biological problems, including gene expression-based class discovery19, identification of biologically functional modules in Protein–Protein Interaction (PPI) networks39, and cancer subtype discovery40. The R implementation available in the ConsensusClusterPlus package41 was used for consensus clustering. The Pearson correlation coefficient distance was used with hierarchical clustering. The consensus clusters were identified as cancer subtypes from 100 resampling iterations of the hierarchical clustering, each using the full set of gene symbols (100%) and a randomly selected 80% of the samples. The association between the identified cancer subtypes and the real prognosis of the patients was then assessed by Kaplan-Meier survival curves and the log-rank test. The number of consensus clusters was varied from k = 2 to k = 10.
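The resampling scheme behind consensus clustering can be sketched as below. This is a simplified NumPy illustration, not the ConsensusClusterPlus implementation: the function names and the toy data are hypothetical, average linkage is an assumption (the text only says hierarchical clustering), and the naive agglomerative routine is suitable only for small sample counts.

```python
import numpy as np

def pearson_distance(X):
    """1 - Pearson correlation between samples (rows of X)."""
    return 1.0 - np.corrcoef(X)

def average_linkage(D, k):
    """Naive agglomerative clustering on a distance matrix, cut at k clusters."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    labels = np.empty(len(D), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

def consensus_clustering(X, k, n_iter=100, frac=0.8, seed=0):
    """Cluster random 80% subsamples and count how often pairs co-cluster."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    D = pearson_distance(X)                   # all genes are used in every run
    together = np.zeros((n, n))               # times a pair shared a cluster
    sampled = np.zeros((n, n))                # times a pair was drawn together
    m = int(round(frac * n))
    for _ in range(n_iter):
        idx = np.sort(rng.choice(n, size=m, replace=False))
        labels = average_linkage(D[np.ix_(idx, idx)], k)
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1.0
    consensus = np.divide(together, sampled,
                          out=np.zeros_like(together), where=sampled > 0)
    # Final subtypes: hierarchical clustering on 1 - consensus.
    return average_linkage(1.0 - consensus, k), consensus

# Toy data: 12 samples x 30 genes, two groups with opposite expression patterns.
rng = np.random.default_rng(1)
pattern = rng.normal(size=30)
X = np.vstack([pattern + 0.3 * rng.normal(size=(6, 30)),
               -pattern + 0.3 * rng.normal(size=(6, 30))])
subtypes, consensus = consensus_clustering(X, k=2)
```

Each entry of `consensus` is the fraction of resampling runs in which two samples were drawn together and assigned to the same cluster, so values near 0 or 1 indicate a stable partition.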
Survival analysis
The association between the molecular subtype and real prognosis of the patients was evaluated by the Kaplan-Meier survival curves and log-rank test. Standard Kaplan–Meier survival curves were generated for each cancer subtype, and the survival difference between molecular subtypes was statistically evaluated using the log-rank test. An R implementation in the survival package was used for survival analysis. P-values of less than 0.05 were considered statistically significant.
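For readers who want to see what the survival evaluation computes, the following is a minimal sketch of the Kaplan–Meier estimator and a two-group log-rank test. It is an illustration in Python, not the R survival package used in the study: the function names and toy data are hypothetical, only two groups are handled (comparisons of three or more subtypes need the k-sample form of the test), and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math
import numpy as np

def kaplan_meier(time, event):
    """Distinct event times and the Kaplan-Meier survival estimate at each."""
    time, event = np.asarray(time, float), np.asarray(event, bool)
    times = np.unique(time[event])
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & event)
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return times, np.array(surv)

def survival_at(times, surv, t):
    """Read the step function at time t (e.g. survival at 3 years)."""
    idx = np.searchsorted(times, t, side="right") - 1
    return 1.0 if idx < 0 else float(surv[idx])

def logrank_two_groups(time, event, group):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    time, event = np.asarray(time, float), np.asarray(event, bool)
    g1 = np.asarray(group) == np.unique(group)[0]
    obs_minus_exp, var = 0.0, 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & g1).sum()
        d = ((time == t) & event).sum()
        d1 = ((time == t) & event & g1).sum()
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:  # hypergeometric variance of d1 at this event time
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / var
    p = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function, 1 d.f.
    return chi2, p

# Toy data: group "A" relapses early, group "B" late; event=1 means relapse.
time = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
event = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
group = ["A"] * 5 + ["B"] * 5
chi2, p_value = logrank_two_groups(time, event, group)
km_times, km_surv = kaplan_meier([1, 2, 3], [1, 1, 1])
```

With censored patients, `event` is 0 for those observations; they still count toward the number at risk until their last follow-up time.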
Results
Overview of the CSISCN development and evaluation workflow
Figure 1 gives an overview of the CSISCN development and evaluation workflow. Microarray gene expression data for a specific cancer type were collected, normalized, and then z-score transformed separately. Molecular subtypes of cancer were discovered by spectral clustering using Nyström approximation and the k-means algorithm, with the full set of gene symbols of the GEPs. On the training set, we varied the Gaussian function scaling parameter σ over the candidate set to construct the similarity matrix. CSISCN identified the k clusters as molecular subtypes of cancer based on the selected optimal parameter. The association between the identified molecular subtypes and the real prognosis of the patients was assessed by Kaplan-Meier survival analysis. The optimal parameter σ selected for CSISCN was then applied to the independent validation dataset. The k clusters were taken as molecular subtypes to stratify the validation cohort, and the prediction performance was evaluated with Kaplan-Meier survival curves and the log-rank test. For the tumor samples of the validation dataset in each cancer type, we set the number of random samples to half the sample size of the test cohort.
The CSISCN identifies the molecular subtypes with distinct clinical outcomes
We used CSISCN to identify molecular subtypes from tumor GEPs, taking breast cancer as an example. GSE25055 was used as the training cohort for clustering development. GSE25065 and GSE6532 were then used as two independent validation cohorts to validate the approach. For each value of σ, log-rank p-values were generated over ten repeated runs in order to obtain robust performance estimates. In this analysis, σ = 20 gave the smallest p-value on the training cohort and was then applied to the independent validation datasets.
To identify differences in gene expression between molecular subtypes, we applied CSISCN to stratify the cancer patients into k clusters. Figure 2 shows the molecular subtypes of breast cancer with distinct cluster-discriminating patterns. The heatmap further reveals the subtype-specific patterns of alterations in GEPs.
GSE25055 was used as the training cohort to develop CSISCN for identifying molecular subtypes. As shown in Fig. 3a, the subtype 1 group had significantly worse distant relapse-free survival than the subtype 2 group. The distant relapse-free survival at 3 years was 70% for the subtype 1 group compared with 80% for the subtype 2 group. As shown in Fig. 3b, the patients were separated into three subtypes with significantly different distant relapse-free survival. The distant relapse-free survival at 3 years was 78% for the subtype 1 group, 84% for the subtype 2 group and 64% for the subtype 3 group. As shown in Fig. 3c, the patients were stratified into four subtypes with significantly different distant relapse-free survival. The distant relapse-free survival at 3 years was 75% for the subtype 1 group, 69% for the subtype 2 group, 77% for the subtype 3 group and 85% for the subtype 4 group. To further test the generality of the method, we applied CSISCN to GSE25055 to identify from five to ten molecular subtypes. As shown in Fig. 4, the patients were stratified into five molecular subtypes (Fig. 4a) and six molecular subtypes (Fig. 4b) with significantly different relapse-free survival. Likewise, the patients were separated into eight (Fig. 5a), nine (Fig. 5b) and ten (Fig. 5c) molecular subtypes with significantly different distant relapse-free survival.
Using the parameter σ optimized on the training cohort, CSISCN was tested on the independent dataset GSE25065. Figure 3d shows that the subtype 2 group had significantly worse distant relapse-free survival than the subtype 1 group. The distant relapse-free survival at 3 years was 84% for the subtype 1 group compared with 74% for the subtype 2 group. Figure 3e shows that the patients were separated into three subtypes with significantly different distant relapse-free survival. The distant relapse-free survival at 3 years was 84% for the subtype 1 group, 85% for the subtype 2 group and 66% for the subtype 3 group. When CSISCN was applied to the breast cancer gene expression dataset GSE25065, the patients were also separated into five (Fig. 4d), seven (Fig. 4f) and nine (Fig. 5e) molecular subtypes with significantly different distant relapse-free survival.
To further validate CSISCN, we tested it on the independent dataset GSE6532. The subtype 1 group had significantly worse distant relapse-free survival than the subtype 2 group (Fig. 3g). The distant relapse-free survival at 3 years was 78% for the subtype 1 group compared with 83% for the subtype 2 group. The patients were also separated into three subtypes with significantly different distant relapse-free survival (Fig. 3h). The distant relapse-free survival at 3 years was 86% for the subtype 1 group, 79% for the subtype 2 group and 75% for the subtype 3 group. Figure 3i shows that the patients were stratified into four subtypes with significantly different distant relapse-free survival. The distant relapse-free survival at 3 years was 74% for the subtype 1 group, 77% for the subtype 2 group, 81% for the subtype 3 group and 89% for the subtype 4 group. As shown in Figs 4 and 5, the patients were stratified into different numbers of molecular subtypes with significantly different relapse-free survival when CSISCN was applied to the breast cancer gene expression dataset GSE6532 (Figs 4g,i and 5g,i).
Consequently, both the training results and the independent test results showed that CSISCN was able to identify molecular subtypes with significant differences in prognosis.
The CSISCN is effective in CRC datasets and AML datasets
To test the general applicability of CSISCN, we applied it to CRC gene expression datasets. The CRC gene expression dataset GSE17536, with 111 samples, was used as the training cohort to develop CSISCN for identifying molecular subtypes (Fig. 6a–c). Using the parameter σ optimized on the training cohort, CSISCN was then evaluated on 55 samples in the independent dataset GSE17537 (Fig. 6d–f). In this analysis, σ = 20 gave the smallest p-value for CSISCN on the CRC training cohort GSE17536. Figure 6a shows that the subtype 1 group had significantly worse relapse-free survival than the subtype 2 group. The patients were also separated into three subtypes (Fig. 6b) and four subtypes (Fig. 6c) with significantly different relapse-free survival. In the validation cohort, Fig. 6d shows that the subtype 1 group had significantly worse relapse-free survival than the subtype 2 group, and the patients were stratified into four subtypes with significantly different relapse-free survival (Fig. 6f).
In addition, CSISCN was applied to AML gene expression datasets to further validate its general applicability. As in the analysis above, we used the gene expression dataset GSE12417 (Fig. 7a–c) as the training cohort to develop CSISCN and kept GSE10358 (Fig. 7d–f) as an independent test dataset. In this analysis, σ = 30 gave the smallest p-value for CSISCN on the AML training cohort GSE12417. The subtype 1 group had significantly worse overall survival than the subtype 2 group (Fig. 7a). The patients were also separated into three subtypes (Fig. 7b) and four subtypes (Fig. 7c) with significantly different overall survival. In the test dataset, the subtype 2 group had significantly worse overall survival than the subtype 1 group (Fig. 7d), and Fig. 7e shows that the patients were separated into three subtypes with significantly different overall survival. In summary, these results were consistent with the observations in breast cancer and further demonstrated that CSISCN could identify molecular subtypes with distinct clinical outcomes.
The CSISCN is valid for the different numbers of molecular subtypes
To evaluate the validity of CSISCN with respect to the number of molecular subtypes, we tested different numbers of molecular subtypes for each cancer type. For the association between molecular subtype and real prognosis in GSE25055, statistically significant differences were found among the stratified patients, with log-rank p-values less than 0.05 (Figs 3a–c, 4a,b and 5a–c). Similar performance was obtained for the subtype-based stratification of patients in the independent validation datasets GSE25065 (Figs 3d,e, 4d,f and 5e) and GSE6532 (Figs 3g–i, 4g,i and 5g,i). These results suggested that CSISCN was reasonably effective for different numbers of molecular subtypes.
As depicted in Fig. 6, similar results were observed when CSISCN was applied to the CRC gene expression datasets GSE17536 and GSE17537. Figure 7 suggests that the molecular subtypes of AML with distinct clinical outcomes identified in the training set could be rediscovered in the validation dataset. These results were consistent with the observations in breast cancer and further supported the validity of CSISCN for different numbers of molecular subtypes.
The CSISCN identifies molecular subtypes for improving clinical and molecular relevance
We compared the CSISCN approach with consensus clustering, a state-of-the-art unsupervised method. In this analysis, we performed the comparisons for different numbers of molecular subtypes in each cancer type. Table 2 lists the log-rank p-values of CSISCN and consensus clustering on the training cohorts and independent test datasets. P-values of less than 0.05 were regarded as statistically significant.
According to the log-rank p-values for breast cancer GSE25055, CSISCN achieved better performance than the consensus clustering approach (Table 2). Similar results were obtained for the stratification of patients into different numbers of molecular subtypes in the independent validation datasets GSE25065 and GSE6532. For breast cancer GSE25065, CSISCN achieved its best clustering performance for three and nine molecular subtypes. For breast cancer GSE6532, CSISCN achieved its lowest log-rank p-value, 0.001, for five molecular subtypes. Thus, the p-values achieved by CSISCN tended to be more statistically significant than those of consensus clustering.
For CRC cohort GSE17536, CSISCN achieved better clustering performance than the consensus clustering approach for the different numbers of molecular subtypes (except for k = 2). Similarly, in AML cohort GSE12417, CSISCN achieved p-values that were more statistically significant than those of consensus clustering for the different numbers of molecular subtypes (except for k = 3, 6). Compared with consensus clustering, CSISCN also achieved better clustering performance for the identification of different numbers of molecular subtypes in the independent test datasets, CRC cohort GSE17537 and AML cohort GSE10358. These results reproduced the outcomes in breast cancer and further supported the improvement offered by CSISCN in identifying molecular subtypes.
The CSISCN improved clustering performance compared with spectral clustering method
To further validate the effectiveness of CSISCN, we compared it with the standard spectral clustering method. An available Matlab implementation of spectral clustering was used to identify the molecular subtypes from breast cancer GEPs. The Gaussian similarity function was used to construct the similarity matrix for spectral clustering. The parameter σ was varied over the candidate set {20, 30, 40, 50}, each value was evaluated by its log-rank p-value over 10 runs, and the value with the smallest p-value was selected. We tested different numbers of molecular subtypes for comparison. As shown in Table 2, CSISCN significantly outperformed spectral clustering for breast cancer GSE25055 (k = 2, 4, 6, 8, 9, 10), GSE25065 (k = 2, 3, 5, 7, 9, 10) and GSE6532 (k = 3, 4, 5, 7, 8, 9, 10). The results thus suggested that CSISCN achieved better clustering performance than spectral clustering.
We also compared CSISCN with the spectral clustering method in terms of running time. We performed the runtime experiments on a computer with 3.2 GHz CPUs and 16 GB of memory, without exploiting multi-core parallelization. In the implementation of CSISCN, the running time was divided into three parts: calculation of the similarity matrix, eigendecomposition, and k-means. The total runtime of CSISCN for different numbers of molecular subtypes is reported in Table 3. The results suggested that CSISCN was computationally faster than the spectral clustering method (Table 3).
Discussion
The identification of molecular subtypes is critical to the development of therapeutic strategies and to the understanding of the significant heterogeneity among cancer patients. In this analysis, our hypothesis was that spectral clustering could identify molecular subtypes that correlate with survival outcomes. We developed a subtype identification method for identifying molecular subtypes with improved clinical and molecular relevance. CSISCN was then applied to different types of cancer to identify molecular subtypes and demonstrated superior performance compared with the consensus clustering and spectral clustering methods.
In our analysis, we used quantile normalization across the experiments to make the distributions of all samples comparable. However, a strong batch effect remained after this processing step. Importantly, further application of a gene-wise z-score transformation to each dataset separately effectively reduced the batch effect. Because unsupervised clustering can summarize and explain key features of the classes to which the data belong, we applied spectral clustering using Nyström approximation for the discovery of molecular subtypes. This method is designed to capture the underlying cluster structure in a lower-dimensional representation of the data28, 42. Specifically, it discards structures that are dominated by the arbitrariness of sample noise and are characteristic of over-fitting in unsupervised learning28, 42. The results demonstrated that CSISCN was able to achieve significantly better performance for three cancer types. Compared with consensus clustering, CSISCN used the pairwise similarities of samples and a smaller subset of the dense similarity matrix, and thereby achieved significantly better performance in the identification of molecular subtypes. Indeed, spectral clustering using Nyström approximation samples columns of the affinity matrix and approximates the full matrix using correlations between the sampled columns and the remaining columns24, which distinguishes it from the general spectral clustering method. Importantly, the Nyström method, a sampling-based spectral decomposition technique, provides a powerful alternative for approximate spectral decomposition. Such techniques often operate on a small part of the original matrix and eliminate the need to store the full matrix43.
While the general spectral clustering method needs to construct an adjacency matrix and calculate the eigendecomposition of the corresponding Laplacian matrix, the Nyström approximation is typically used to compute an approximate solution of the eigenproblem efficiently. Spectral clustering relies mainly on the manifold assumption, and this assumption does not always hold when a low-dimensional manifold must be identified in high-dimensional data. In fact, the clustering performance of spectral clustering degrades, and can even become worse than that of k-means clustering, when high-dimensional data do not clearly display a low-dimensional manifold structure44. In this analysis, spectral clustering using Nyström approximation was applied to discover the underlying cluster structure, a lower-dimensional representation of high-dimensional gene expression data, and thus to identify the molecular subtypes of cancer. In our study, we noticed that the performance gain of CSISCN over the two general clustering methods differed across values of k. Interestingly, the performance gain was very large for nine and ten clusters in GSE25055 and GSE25065 (Table 2), which suggests that CSISCN has great potential for large numbers of clusters. Moreover, we repeated the parameter selection ten times where possible to obtain a more robust estimate. A closer look at the results of spectral clustering using Nyström approximation showed that the performance could be very similar (or equal) across the ten runs for an identical parameter value.
However, our findings come with some caveats. Our analysis is restricted by the availability of genomic data for cancer patients. Moreover, we noticed some exceptions in the relative performance of CSISCN and consensus clustering in terms of log-rank p-values (Table 2). Specifically, for each cancer type, CSISCN yielded different log-rank p-values on the training dataset and on the test dataset. One possible explanation is genuine biological differences between patient cohorts. For example, in the AML study, the training dataset GSE12417 was from a US population while the test dataset GSE10358 was from a European population. Another possible explanation is that different class proportions in the training and test datasets could bias clustering performance. For example, in the breast cancer study, the ratio of non-recurrence to recurrence patients is 3.7:1 in GSE25055 and 1.9:1 in GSE6532. This problem is common in microarray studies with small sample sizes.
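Because the comparison above relies on log-rank p-values, a small worked sketch of the statistic may help. It is a simplified two-group log-rank test written with the standard library only, whereas the study compares k subtypes; the function name `logrank_test`, the 0/1 group coding, and the toy survival times are assumptions for illustration and are not data from the paper.

```python
import math


def logrank_test(times, events, groups):
    """Two-group log-rank test.

    times  : follow-up time per patient
    events : 1 if the event was observed, 0 if censored
    groups : 0 or 1 (cluster label)
    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    records = list(zip(times, events, groups))
    event_times = sorted({t for t, e, _ in records if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for ti, _, _ in records if ti >= t)                # at risk, both groups
        n1 = sum(1 for ti, _, g in records if ti >= t and g == 1)    # at risk, group 1
        d = sum(1 for ti, e, _ in records if ti == t and e == 1)     # events at time t
        d1 = sum(1 for ti, e, g in records if ti == t and e == 1 and g == 1)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the group-1 event count at time t.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance if variance > 0 else 0.0
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy cohort (hypothetical): group 0 has early events, group 1 has late events.
toy_times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_events = [1] * 10
toy_groups = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
chi2_separated, p_separated = logrank_test(toy_times, toy_events, toy_groups)

# Two groups with identical survival times give no evidence of a difference.
chi2_same, p_same = logrank_test([1, 2, 3, 4, 1, 2, 3, 4], [1] * 8, [0, 0, 0, 0, 1, 1, 1, 1])
```

The expected event counts, and hence the p-value, depend on the group sizes and on the number of events, which is one reason why cohorts with different class proportions can yield different p-values for a similar separation.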
As more gene expression data become available for different types of cancer, CSISCN could bridge unsupervised learning methods and accurate subtype discovery tools for the identification of cancer molecular subtypes. In summary, CSISCN shows great potential for the discovery of molecular subtypes in human cancers.
References
Alizadeh, A. A. et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511 (2000).
Van’t Veer, L. J. et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536 (2002).
Puente, X. S. et al. Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature 475, 101–105 (2011).
Kan, Z. et al. Whole-genome sequencing identifies recurrent mutations in hepatocellular carcinoma. Genome research 23, 1422–1433 (2013).
Chmielecki, J. et al. Whole-exome sequencing identifies a recurrent NAB2-STAT6 fusion in solitary fibrous tumors. Nature genetics 45, 131–132, doi:10.1038/ng.2522 (2013).
Ramaswamy, S., Ross, K. N., Lander, E. S. & Golub, T. R. A molecular signature of metastasis in primary solid tumors. Nature genetics 33, 49–54 (2003).
Volinia, S. & Croce, C. M. Prognostic microRNA/mRNA signature from the integrated analysis of patients with invasive breast cancer. Proceedings of the National Academy of Sciences 110, 7413–7417 (2013).
Marisa, L. et al. Gene expression classification of colon cancer into molecular subtypes: characterization, validation, and prognostic value. PLoS Med 10, e1001453 (2013).
Cho, J. Y. et al. Gene expression signature–based prognostic risk score in gastric cancer. Clinical Cancer Research 17, 1850–1857 (2011).
Sahlberg, K. K. et al. A serum microRNA signature predicts tumor relapse and survival in triple-negative breast cancer patients. Clinical Cancer Research 21, 1207–1214 (2015).
Bild, A. H. et al. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439, 353–357 (2006).
Wood, L. D. et al. The genomic landscapes of human breast and colorectal cancers. Science 318, 1108–1113, doi:10.1126/science.1145720 (2007).
Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218, doi:10.1038/nature12213 (2013).
Van’t Veer, L. J. & Bernards, R. Enabling personalized cancer medicine through analysis of gene-expression patterns. Nature 452, 564–570 (2008).
Rives, A. W. & Galitski, T. Modular organization of cellular networks. Proceedings of the National Academy of Sciences 100, 1128–1133 (2003).
Perou, C. M. et al. Molecular portraits of human breast tumours. Nature 406, 747–752 (2000).
Lehmann, B. D. et al. Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. The Journal of clinical investigation 121, 2750–2767, doi:10.1172/JCI45014 (2011).
Souto, M. C. D., Costa, I. G. & Araujo, D. S. D. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics 9, 497 (2008).
Monti, S., Tamayo, P., Mesirov, J. & Golub, T. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine learning 52, 91–118 (2003).
Yu, Z., Wong, H.-S. & Wang, H. Graph-based consensus clustering for class discovery from gene expression data. Bioinformatics 23, 2888–2896 (2007).
Guinney, J. et al. The consensus molecular subtypes of colorectal cancer. Nature medicine (2015).
Ng, A. Y., Jordan, M. I. & Weiss, Y. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems 2, 849–856 (2002).
Shi, J. & Malik, J. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on 22, 888–905 (2000).
Fowlkes, C., Belongie, S., Chung, F. & Malik, J. Spectral grouping using the Nystrom method. Pattern Analysis and Machine Intelligence, IEEE Transactions on 26, 214–225 (2004).
Dhillon, I. S. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. 269–274 (ACM) (2001).
Bach, F. R. & Jordan, M. I. Learning Spectral Clustering. Advances in Neural Information Processing Systems 16, 2006 (2004).
Belabbas, M.-A. & Wolfe, P. J. Spectral methods in machine learning and new strategies for very large datasets. Proceedings of the National Academy of Sciences of the United States of America 106, 369–374 (2009).
Chen, W.-Y., Song, Y., Bai, H., Lin, C.-J. & Chang, E. Y. Parallel spectral clustering in distributed systems. Pattern Analysis and Machine Intelligence, IEEE Transactions on 33, 568–586 (2011).
Ding, S., Jia, H. & Shi, Z. Spectral clustering algorithm based on adaptive Nyström sampling for big data analysis. J Softw 25, 2037–2049 (2014).
Hatzis, C. et al. A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer. Jama 305, 1873–1881 (2011).
Loi, S. et al. Definition of clinically distinct molecular subtypes in estrogen receptor–positive breast carcinomas through genomic grade. Journal of clinical oncology 25, 1239–1246 (2007).
The Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).
Smith, J. J. et al. Experimentally derived metastasis gene expression profile predicts recurrence and death in patients with colon cancer. Gastroenterology 138, 958–968 (2010).
Freeman, T. J. et al. Smad4-mediated signaling inhibits intestinal neoplasia by inhibiting expression of β-catenin. Gastroenterology 142, 562–571 e562 (2012).
Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. The New England journal of medicine 368, 2059–2074, doi:10.1056/NEJMoa1301689 (2013).
Metzeler, K. H. et al. An 86-probe-set gene-expression signature predicts survival in cytogenetically normal acute myeloid leukemia. Blood 112, 4193–4201 (2008).
Tomasson, M. H. et al. Somatic mutations and germline sequence variants in the expressed tyrosine kinase genes of patients with de novo acute myeloid leukemia. Blood 111, 4797–4808 (2008).
Irizarry, R. A. et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264 (2003).
Asur, S., Ucar, D. & Parthasarathy, S. An ensemble framework for clustering protein-protein interaction networks. Bioinformatics 23, i29–40 (2007).
Damrauer, J. S. et al. Intrinsic subtypes of high-grade bladder cancer reflect the hallmarks of breast cancer biology. Proceedings of the National Academy of Sciences of the United States of America 111, 3110–3115 (2014).
Wilkerson, M. D. & Hayes, D. N. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics 26, 1572–1573 (2010).
Luo, J., Jiao, L. & Lozano, J. A. A Sparse Spectral Clustering Framework via Multiobjective Evolutionary Algorithm. IEEE Transactions on Evolutionary Computation 20, 418–433 (2016).
Kumar, S., Mohri, M. & Talwalkar, A. In International Conference on Machine Learning, ICML, Montreal, Quebec, Canada, June, 70 (2009).
Nie, F., Zeng, Z., Tsang, I. W., Xu, D. & Zhang, C. Spectral embedded clustering: a framework for in-sample and out-of-sample spectral clustering. IEEE Transactions on Neural Networks 22, 1796–1808 (2011).
Acknowledgements
This work was supported by grants from the National Natural Science Foundation of China (Nos. 61572166, 61371153).
Author information
Contributions
Conceived and designed the experiments: M.S. Performed the experiments: M.S. Analyzed the data: M.S. and G.X. Contributed reagents/materials/analysis tools: M.S. Wrote the paper: M.S.
Ethics declarations
Competing Interests
The authors declare that they have no competing interests.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Shi, M., Xu, G. Spectral clustering using Nyström approximation for the accurate identification of cancer molecular subtypes. Sci Rep 7, 4896 (2017). https://doi.org/10.1038/s41598-017-05275-3
DOI: https://doi.org/10.1038/s41598-017-05275-3