Background

Molecular subtyping is instrumental towards selection of model systems for fundamental research in tumour pathogenesis, and for clinical assessment of patients. To date, four molecular subtypes of Medulloblastoma (MB) have been established: SHH, WNT, Group3, Group4. The Group3 and Group4 MB subtypes are the least characterized, most aggressive, and have the poorest prognosis [1]. Model systems, including MB cell lines and genetically engineered mouse models [2], are being continually developed with the goal of studying MB subtype disease origins and signaling pathways. However, understanding the degree to which these model systems recapitulate Human MB subtypes remains the greatest challenge, especially for poorly characterized subtypes. In particular, many of the developed models have been predicted belong to the SHH subtype, with few models identified as recapitulating the Group3 or WNT phenotypes [3].

The lack of a versatile and personalized classification system hinders effective assessment of MB patients, and fundamental research into subtype-specific pathogenesis using model systems. To address these issues we developed a novel Medullo-Model To Subtypes (MM2S) classifier that matches individual gene expression profiles from MB samples against well-established molecular subtypes [4]. The MM2S algorithm is advantageous over existing MB-subtyping algorithms [3] by providing single-sample classifications while eradicating the need for a reference sample (e.g., human cerebellum) or sample replicates to generate predictions. MM2S design relies on a flexible, systems-based approach that makes it extensible and easily applicable across MB patients, human cell lines, and mouse models. We previously demonstrated MM2S extensibility and effectiveness across the largest meta-analysis of human MB patients, cell lines, and mouse samples to date [4]. In order to provide the scientific community with an easy-to-use and fully documented implementation of our flexible MB classifier we developed a new R package, MM2S, which implements the MM2S algorithm across human MB patients and model systems.

Implementation

Training and development of the MM2S classification algorithm and hyperparameters has been previously described in detail [4], and the overall analysis design is provided in Additional file 1: Figure S1. Briefly, MM2S is trained on a set of 347 normal and tumor human MB samples pertaining to the SHH, Group3, and Group4 MB subtypes. Single-sample Gene Set Enrichment Analysis (ssGSEA) is conducted on mouse and human expression profiles using species-specific GMT files that were generated from common Gene Ontology Biological Processes (GO BP) genesets between human and mouse. Following ssGSEA, an ssGSEA-ranked matrix is generated from subtype-discriminative genesets by ranking genesets in descending order of their ES scores for each sample. To account for platform differences across test samples, we introduced an additional step that filters for common genesets between the test sample and human, prior to generating ssGSEA-ranked matrices for predictions. A k-nearest neighbor (KNN) classification uses the ssGSEA-ranked matrix and the 5 nearest neighbors of a given sample to make subtype predictions.

We have developed two main functions (MM2S.human and MM2S.mouse) that apply the MM2S algorithm towards human primary tumours and cell lines, and MB mouse models, respectively (Fig. 1). We ensured a standardized output format that facilitates graphical rendering of the MM2S predictions in a variety of contexts (Fig. 1). We have introduced multiple functions that combine both sample-centric and subtype-centric views of the MM2S output. The sample-centric views (using the functions PredictionsHeatmap, PredictionsBarplot and PCARender) are easily interpretable and facilitate association of a particular Human MB subtype to normalized gene expression values for a given sample. High-confidence predictions (≥80 % of votes) are indicative of a corresponding human subtype, and lower predictions indicate an intermediate genotype. Where a large number of sample replicates are tested simultaneously, subtype-centric views (using the functions PredictionsDistributionPie and PredictionsDistributionBoxplot) indicate the majority subtype and consensus predictions across all replicates.

Fig. 1
figure 1

Overview of the MM2S package and its applications for MB subtypes of patient tumour samples and MB mouse models. A test sample (circled black star) representing normalized gene expression from human or mouse datasets is run using either of the MM2S.human or MM2S.mouse prediction functions, respectively. The MM2S prediction algorithm uses an ssGSEA and KNN-based approach to determine the MB subtype of a given sample, by looking at its 5 closest MB neighbors in 3-dimensional space. A selected number of functions can render the MM2S output in terms of sample-centric or subtype-centric views. The PredictionsHeatmap provides a heatmap representation of MM2S confidence predictions, for each sample, across all MB subtypes (WNT, SHH, Group, Group4, as well as Normal samples). Darker colors indicate a higher confidence and greater probability that a given sample belongs to a respective subtype. The PCARender function presents PCA plots of tested samples (purple) against the human training set (colored by subtype). This shows, in 3-dimensional space, the nearest MB samples to a given test sample, which indicates how the finalized subtype was assigned using the KNN algorithm. Subtype-centric views include PredictionsDistributionPie, which presents a pie charts of the major subtypes predicted across all the samples tested. PredictionsDistributionBoxplot highlights overall strength (in terms of MM2S confidence interval) of subtype predictions that were identified across all samples tested

Results and discussion

We have selected some examples from our previous analysis [4], to demonstrate the data reproducibility and improved data rendering capabilities of the MM2S package compared to the server implementation. MM2S is applied in two case studies involving human primary patient samples and sample replicates of the GTML mouse model. The package and underlying functions we present here are fully documented, easy to install and to incorporate into larger Medulloblastoma-driven analysis pipelines (Additional file 2: Data 1, Additional file 3: Data 2).

MM2S Prediction of Human MB Subtypes for Patient Tumour Samples

We tested here MM2S on a dataset of human patient samples from the Gene Expression Omnibus (GEO), for which subtypes are already known. The GSE37418 dataset contains 76 primary patient samples including WNT (n = 8), SHH (n = 10), Group3 (n = 16) and Group4 (n = 39), and outlier samples not pertaining to the major MB subgroups (n = 3). Using the MM2S.human function, MM2S accurately predicts patient samples across well-studied MB subtypes (WNT and SHH, 100 % accuracy), as well as the lesser-characterized Group3 (87.5 %) and Group4 (79.4 %) (Additional file 4: Table S1, Additional file 5: Table S2). The full code is provided in package vignette and in Additional file 2: Data 1. We also provide additional examples of how to process the data from NCBI GEO prior to using the MM2S.human function in Additional file 3: Data 2.

MM2S Prediction of Human MB Subtypes for the GTML Mouse Model

Using MM2S, we previously identified two genetically engineered mouse models recapitulating transcriptomic patterns of WNT and Group3 subtypes [4]. We expanded here on MM2S predictions using 20 sample replicates of the GTML mouse model. Using the MM2S.mouse function, we observed the largest number of Group3 predictions across sample replicates (Additional file 6: Table S3). A heatmap representation of MM2S predictions across GTML replicates indicates that the majority of replicates predict as Group3 with high degrees of confidence (>80 %). This is further affirmed by looking at the distribution of predicted subtypes, and the predicted strengths of all the subtype calls, across all the replicates predicted (Additional file 2: Data 1). Overall, our analysis suggests the potential for a non-SHH mouse model but cautions that some of the sample replicates tested also predict as SHH or “normal-like”. These “normal-like” samples are tumour samples that resemble normal cerebellum more than any of the four MB subtypes. Further investigations will need to be conducted on these heterogeneous samples to assess their validity for use as a Group3 mouse model.

Conclusion

We have implemented the MM2S software package for personalized classification of individual Medulloblastoma (MB) samples from human patients and corresponding model systems into published human MB subtypes. We demonstrate the relevance of MM2S to produce robust human subtype classifications for individual human patient samples, and for single-sample replicates of mouse medulloblastoma models. We highlight how our package facilitates single-sample predictions and further investigation into ambiguous genotype potentially due to tumor heterogeneity. The overall design of the MM2S packages makes it a flexible software tool for use by researchers, which would facilitate and extend the use of the MM2S in diverse computational and bioinformatics contexts.

Availability and requirements

Project Name: MM2S

Project Home Page: The R package MM2S is open source and available on CRAN <https://cran.r-project.org/web/packages/MM2S/> under the GPL-3 License. (Package source code is also available on Github at https://github.com/DGendoo and https://github.com/bhklab).

Operating System: Platform Independent

Programming Language: R

License: GPL-3