1 Introduction

Gliomas, which arise from glial cells, are the most common type of primary brain tumor in adults. They are highly heterogeneous and are classified by grade into low-grade glioma, anaplastic glioma, and glioblastoma. Glioblastoma (GBM) is the most common and aggressive type of glioma in adults, with an average patient survival of only 10 to 18 months. The established molecular classification of GBM into classical, mesenchymal, neural, and proneural subtypes is generally accepted to be related to the expression of the EGFR, NF1, and PDGFRA/IDH1 genes [1].

Imaging, specifically magnetic resonance imaging (MRI), can provide promising biomarkers that reflect underlying tumor pathology and biological function. If imaging phenotypes of GBM obtained from routine clinical MRI studies can be associated with specific gene expression signatures, quantitative imaging phenotypes could serve as non-invasive surrogates for cancer genomic events and provide valuable information for diagnosis, prognosis, and optimal treatment.

Several radiogenomic studies have been carried out for many diseases [8,9,10,11,12,13,14,15,16]. For instance, in schizophrenia, pairs of SNPs/genes and MRI features have been mapped using PLINK [8], and Parallel-ICA showed promising results [9]. Batmanghelich et al. [10] proposed a Bayesian framework that relates imaging and genetic data to phenotypes in Alzheimer’s disease by exploiting the connections among these data modalities simultaneously. Recently, connectomic features have been correlated with genes known to be related to the progression of Alzheimer’s disease [11]. In contrast to Alzheimer’s disease and schizophrenia, glioma lesions are generally not spread all over the brain, and local features from MRI can be used. An imaging-genomic study [12], using the tumor volume in T2-weighted FLuid-Attenuated Inversion Recovery (T2-FLAIR) images together with large-scale gene and micro-RNA expression probes, demonstrated the potential for molecular subtyping and showed, using ANOVA and the Tukey-Kramer test, that high median expression of the POSTN gene is associated with a significant decrease in survival. Other studies [13, 14] showed correlations between image feature annotations and the expression of genes associated with the glioma molecular subtypes [1]. Specifically, Gutman et al. [13] found a significant association between contrast-enhanced tumor and these molecular subtypes [1]: the proneural type, characterized by the PDGFRA/IDH1 genes, showed low levels of contrast enhancement, whereas the classical type (i.e., primarily described by EGFR amplification) correlated with an increased percentage of contrast enhancement. That study used Fisher’s exact test.

Recent population-based studies have assessed the anatomical location of GBM in relation to distinct clinically relevant molecular characteristics, and have found the spatial distribution of the tumors to be descriptive of their molecular status [14, 17,18,19,20,21,22]. Furthermore, the emerging research direction of radiomics has shown that texture analysis of the various tumor sub-regions in radiographic imaging can also be informative of the tumor’s molecular characterization [23,24,25]. In addition, using MRI features of GBM lesions, including texture and shape features, Haruka et al. proposed an imaging-based classification method and found three clusters of GBM patients [35]. Their method integrates copy number and gene expression data to estimate molecular pathway activity, and shows that the three clusters differ not only in molecular characteristics but also in survival probabilities.

The purpose of this paper is to identify significant associations between gene expressions, across the whole genome, and quantitative imaging phenomic features extracted from multi-modal MRI brain scans of patients diagnosed with de novo primary GBM. In line with the aforementioned studies, we focus on evaluating the spatial location and texture features of GBM and investigate their associations with gene expressions.

2 Materials and Methods

2.1 Data

For the quantitative association analysis conducted here, we used a retrospective cohort of 135 de novo primary GBM patients from the TCGA-GBM collection [6], with pre-operative multi-modal MRI scans available in The Cancer Imaging Archive (TCIA) [7] and corresponding molecular characterization in The Cancer Genome Atlas (TCGA). The multi-modal MRI data comprise native (T1) and post-contrast T1-weighted (T1Gd), T2-weighted (T2), and T2-FLAIR modalities. The TCGA-GBM subset of 135 patients was identified by Bakas et al. [4] as brain scans without any surgically imposed cavity, and their co-registered and skull-stripped images were provided in the TCIA Analysis Results together with expert manually annotated segmentation labels for the various histologically distinct tumor sub-regions, i.e., enhancing tumor (ET), non-enhancing tumor (NET), and peritumoral edematous/invaded tissue (ED) (Fig. 1) [4, 5]. The sample size was reduced to 88 patients after retaining only those with both available imaging [6] and corresponding gene expression data. In total, we assessed expression energies for 17815 genes, 11 distinct descriptors of tumor spatial location (Fig. 2), and 517 radiomic/texture features (Fig. 2) for each patient’s brain tumor scan [2, 4, 5].

Fig. 1. Example of a multi-modal MRI brain scan and its corresponding expert segmentation labels.

2.2 Quantitative Imaging Phenomic Features

Radiomic/Texture Features. We extracted an extensive panel of quantitative texture features, volumetrically (in 3D), for each tumor sub-region as provided by the expert annotations, across all available modalities. Specifically, the texture features we evaluated (i) capture global characteristics (i.e., variance, skewness, kurtosis) of each sub-region’s intensity distribution on each modality, and (ii) include features based on Gray-Level Co-occurrence Matrix (GLCM) [26] (Fig. 2), Gray-Level Run-Length Matrix (GLRLM) [27,28,29,30], Gray-Level Size Zone Matrix (GLSZM) [28,29,30], and Neighborhood Gray-Tone Difference Matrix (NGTDM) [31].
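To make the co-occurrence idea concrete, the following is a minimal sketch of a GLCM and one feature derived from it (contrast). It is not the extraction code used in the study: it works on a 2D toy image rather than a 3D sub-region, uses a single hypothetical offset, and the function names `glcm` and `glcm_contrast` are illustrative assumptions.

```python
from collections import Counter

def glcm(image, offset=(0, 1)):
    """Normalized, non-symmetric gray-level co-occurrence matrix as a dict.

    image  : list of rows of integer gray levels (2D toy; the study works in 3D)
    offset : (row_step, col_step) from a reference pixel to its neighbor
    Assumes at least one valid pixel pair exists for the given offset.
    """
    row_step, col_step = offset
    counts = Counter()
    for r, row in enumerate(image):
        for c, level in enumerate(row):
            rr, cc = r + row_step, c + col_step
            if 0 <= rr < len(image) and 0 <= cc < len(image[rr]):
                counts[(level, image[rr][cc])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def glcm_contrast(p):
    """Contrast: sum over gray-level pairs of p(i, j) * (i - j)^2."""
    return sum(prob * (i - j) ** 2 for (i, j), prob in p.items())

# Toy 3x3 image with three gray levels (illustrative values only).
toy = [[0, 0, 1],
       [1, 2, 2],
       [0, 1, 2]]
p = glcm(toy)                  # horizontal neighbor, one pixel to the right
contrast = glcm_contrast(p)
```

In practice such matrices are usually accumulated over several directions and distances, and many features besides contrast are read off each one; the sketch shows only the basic counting step.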

Spatial Distribution Patterns. Beyond texture features, we collected discrete spatial information about the anatomical location of each tumor on each brain scan (Fig. 2). To obtain these spatial distribution patterns we registered all brain tumor scans in a standardized healthy atlas space using an iterative Expectation-Maximization framework [3], while incorporating a biophysical tumor growth model (based on a reaction-diffusion-advection model [32,33,34]) to account for tumor mass effects in the brain parenchyma. We then retrieved the spatial distribution of each tumor according to the discretized anatomical locations of the (i) specific lobes (i.e., frontal, temporal, parietal, occipital), (ii) insula, (iii) basal ganglia, (iv) fornix, (v) cerebellum, and (vi) brain stem. In addition, we also included as distinct features the distances of (i) the tumor core (defined as the union of ET and NET), and (ii) the ED, from the ventricles.

To produce these quantitative features we used GLISTRboost. Specifically, while producing segmentations of the various tumor sub-regions, the generative part [37] of GLISTRboost, following an Expectation-Maximization framework, registers a healthy-population probabilistic atlas to the glioma patients’ brain scans while incorporating a biophysical glioma growth model to account for mass effects. Then, after mapping the predicted segmentation into the healthy atlas space, the percentage of the tumor core (i.e., enhancing and non-enhancing tumor) is calculated for each of the brain lobes in this healthy atlas.
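The final step, computing the share of the tumor core that falls in each atlas region, can be sketched as follows. This is only an illustration under stated assumptions: the segmentation is taken to be already mapped into atlas space, voxels are given as flat sequences, "percentage" is interpreted as the fraction of core voxels per region, and the function name and toy labels are hypothetical rather than part of GLISTRboost.

```python
from collections import Counter

def core_percentage_per_region(region_labels, core_mask):
    """Percentage of tumor-core voxels that fall in each atlas region.

    region_labels : flat sequence with one atlas region name per voxel
    core_mask     : flat sequence of 0/1 flags, 1 where the voxel is tumor core
    """
    if len(region_labels) != len(core_mask):
        raise ValueError("label map and mask must cover the same voxels")
    hits = Counter(label for label, flag in zip(region_labels, core_mask) if flag)
    total = sum(hits.values())
    if total == 0:
        return {}  # no tumor core in this scan
    return {region: 100.0 * n / total for region, n in hits.items()}

# Six-voxel toy volume (illustrative values only).
labels = ["frontal", "frontal", "temporal", "temporal", "insula", "parietal"]
mask = [1, 1, 1, 0, 1, 0]
percentages = core_percentage_per_region(labels, mask)
```

Regions that contain no core voxels are simply absent from the result; a caller that needs a fixed-length feature vector would fill those with zero.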

2.3 Data Analysis

Initially, we combined the two types of data (imaging and genetics) using the patient ID as the key column. In the first stage, we used the gene expressions and the spatial distribution patterns to perform a non-parametric test of association. To assess the associations, we computed the Spearman correlation coefficient (\(r_s\)) between each gene expression, individually, and each of the spatial distribution patterns described in Sect. 2.2. We then assessed the significance of the correlation coefficient by calculating the p-values as described below.
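The merging step amounts to an inner join on the patient ID, which is why only patients with both imaging and expression data remain. A minimal sketch, assuming each table is held as a dictionary keyed by patient ID; the IDs, feature names, and the helper `merge_on_patient_id` are hypothetical.

```python
def merge_on_patient_id(imaging, expression):
    """Inner join of two {patient_id: {name: value}} tables on patient ID."""
    shared = sorted(imaging.keys() & expression.keys())
    return {pid: {"features": imaging[pid], "genes": expression[pid]}
            for pid in shared}

# Toy tables (illustrative IDs and values only).
imaging = {
    "P01": {"frontal": 62.0},
    "P02": {"frontal": 0.0},
    "P03": {"frontal": 15.5},
}
expression = {
    "P02": {"GENE_A": 7.1},
    "P03": {"GENE_A": 5.4},
    "P04": {"GENE_A": 6.3},
}
merged = merge_on_patient_id(imaging, expression)  # keeps P02 and P03 only
```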

Fig. 2. Illustrative examples of spatial distribution (left) and texture (right) patterns.

Fig. 3. Schematic representation of the study’s analysis workflow.

For each quantitative feature and each gene, we obtained the p-value associated with the Spearman correlation coefficient test statistic, that is, the p-value of the correlation between a single gene expression and a single feature of the tumor’s location in the brain. The Spearman correlation coefficient for a given feature (y) and a given gene expression (x) is:

$$\begin{aligned} r_{s} = 1- \frac{6 \sum _{i=1}^{N} d_{i}^{2}}{N(N^{2}-1)} \end{aligned} \tag{1}$$

where \(d_i\) is the difference between the ranks of \(x_i\) and \(y_i\), and \(N = 88\) is the number of GBM patients [38]. \(r_s\) can take any real value between \(-1\) and \(+1\): \(+1\) represents a perfect positive association, \(-1\) a perfect negative association, and 0 indicates no association between the ranks of x and y. Our hypothesis of interest is:

  • \(H_0\): There is no association between the gene expression and the tumor’s feature under study

    $$\begin{aligned} \text {vs} \end{aligned}$$
  • \(H_a\): There is an association between the gene expression and the tumor’s feature under study

or, equivalently:

  • \(H_0\): \(r_s = 0\)       vs       \(H_a\): \(r_s \ne 0\)

To determine the significance of \(r_s\), one can use the t test statistic defined as

$$\begin{aligned} t_c=r_s\sqrt{\frac{N-2}{1-r_s^2}}, \end{aligned} \tag{2}$$

where \(t_c\) approximately follows Student’s t distribution with \(N - 2\) degrees of freedom under the null hypothesis [38]. At a given significance level, the calculated value of \(t_c\) can be compared with the critical value from Student’s t distribution. The significance of \(r_s\) can also be determined using the p-value, which, for the two-sided alternative above, is twice the area under the Student’s t density from \(|t_c|\) to infinity.
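Equations (1) and (2) and the p-value computation can be sketched in a few lines of standard-library Python. This is a hedged illustration, not the study's code: Eq. (1) in this form assumes no tied ranks, the two-sided p-value is obtained as one minus the central area between \(-|t_c|\) and \(|t_c|\) (equal to twice the upper tail by symmetry) using Simpson's rule, and all function names are hypothetical. The numerical integration is adequate for p-values down to roughly \(10^{-10}\); a statistics library would normally be used instead.

```python
import math

def ranks(values):
    """Ranks 1..N of the values; assumes no ties, as Eq. (1) requires."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rs(x, y):
    """Eq. (1): r_s from the squared rank differences."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))

def t_statistic(r_s, n):
    """Eq. (2); undefined when |r_s| equals 1."""
    return r_s * math.sqrt((n - 2) / (1.0 - r_s ** 2))

def t_density(x, df):
    """Student's t probability density with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t_c, df, steps=4000):
    """Two-sided p-value via composite Simpson's rule (steps must be even)."""
    upper = abs(t_c)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    central_half = total * h / 3.0          # area between 0 and |t_c|
    return max(0.0, 1.0 - 2.0 * central_half)

# Toy data without ties (illustrative values only).
r_toy = spearman_rs([1.2, 3.4, 2.2, 5.0, 4.1], [0.3, 0.9, 0.1, 1.5, 1.1])

# Test statistic and p-value for a coefficient of 0.501 with N = 88 patients.
t_c = t_statistic(0.501, 88)
p_value = two_sided_p(t_c, df=88 - 2)
```

When ranks are tied, the usual alternative is to assign average ranks and compute the Pearson correlation of the ranks rather than use Eq. (1) directly.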

Briefly, in this first stage, the association test was conducted on six features of the tumor location (Sect. 2.2). More specifically, for each gene, we computed six p-values and then retained only the minimum p-value for that gene (see Fig. 3 for the analysis workflow). This step is referred to as meta-analysis in Fig. 3 (step (c)). All results reported in Sect. 3 use the summary statistics of the meta-analysis. Moreover, from all the association results, we excluded the genes with p-values greater than or equal to 0.05, that is, the genes with a very weak (and not significant) association with the spatial pattern, which we believe is an important phenotype. This step is referred to as (d) in Fig. 3. In the second stage, we proceeded with the genes with p-values less than 0.05 and carried out the same analysis as in the first stage, but using the radiomic features (Sect. 2.2). Table 1 shows the thresholds at both the 5% and 10% significance levels, along with the number of genes used and retained at each stage.
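The two steps described above, keeping the minimum p-value per gene and then carrying forward only genes below 0.05, can be sketched as follows. The gene names, feature names, p-values, and helper functions are hypothetical and serve only to illustrate the bookkeeping.

```python
def min_p_per_gene(p_values):
    """Map {gene: {feature: p}} to {gene: (best_feature, min_p)}."""
    return {gene: min(by_feature.items(), key=lambda item: item[1])
            for gene, by_feature in p_values.items()}

def keep_for_next_stage(summary, cutoff=0.05):
    """Genes whose minimum p-value is strictly below the cutoff."""
    return sorted(gene for gene, (_, p) in summary.items() if p < cutoff)

# Toy first-stage results (illustrative values only).
stage_one = {
    "GENE_A": {"frontal": 0.30, "temporal": 0.012, "insula": 0.44},
    "GENE_B": {"frontal": 0.51, "temporal": 0.20, "insula": 0.09},
    "GENE_C": {"frontal": 0.049, "temporal": 0.75, "insula": 0.60},
}
summary = min_p_per_gene(stage_one)        # one (feature, p) pair per gene
stage_two_genes = keep_for_next_stage(summary)
```

Recording the feature alongside the minimum p-value keeps track of which location or texture descriptor produced the reported association for each gene.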

Table 1. Number of genes, 5% and 10% thresholds used at each stage of the analysis.

It is worth mentioning that, out of the total number of genes, we were able to annotate 15009 genes and assign them to their defined physical locations in the genome. We carried out the first stage of the analysis using those genes (Table 1).

3 Results

The incidence of tumors per brain region is summarized in Table 2. The Manhattan plot of the p-values obtained from the meta-analysis is shown in Fig. 4. The plot shows two horizontal lines corresponding to the thresholds at the 5% significance level (top line) and the 10% significance level (bottom line), after correcting for multiple comparisons. The x-axis is the physical position of the genes in the genome, and the y-axis is the negative \(\log_{10}\) of the p-values. Figure 4 also shows the qq-plot of all the genes used in the association analysis. Here too, each dot corresponds to the p-value of a single gene, plotted as \(-\log_{10}\) of the p-value. A qq-plot is reported with each Manhattan plot; it compares the observed distribution of p-values (y-axis) with the expected distribution (x-axis) for the genes tested, where the diagonal line represents the null distribution.
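The coordinates of such a qq-plot can be computed as in the sketch below. It assumes the expected p-values are the uniform plotting positions \((i - 0.5)/m\) for the i-th smallest of m observed p-values, which is one common convention and not necessarily the one used for Figs. 4 and 5; the function name and the toy p-values are hypothetical.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) pairs on the -log10 scale.

    Expected values come from uniform plotting positions (i - 0.5) / m,
    an assumed convention; observed values are the sorted p-values.
    """
    m = len(p_values)
    observed = sorted(p_values)
    return [(-math.log10((i - 0.5) / m), -math.log10(p))
            for i, p in enumerate(observed, start=1)]

# Toy p-values (illustrative only).
points = qq_points([0.5, 0.01, 0.2, 0.9])
```

Points lying above the diagonal indicate p-values smaller than expected under the null distribution, which is how enrichment is read from the plot.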

Table 2. Number and percentage of patients with tumor per brain region

Fig. 4. A Manhattan plot (left) and qq-plot (right) of the associations between the tumor spatial distribution patterns and gene expressions. The plots show the meta-analysis results.

Table 3 shows only the ten most significant p-values and the corresponding genes from the first stage of the analysis. In this stage, none of the p-values was less than \(3.3 \times 10^{-6}\) or \(6.7 \times 10^{-6}\), the thresholds at the 5% and 10% significance levels (see Table 1); therefore, no gene was significantly associated with any of the features. Table 3 reports the gene symbol, its start and end positions, the associated p-value and feature, and the chromosome.
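The two thresholds quoted above are numerically consistent with dividing the significance level by the 15009 annotated genes, i.e., a Bonferroni-style correction. The text does not name the correction method, so the sketch below is an inference from the reported numbers rather than a statement of the procedure used; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold when alpha is split evenly across n_tests tests."""
    return alpha / n_tests

n_genes_stage_one = 15009  # annotated genes used in the first stage
threshold_5 = bonferroni_threshold(0.05, n_genes_stage_one)
threshold_10 = bonferroni_threshold(0.10, n_genes_stage_one)

# On the Manhattan plot these appear as horizontal lines at -log10(threshold),
# so the stricter 5% threshold is the upper line.
formatted = (f"{threshold_5:.1e}", f"{threshold_10:.1e}")
```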

Table 3. Top 10 genes: non-parametric association between genes and brain tumor location features in glioblastoma ordered according to the absolute value of \(r_s\).

We then pruned the genes used in the previous stage to a smaller set by removing the genes with p-values greater than or equal to 0.05. With the 5401 genes remaining, we proceeded to the second stage and repeated the same analysis with the texture characteristics of the tumor. The Manhattan plot and qq-plot for the texture features are shown in Fig. 5, and Table 4 shows the top 10 significant genes. In total, 37 genes were significant in this stage (at the 5% significance level).

Fig. 5. A Manhattan plot (left) and qq-plot (right) of the associations between the tumor texture features and gene expressions. The plots show the meta-analysis results.

Table 4. Top 10 significant genes associated with texture features of specific GBM sub-regions from specific modalities ordered according to the absolute value of \(r_s\).

4 Discussion

GBM is a fatal malignant disease that remains incurable. Identifying genetic risk factors that affect the tumor characteristics improves our understanding of the biological processes underlying GBM and contributes to therapeutic discovery. In this study, we proposed a framework that quantifies non-parametric correlations to test associations between gene expressions and different quantitative imaging phenomic characteristics of GBM. Our results show high genetic enrichment in the Manhattan and qq-plots, especially for the texture features (Fig. 5).

Our results highlighted several genes that are significantly associated with the tumor texture features, with LRRC46, USP38, EPGN, TUBA1C, ZNF284, IPO8, MMP7, TLL2, TRIM55, and UBAP1 as the top ten significant genes (Table 4). In total, 37 genes are significantly associated with the texture features (Fig. 5). EPGN expression is significantly associated (\(r_s=0.501\), p-value \(=6.542 \times 10^{-7}\)) with GLSZM LGZE in the T1Gd modality (Table 4). EPGN was previously reported to be one of the top ten upregulated genes after EBLN1 silencing in oligodendroglia cells [39]. Moreover, EPGN was noted in another study, by Duhem-Tonnelle et al., of the EGF ligand expression profile in glioblastoma cell lines and biopsies [40]. USP38, located on chromosome 4, is also among the significantly associated genes (\(r_s=-0.511\), p-value \(=3.648 \times 10^{-7}\)) [41]. In contrast, as illustrated in the Manhattan plot of the spatial features of the tumor (Fig. 4 and Table 3), no gene shows a significant association with any of the location features. In addition, GBM lesions in the cerebellum are quite rare in clinical settings [36], as also shown in our summary Table 2. Our study can give some insight into this rare type of GBM lesion. Nevertheless, the analysis should be repeated in future work after excluding the patients with those lesions.

5 Conclusion

As the understanding of gliomagenesis grows, several medical imaging biomarkers and genetic variations can be identified, and new hypotheses can be formed. The proposed genome-wide association framework aims at identifying differentially expressed genes that significantly correlate with various aspects of GBM. The identification of such genes may contribute to the development of targeted therapies that focus on the resistance mechanisms of individual patients.

Through the systematic testing of associations and shrinking of the number of genes at every stage, this pipeline facilitates the evaluation of various hypotheses and reduces the computational complexity. In future work, we plan to extend the study by integrating more quantitative imaging phenomic tumor characteristics, inclusive of morphological, intensity, and volumetric descriptors, as well as parameters derived by biophysical tumor growth modeling.